OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0

Bowtie2 Alignment runs very long #17

Closed anna3106 closed 3 months ago

anna3106 commented 4 months ago

Hello,

I tried to run a ChimeraTE Mode 2 analysis and created all the input files as explained in the README. When the run starts, it creates the indices for both FASTA files. However, the alignment with Bowtie2 takes a very long time. I started it 24 hours ago using the default RAM and thread settings. So far it has only created a .sam file for the first replicate, which is slowly increasing in size but has only reached 1.8 MB after 24 hours of alignment to the referenceTE.fasta. Is this speed normal? How long does the alignment usually take within ChimeraTE? Do you have an idea what slows the alignment down?

Thanks so much, Anna

OliveiraDS-hub commented 4 months ago

Dear Anna,

This speed is definitely not normal. The alignment is performed with bowtie2, whose running time depends almost entirely on the number of threads you provide. Depending on your genome size and library size (both number of reads and read length), the default parameters can take a long time, but I would never expect it to take this long to produce a 1.8 MB SAM file.

Can you provide me more information? Such as the genome size and your library size?

In addition, what are your hardware and operating system?

Finally, have you ever used bowtie2 outside of ChimeraTE?

Thank you!
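
One way to test bowtie2 outside of ChimeraTE is to build an index from the TE FASTA and align one replicate by hand, timing each step. The sketch below only assembles the two commands; `bowtie2-build`, `-p`, `-x`, `-1`, `-2` and `-S` are standard bowtie2 options, while the file names, the index prefix and the helper names are placeholders made up for illustration, not ChimeraTE's own.

```python
import shlex
import subprocess
import time


def bowtie2_commands(fasta, index_prefix, mate1, mate2, out_sam, threads=8):
    """Build the indexing and paired-end alignment commands as argument lists."""
    build = ["bowtie2-build", str(fasta), str(index_prefix)]
    align = [
        "bowtie2",
        "-p", str(threads),        # number of alignment threads
        "-x", str(index_prefix),   # index prefix written by bowtie2-build
        "-1", str(mate1),          # first mates
        "-2", str(mate2),          # second mates
        "-S", str(out_sam),        # SAM output file
    ]
    return build, align


def run_timed(cmd):
    """Run one command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Placeholder file names (assumptions): replace them with your own paths.
build_cmd, align_cmd = bowtie2_commands(
    "refTE.fasta", "refTE_index",
    "rep1_R1.fastq.gz", "rep1_R2.fastq.gz",
    "rep1_vs_TE.sam", threads=8,
)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
# To actually run and time a step: elapsed = run_timed(align_cmd)
```

If the standalone alignment is just as slow, the bottleneck is probably in the input files or the machine rather than in ChimeraTE itself.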

anna3106 commented 4 months ago

Dear Oliveira,

Thanks for your reply. I was able to fix the bowtie2 running time: there was a small mistake in the refTE.fasta file, and after fixing it the alignment works perfectly fine. However, I have now run into the same problem as Ifengel.

[Tuesday 20/2/2024 - 17h:27]    Creating bowtie2 index for TEs...
Done!
[Tuesday 20/2/2024 - 17h:27]    Creating bowtie2 index for transcripts...
Done!
[Tuesday 20/2/2024 - 17h:59]    Perfoming bowtie2 alignment for TEs...
19767663 reads; of these:
  19767663 (100.00%) were paired; of these:
    18979601 (96.01%) aligned concordantly 0 times
    284049 (1.44%) aligned concordantly exactly 1 time
    504013 (2.55%) aligned concordantly >1 times
    ----
    18979601 pairs aligned concordantly 0 times; of these:
      19394 (0.10%) aligned discordantly 1 time
    ----
    18960207 pairs aligned 0 times concordantly or discordantly; of these:
      37920414 mates make up the pairs; of these:
        37179257 (98.05%) aligned 0 times
        538048 (1.42%) aligned exactly 1 time
        203109 (0.54%) aligned >1 times
5.96% overall alignment rate
Done!
[Tuesday 20/2/2024 - 19h:10]    Perfoming bowtie2 alignment for transcripts...
19767663 reads; of these:
  19767663 (100.00%) were paired; of these:
    3208276 (16.23%) aligned concordantly 0 times
    1742592 (8.82%) aligned concordantly exactly 1 time
    14816795 (74.95%) aligned concordantly >1 times
    ----
    3208276 pairs aligned concordantly 0 times; of these:
      54777 (1.71%) aligned discordantly 1 time
    ----
    3153499 pairs aligned 0 times concordantly or discordantly; of these:
      6306998 mates make up the pairs; of these:
        4796417 (76.05%) aligned 0 times
        265528 (4.21%) aligned exactly 1 time
        1245053 (19.74%) aligned >1 times
87.87% overall alignment rate
Done!
[Tuesday 20/2/2024 - 20h:09]    Calculating transcripts expression...
Unable to calculate expression due to low rate of alignment! Including all transcripts to the downstream analysis...
Done!
[Tuesday 20/2/2024 - 20h:10]    Identifying chimeric reads...
Done!
Done!
[Tuesday 20/2/2024 - 20h:21]    Identifying chimeric transcripts with chimeric reads evidence...
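
For reference, the two "overall alignment rate" figures in the log above can be reproduced from the counts bowtie2 prints. This is a minimal sketch, assuming the usual reading of the paired-end summary: every pair that aligned concordantly or discordantly contributes two mates, and mates from the remaining pairs are counted individually. The function and argument names are made up for illustration.

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Percentage of mates that aligned at least once (paired-end summary)."""
    aligned_mates = 2 * (conc_once + conc_multi + disc_once) + mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * pairs)


# Counts copied from the TE alignment summary.
te_rate = overall_alignment_rate(19767663, 284049, 504013, 19394, 538048, 203109)
# Counts copied from the transcript alignment summary.
tx_rate = overall_alignment_rate(19767663, 1742592, 14816795, 54777, 265528, 1245053)

print(f"TEs:         {te_rate:.2f}%")   # 5.96%
print(f"transcripts: {tx_rate:.2f}%")   # 87.87%
```

Both values match the log, so roughly 88% of mates aligned to the transcripts. The "low rate of alignment" message at the expression step therefore does not seem to be explained by the bowtie2 alignment rate alone.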

I checked the files you suggested to Ifengel. The refTE.fasta does not seem to cause any problems. However, none of the replicates has any file in the alignment/fpkm_counts/ folder.

head genes_expressed_IDs.lst:

ENST00000470827.3|ENSG00000125037.13|OTTHUMG00000128652.5|OTTHUMT00000339889.3|EMC3-203|EMC3|1584|protein_coding|
ENST00000335181.10|ENSG00000067225.21|OTTHUMG00000172709.6|OTTHUMT00000420056.2|PKM-202|PKM|2305|protein_coding|
ENST00000697140.1|ENSG00000147065.18|OTTHUMG00000021723.5|-|MSN-214|MSN|3880|protein_coding_CDS_not_defined|
ENST00000590996.6|ENSG00000188554.15|OTTHUMG00000180878.2|OTTHUMT00000453461.2|NBR1-206|NBR1|4613|protein_coding|
ENST00000394936.8|ENSG00000197746.15|OTTHUMG00000018429.6|OTTHUMT00000048553.3|PSAP-201|PSAP|2748|protein_coding|
ENST00000702822.1|ENSG00000291178.1|-|-|ENST00000702822|ENSG00000291178|617|lncRNA|
ENST00000371102.8|ENSG00000087460.29|OTTHUMG00000033069.23|OTTHUMT00000080425.2|GNAS-215|GNAS|3438|protein_coding|
ENST00000322203.7|ENSG00000198755.11|OTTHUMG00000014566.2|OTTHUMT00000040283.2|RPL10A-201|RPL10A|718|protein_coding|
ENST00000387347.2|ENSG00000210082.2|-|-|MT-RNR2-201|MT-RNR2|1559|Mt_rRNA|
ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|protein_coding|

head genes.bed

ENST00000470827.3|ENSG00000125037.13|OTTHUMG00000128652.5|OTTHUMT00000339889.3|EMC3-203|EMC3|1584|protein_coding|   250 400 A00627:534:HNHVVDSX5:2:1101:17381:1000/1    1   +
ENST00000470827.3|ENSG00000125037.13|OTTHUMG00000128652.5|OTTHUMT00000339889.3|EMC3-203|EMC3|1584|protein_coding|   322 472 A00627:534:HNHVVDSX5:2:1101:17381:1000/2    1   -
ENST00000335181.10|ENSG00000067225.21|OTTHUMG00000172709.6|OTTHUMT00000420056.2|PKM-202|PKM|2305|protein_coding|    1626    1776    A00627:534:HNHVVDSX5:2:1101:18738:1000/1    0   +
ENST00000335181.10|ENSG00000067225.21|OTTHUMG00000172709.6|OTTHUMT00000420056.2|PKM-202|PKM|2305|protein_coding|    1766    1916    A00627:534:HNHVVDSX5:2:1101:18738:1000/2    0   -
ENST00000697140.1|ENSG00000147065.18|OTTHUMG00000021723.5|-|MSN-214|MSN|3880|protein_coding_CDS_not_defined|    3035    3185    A00627:534:HNHVVDSX5:2:1101:23746:1000/1    1   -
ENST00000697140.1|ENSG00000147065.18|OTTHUMG00000021723.5|-|MSN-214|MSN|3880|protein_coding_CDS_not_defined|    2891    3041    A00627:534:HNHVVDSX5:2:1101:23746:1000/2    1   +
ENST00000590996.6|ENSG00000188554.15|OTTHUMG00000180878.2|OTTHUMT00000453461.2|NBR1-206|NBR1|4613|protein_coding|   3197    3347    A00627:534:HNHVVDSX5:2:1101:25699:1000/1    1   -
ENST00000590996.6|ENSG00000188554.15|OTTHUMG00000180878.2|OTTHUMT00000453461.2|NBR1-206|NBR1|4613|protein_coding|   3149    3299    A00627:534:HNHVVDSX5:2:1101:25699:1000/2    1   +
ENST00000394936.8|ENSG00000197746.15|OTTHUMG00000018429.6|OTTHUMT00000048553.3|PSAP-201|PSAP|2748|protein_coding|   329 479 A00627:534:HNHVVDSX5:2:1101:31575:1000/1    3   +
ENST00000394936.8|ENSG00000197746.15|OTTHUMG00000018429.6|OTTHUMT00000048553.3|PSAP-201|PSAP|2748|protein_coding|   433 583 A00627:534:HNHVVDSX5:2:1101:31575:1000/2    3   -

Is it due to the low alignment rate to the refTE? Is there any file structure issue? Is there anything I can do to check the files before running the further analysis? Are any checking steps already included at the beginning of ChimeraTE?

Thanks so much for the support, Anna
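
As a generic pre-flight check for a FASTA file such as refTE.fasta, something along these lines can flag common structural problems before a long run. It is only a sketch: it does not reproduce any check built into ChimeraTE, it does not know what the "small mistake" mentioned earlier was, and the function name and the set of accepted characters (IUPAC nucleotide codes) are assumptions.

```python
from collections import Counter

# IUPAC nucleotide codes, upper and lower case (assumed to be the accepted alphabet).
VALID = set("ACGTUNRYSWKMBDHVacgtunryswkmbdhv")


def check_fasta(lines):
    """Return a list of human-readable problems found in FASTA lines."""
    problems = []
    ids = Counter()
    current = None   # ID of the record being read
    length = 0       # sequence length of the current record
    for n, raw in enumerate(lines, 1):
        line = raw.strip()
        if line.startswith(">"):
            if current is not None and length == 0:
                problems.append(f"{current}: empty sequence")
            parts = line[1:].split()
            current = parts[0] if parts else ""
            if not current:
                problems.append(f"line {n}: header without an ID")
            ids[current] += 1
            length = 0
        elif line:
            if current is None:
                problems.append(f"line {n}: sequence before the first header")
            bad = set(line) - VALID
            if bad:
                problems.append(f"line {n}: unexpected characters {sorted(bad)}")
            length += len(line)
    if current is not None and length == 0:
        problems.append(f"{current}: empty sequence")
    for seq_id, count in ids.items():
        if count > 1:
            problems.append(f"{seq_id}: duplicated ID ({count} records)")
    return problems


# Small in-memory example; for a real file use: check_fasta(open("refTE.fasta"))
example = [">TE1", "ACGT", ">TE1", "AC-T", ">TE2"]
for problem in check_fasta(example):
    print(problem)
```

An empty result means only that none of these particular problems was found, not that the file meets every requirement of the pipeline.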

OliveiraDS-hub commented 4 months ago

Hi Anna, thank you for the follow up.

If there is no file in your alignment/fpkm_counts/, then something definitely went wrong when quantifying gene expression with express. However, this step is just a way of removing undesired genes before looking for chimeras; after all, it is pointless to try to identify chimeric reads on non-expressed genes.

Since there is a (still unknown) issue with the gene expression quantification, ChimeraTE proceeds with all genes. This will cost you some extra running time, but the analysis should provide the same result as you would obtain with the quantification of gene expression.

After the message "Identifying chimeric transcripts with chimeric reads evidence...", do you get the same error as Ifengel? This is the error from his/her issue:

Merging coverage from different isoforms...
Traceback (most recent call last):
  File "chimTE_mode2.py", line 188, in <module>
    merging_transc()
  File "scripts/mode2_chim_transcripts.py", line 50, in merging_transc
    chimeras[['isoform','gene']] = chimeras.gene_transc.str.split("_",expand=True)
  File "/home/landeiralab/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/core/frame.py", line 3041, in __setitem__
    self._setitem_array(key, value)
  File "/home/landeiralab/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/core/frame.py", line 3067, in _setitem_array
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

If so, let's close this issue about running time and continue the discussion in Ifengel's issue.

Thank you Anna!
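
A note on the traceback above: with `expand=True`, pandas `str.split("_")` returns one column per field of the widest row, so assigning the result to exactly two columns (`isoform`, `gene`) raises this ValueError as soon as any value contains more than one underscore. The transcript headers listed earlier do contain underscores (`protein_coding`, `protein_coding_CDS_not_defined`), which may be related, although whether those headers end up in `gene_transc` is an assumption here. The plain-Python sketch below, with made-up helper names and example values, lists the values that would split into more than two fields.

```python
def split_widths(values, sep="_"):
    """Map each value to the number of fields produced by splitting on sep."""
    return {value: len(value.split(sep)) for value in values}


def too_wide(values, sep="_", expected=2):
    """Return the values that split into more fields than expected."""
    return [v for v, width in split_widths(values, sep).items() if width > expected]


# Hypothetical isoform_gene values; the second carries an extra underscore
# inside the isoform part, as a header containing "protein_coding" would.
examples = ["isoA_geneA", "isoB|protein_coding|_geneB"]

print(split_widths(examples))
print(too_wide(examples))
```

Any value reported by `too_wide` would make a two-column assignment fail. Shortening the FASTA headers so that they contain no underscore is one possible workaround, assuming the separator really collides with the header text.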

OliveiraDS-hub commented 3 months ago

Dear @anna3106

ChimeraTE has been updated to version 1.2 with several minor bug fixes. Give it a try with the updated version and let me know if you are still facing any errors.

Closing this issue; feel free to reopen it if you need.

Cheers