OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0
18 stars 4 forks source link

Errors in merging coverage from different isoforms with mode2 #19

Closed Lynuxoo closed 3 months ago

Lynuxoo commented 3 months ago

Hello,

Thanks for your tool!

First, I got low levels of alignment of reads against transcirpt fasta, but it still continued to work.

[Tuesday 12/3/2024 - 16h:10]    Creating bowtie2 index for transcripts...
Done!
[Tuesday 12/3/2024 - 16h:21]    Perfoming bowtie2 alignment for TEs...
39866443 reads; of these:
  39866443 (100.00%) were paired; of these:
    39705128 (99.60%) aligned concordantly 0 times
    63724 (0.16%) aligned concordantly exactly 1 time
    97591 (0.24%) aligned concordantly >1 times
    ----
    39705128 pairs aligned concordantly 0 times; of these:
      4525 (0.01%) aligned discordantly 1 time
    ----
    39700603 pairs aligned 0 times concordantly or discordantly; of these:
      79401206 mates make up the pairs; of these:
        79105850 (99.63%) aligned 0 times
        145735 (0.18%) aligned exactly 1 time
        149621 (0.19%) aligned >1 times 
0.79% overall alignment rate
[Tuesday 12/3/2024 - 16h:26]    Perfoming bowtie2 alignment for transcripts...
39866443 reads; of these:
  39866443 (100.00%) were paired; of these:
    8430841 (21.15%) aligned concordantly 0 times
    5273962 (13.23%) aligned concordantly exactly 1 time
    26161640 (65.62%) aligned concordantly >1 times
    ----
    8430841 pairs aligned concordantly 0 times; of these:
      378243 (4.49%) aligned discordantly 1 time
    ----
    8052598 pairs aligned 0 times concordantly or discordantly; of these:
      16105196 mates make up the pairs; of these:
        12899808 (80.10%) aligned 0 times
        309410 (1.92%) aligned exactly 1 time
        2895978 (17.98%) aligned >1 times
83.82% overall alignment rate
Done!
[Tuesday 12/3/2024 - 16h:56]    Calculating transcripts expression...
Unable to calculate expression due to low rate of alignment! Including all transcripts to the downstream analysis...
Done!

And I had problems when it come into the step to merge coverage from different isoforms.

[Tuesday 12/3/2024 - 16h:56]    Identifying chimeric reads...
Done!
Done!
[Tuesday 12/3/2024 - 17h:15]    Identifying chimeric transcripts with chimeric reads evidence...                                         [Wednesday 13/3/2024 - 00h:16] Merging coverage from dif
ferent isoforms...
Traceback (most recent call last):
  File "chimTE_mode2.py", line 190, in <module>
    expression()
  File "scripts/mode2_chim_transcripts.py", line 78, in expression
    fpkm_gene = pd.read_csv(str(aln_dir + '/fpkm_counts/results.xprs'), sep="\t", usecols=[1,12], names=['isoform_id', 'fpkm'])
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '/rd/huangfl/ChimeraTE/projects/brainvar/rep1/alignment/fpkm_counts/results.xprs'

I try to follow the suggestion you give in #15 . I check my format of transcript fasta file.

(chimeraTE) [huangfl@darwin ChimeraTE]$ grep ">" /rd/huangfl/ChimeraTE/data/GRCh38.p14.transcript.fasta |head
>NM-000014.6_A2M
>NM-000015.3_NAT2
>NM-000016.6_ACADM
>NM-000017.4_ACADS
>NM-000018.4_ACADVL
>NM-000019.4_ACAT1
>NM-000020.3_ACVRL1
>NM-000021.4_PSEN1
>NM-000022.4_ADA
>NM-000023.4_SGCA
(chimeraTE) [huangfl@darwin ChimeraTE]$ grep ">" /rd/huangfl/ChimeraTE/data/GRCh38.p14.transcript.fasta | sort | uniq -c | awk '$1 > 1' | head

This is my genes_expressed_IDs.lst

head /rd/huangfl/ChimeraTE/projects/test/rep1/alignment/genes_expressed_IDs.lst
NR-002312.1_RPPH1
NM-006305.4_ANP32A
NR-003051.4_RMRP
NM-015316.3_PPP1R13B
XM-006715477.3_LIN28B
NR-001445.2_RN7SK
NM-001369473.1_NFIB
NM-001324255.2_MAP1B
NM-000996.4_RPL35A
XM-047443839.1_SF3B1

And this is my head genes.bed

head /rd/huangfl/ChimeraTE/projects/test/rep1/alignment/genes.bed
NR-002312.1_RPPH1       159     260     K00162:204:HLJ2MBBXX:4:1101:4472:1226/1 42      -
NR-002312.1_RPPH1       8       109     K00162:204:HLJ2MBBXX:4:1101:4472:1226/2 42      +
NM-006305.4_ANP32A      494     595     K00162:204:HLJ2MBBXX:4:1101:4655:1226/1 16      -
NM-006305.4_ANP32A      138     239     K00162:204:HLJ2MBBXX:4:1101:4655:1226/2 16      +
NR-003051.4_RMRP        96      197     K00162:204:HLJ2MBBXX:4:1101:4838:1226/1 42      -
NR-003051.4_RMRP        14      115     K00162:204:HLJ2MBBXX:4:1101:4838:1226/2 42      +
NM-015316.3_PPP1R13B    536     630     K00162:204:HLJ2MBBXX:4:1101:5000:1226/1 0       -
XM-006715477.3_LIN28B   1760    1861    K00162:204:HLJ2MBBXX:4:1101:5244:1226/1 1       -
XM-006715477.3_LIN28B   1717    1818    K00162:204:HLJ2MBBXX:4:1101:5244:1226/2 1       +
NR-001445.2_RN7SK       143     244     K00162:204:HLJ2MBBXX:4:1101:5365:1226/1 12      -

Is there anything ununsual or the problem is just caused by my low level of alignment?

Many thanks.

OliveiraDS-hub commented 3 months ago

Dear @Lynuxoo

Your not facing exactly the same issue as the one from #15

ChimeraTE has been updated today to the version 1.2. Among several minor bugs, I've considered yours.

Now, even though you don't have generated results.xprs file, which is causing your issue, the pipeline moves on.

Closing this issue, feel free to reopen if you need

Cheers

Lynuxoo commented 3 months ago

Dear @OliveiraDS-hub

Thanks for your response and updates!

However, I 'm sorry that I have problems again with the version 1.2 so I need to reopen the issue. At this time the pipeline even don't move on.

[Friday 15/3/2024 - 11h:37]     Perfoming bowtie2 alignment for transcripts...
54063342 reads; of these:
  54063342 (100.00%) were paired; of these:
    17277796 (31.96%) aligned concordantly 0 times
    5674228 (10.50%) aligned concordantly exactly 1 time
    31111318 (57.55%) aligned concordantly >1 times
    ----
    17277796 pairs aligned concordantly 0 times; of these:
      131105 (0.76%) aligned discordantly 1 time
    ----
    17146691 pairs aligned 0 times concordantly or discordantly; of these:
      34293382 mates make up the pairs; of these:
        31495021 (91.84%) aligned 0 times
        411346 (1.20%) aligned exactly 1 time
        2387015 (6.96%) aligned >1 times
        70.87% overall alignment rate
Done!
[Friday 15/3/2024 - 13h:02]     Calculating transcripts expression...
Unable to calculate gene expression! Including all transcripts with at least one read to the downstream analysis...
Traceback (most recent call last):
  File "chimTE_mode2.py", line 179, in <module>
    alignment_func(out_dir, aln_dir, mate1, mate2)
  File "scripts/mode2_alignment.py", line 50, in alignment_func
    pd.read_csv(str(f"{aln_dir}/genes_total_expressed.bed"), header=None, sep="\t", usecols=[0],names=['gene_id']).drop_duplicates().to_csv(f"{aln_dir}/genes_expressed_IDs.lst", header=None, index=False)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/huangfl/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '/rd/huangfl/chimeraTE_1.2/ChimeraTE/projects/brainvar_mode2/P1_rep1_HSB272/alignment/genes_total_expressed.bed'

This is what I have in the ouput directory.

P1_rep1_HSB272
└── alignment
    ├── fpkm_counts
    ├── genes.bam
    ├── genes.bed
    ├── tes.bam
    └── tes.bed

Thank you again for your help.

OliveiraDS-hub commented 3 months ago

Dear @Lynuxoo

I'm so sorry for my oversight. Indeed the handling of exception was not correcly done.

Now, I've succesfully obtained your error message, and after the correction everything went smoothly.

Please, remove all mode2 scripts from the folder "scripts", and also chimTE_mode2.py from the main folder.

Download them again from github, and put them in their respective directories.

Then, run your analysis again using the same --projectname that you have used before. This will avoid the creation of index and the performing the alignments again.

Now ChimeraTE Mode 2 checks whether the output of each step has been already created.

If you follow this steps you will have this std output on your screen:

[Friday 15/3/2024 - 12h:22] Creating bowtie2 index for TEs...
TE index has been found!
Skipping...

[Friday 15/3/2024 - 12h:22] Creating bowtie2 index for transcripts...
Transcripts has been index found
Skipping...

Running analysis with ------------------------------------------> rep1

[Friday 15/3/2024 - 12h:22] Perfoming bowtie2 alignment for TEs...
TE alignment has been found!
Skipping...

[Friday 15/3/2024 - 12h:22] Perfoming bowtie2 alignment for transcripts...
TE alignment has been found!
Skipping...

[Friday 15/3/2024 - 12h:22] Calculating transcripts expression...
Expression files have been found!
Skipping...

[Friday 15/3/2024 - 12h:22] Identifying chimeric transcripts...
Chimeric transcripts file has been found!
Skipping...

[Friday 15/3/2024 - 12h:22] Merging coverage from different isoforms...

The issue is now reopened. I'll be looking forward for your answer if it has worked for you.

Thank you!

Lynuxoo commented 3 months ago

Dear @OliveiraDS-hub

Thanks for your prompt response. With your assistance, I've successfully completed the analysis of my samples using the modified script. I truly appreciate your support in navigating through the process.

I have another question regarding a prompt that appeared: Unable to calculate gene expression! Including all transcripts with at least one read to the downstream analysis... What could be the reason behind this prompt? Could it be due to a low alignment rate? The alignment rates for some of my samples are as low as forty percent, whereas previously when aligning these same samples to the reference genome, the alignment rates were around ninety percent. If this is due to my data, what impact might it have on the analysis results I obtain?

Once again, thank you for your invaluable help and for creating such a useful tool!

OliveiraDS-hub commented 3 months ago

Thank you @Lynuxoo, I hope ChimeraTE can help with your next findings!

Regarding the impact of including all transcripts instead of only those with a certain level of expression (default FPKM >= 1), this will only increase your processing time. In big genomes such as many plants and mamals, it's faster making the expression analysis and reducing the number of genes to those with expression, than computing all genes. Since very low expressed genes are not likely to produce chimeras, it's reasonable to not check for chimeric reads on them.

About your warning message, many reasons could explain that, such as unexpected special characters in the IDs of your transcripts, uncorrect strand parameters, and of course low alignment rate. In your case I doubt of any relation with alignment rate, it must be something else.

If you want to investigate the causes, we need to check the stdout and stderror message from express.

You can enable express messages changing a unique line on ChimeraTE's code. To do so, change the line 49 of the mode2_alignment.py script:

subprocess.call(['express',` '-o', str(f"{aln_dir}/fpkm_counts"), '-O', '1', '--output-align-prob', '--no-bias-correct', str('--' + str(args.strand)), str(args.transcripts), str(f"{aln_dir}/genes.bam")], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

to

subprocess.call(['express',` '-o', str(f"{aln_dir}/fpkm_counts"), '-O', '1', '--output-align-prob', '--no-bias-correct', str('--' + str(args.strand)), str(args.transcripts), str(f"{aln_dir}/genes.bam")])

Save it, run your analysis again and tell me what happens.

I'm going to close this issue, and if you want to go further about the causes of express issues, please open another issue to focus on it.

Cheers