GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
128 stars 25 forks source link

tama_collapse error when writing variant file (IndexError: list index out of range) #42

Closed martsper closed 3 years ago

martsper commented 3 years ago

Hi Rik,

First of all, thanks for providing this wonderful tool. It is exactly what I am looking for, now I only have to get it running ;)

I am doing tama_collapse.py on a GMAP mapping file. The tool runs fine and creates nice outputs. However, it stops during creating the *_variants.txt file, and returns the following error:

Writing variant file Traceback (most recent call last): File "/home/martin/tama/tama_collapse.py", line 6216, in ref_allele = fasta_dict[scaffold][var_pos] IndexError: list index out of range

(Python 2.7.15; Biopython 1.76; Ubuntu 18.04)

I would highly appreciate any ideas.

Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

Thank you for using TAMA!

Could you send the command line you used for running TAMA?

Cheers, Richard

martsper commented 3 years ago

I use: python tama_collapse.py -s input.bam -b BAM -f reference.fna -p test -x capped

Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

For some reason there is some discrepancy between the variant position and what is represented in the reference. Did anything write out to the variants.txt? If so, could you copy/paste the last 5 lines of that file?

Do you also have a summary of all the scaffolds and lengths for the reference fasta?

Cheers, Richard

martsper commented 3 years ago

Hi Rik,

yes, 310 lines were written in the variants.txt (51,916 lines are in the input.bam file, with 46,661 mapped to reference.fna). These are the last 5 lines of test_variants.txt

NW_005195441.1 360614 M G C 5 5 NODE_23617_length_1137_cov_35.321818_g71_i4,NODE_5079_length_2639_cov_22.418909_g71_i2,NODE_34966_length_646_cov_24.824302_g71_i10,NODE_30186_length_850_cov_25.354244_g71_i8,NODE_23849_length_1127_cov_19.700917_g71_i5 NW_005195441.1 360647 M G A 5 5 NODE_23617_length_1137_cov_35.321818_g71_i4,NODE_5079_length_2639_cov_22.418909_g71_i2,NODE_34966_length_646_cov_24.824302_g71_i10,NODE_30186_length_850_cov_25.354244_g71_i8,NODE_23849_length_1127_cov_19.700917_g71_i5 NW_005195441.1 360733 M G A 5 5 NODE_23617_length_1137_cov_35.321818_g71_i4,NODE_5079_length_2639_cov_22.418909_g71_i2,NODE_34966_length_646_cov_24.824302_g71_i10,NODE_30186_length_850_cov_25.354244_g71_i8,NODE_23849_length_1127_cov_19.700917_g71_i5 NW_005195441.1 361087 M G C 5 5 NODE_23617_length_1137_cov_35.321818_g71_i4,NODE_5079_length_2639_cov_22.418909_g71_i2,NODE_34966_length_646_cov_24.824302_g71_i10,NODE_30186_length_850_cov_25.354244_g71_i8,NODE_23849_length_1127_cov_19.700917_g71_i5 NW_005195441.1 362467 M G A 5 5 NODE_23617_length_1137_cov_35.321818_g71_i4,NODE_24836_length_1081_cov_41.305556_g71_i6,NODE_33619_length_706_cov_30.431988_g71_i9,NODE_26798_length_996_cov_43.745568_g71_i7,NODE_37181_length_553_cov_34.653101_g71_i12

Best, Martin

martsper commented 3 years ago

here is a visualization of the alignment of the region that causes the error:

image

Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

I think I might know what is going on here. I think TAMA called a variant from a clipped read mapping which is technically hanging off the scaffold. However, in order to test this I will need the SAM lines for scaffold "NW_005195441.1" and the reference fasta.

If you cannot send these to me, you can try checking yourself by looking at the SAM lines for that scaffold and looking at the cigar string.

Or you can copy paste the last 10 SAM lines for that scaffold and I can have a look.

Cheers, Richard

martsper commented 3 years ago

Hi,

this sounds promising.

The scaffold can be downloaded here: https://www.ncbi.nlm.nih.gov/nuccore/NW_005195441.1/

I attached the 10 SAM lines for the above depicted 10 mapping contigs that seem to cause the problem (which are, btw, ranSPAdes assembled PacBio IsoSeq HQ isoforms + Illumina 2x75 paired end reads).

mapping.txt

Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

I just tried running TAMA Collapse using the files you provided and I did not get the error. I think the error is happening with the reads mapping to the next scaffold in the SAM file.

Could you either send me the SAM lines for the reads mapping to the next scaffold in the SAM file or could you provide the whole SAM file?

Cheers, Richard

martsper commented 3 years ago

Hi,

I really appreciate your effort to look into it. I will send you a link in private which includes the original reference genome and the GMAP bam mapping file which I used as input for tama_collapse.

Best, Martin

martsper commented 3 years ago

Hi,

I have another question, which is maybe related. I have 46,661 sequences in my bam file, which mapped to my reference genome.

Based on this, and according to what is given in the tama_collapse.py wiki, I would expect to also find 46,661 entries in the _trans_read.bed file. However, I find only 19,350 entries in this file. And similarly, I would expect that the number of entries in _trans_read.bed and *_trans_read_report.txt are equal, but they are not (19,350 vs 19,315). Is this maybe related to my above mentioned error, or did I misunderstood some of the steps of tama_collapse.py?

Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

Ok thanks for sending the files. I was able to use them to figure out what was happening. As suspected the issue is coming from reads mapping off the scaffolds. I have fixed TAMA Collapse so it just ignores these situations. Can you re-clone and try running again?

As for your new questions, I can explain:

The _trans_read.bed contains the bed lines generated for each mapped read which passes the TAMA Collapse thresholds (identity, coverage, lde). So this will be a subset of all mapped reads and is almost never all mapped reads. This file also contains the relationships between each supporting read and the transcript model it supports. If you run in no_cap mode then that means a single read can support multiple transcript models if it looks like a 5' degraded product which fits into multiple 5' longer models.

Instead the _read.txt file should contain all mapped reads in the SAM/BAM file. This file keeps track of which reads were accepted and which were discarded. It also shows reasons for discarding.

There shouldn't be any output files with this suffix "_trans_read_report.txt". I think you are referring to the _trans_report.txt file which contains collapse stats for each transcript model.

I hope this clarifies the output files relationships but please let me know if you have further questions.

Also please let me know if the updated TAMA Collapse is now working for you.

Cheers, Richard

martsper commented 3 years ago

Hi, I am out of office, will try it next week. Thanks, Martin

martsper commented 3 years ago

Hi Richard, looking good; tama_collapse.py runs now smoothly. I also start to understand the meaning and information given in the output files. I really appreciate your help. Best, Martin

GenomeRIK commented 3 years ago

Hi Martin,

Glad it works now. I have added a bit more information to the wiki to hopefully clarify the output files.

Cheers, Richard