GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
125 stars 24 forks source link

TAMA collapse IndexError #126

Open mainciburu opened 7 months ago

mainciburu commented 7 months ago

Hi, I'm trying to run TAMA collapse on ONT aligned reads that have gone through the following pipeline: -Adapter trimming and reorienting with Pychopper

python tama/tama_collapse.py -s reads.filtered.sorted.sam -f $ref_genome_dir/$ref_genome -x no_cap -a 100 -z 100 -m 10 -p tama/collapse/mysample

And I'm getting this error

sam count 1530000 Traceback (most recent call last): File "tama/tama_collapse.py", line 5190, in <module> [h_count,s_count,i_count,d_count,mis_count,nomatch_dict,sj_pre_error_list,sj_post_error_list] = calc_error_rate(start_pos,cigar,seq_list,scaff_name,read_id) File "tama/tama_collapse.py", line 847, in calc_error_rate next_cig_flag = cig_char_list[i + 1] IndexError: list index out of ran

I've guess it has something to do with the cigar string and errors around splice junction. I've also found the first read in my sam that causes the error (I think there's more than one). It's this one:

99:1914|a1ac6b88-3c74-4652-8b19-b682f6e0e40d 0 NC_027836.1 1435 60 2X3=1D1X1=1X4=4I3=3X3=3X166=1X8=1X11=3D104=1I84=2D56=1X111=2X113=1X74=1I1X12=2X6=1X8=1X26=1X2=1X17=1X6=1X88=1X15=1X58=1X137=1X38=1X11=1X5=1X14=1X7=1X10=1X31=1I3=1I32=44I1X2=1I2=1D1X2=1D2=180I1=1X2=1X3=1X2=1D1=1X43=1X2=1X3=1D1=1X3=1X45=1D112=1X7D2992N * 0 0 CGACTTGGCATTAGAATTAGGCTTTGGGGCGAAAATGACTTTATTCAACAAATCATAAAGATATTGGAACATTATATTTTATTTTTGGAATTTGAGCAGGGATAGTAGGTACTTCTTTAAGTTTATTAATTCGAGCTGAATTAGGGACTCCAGGATCTTTAATTGGAGATGATCAAATTTATAATACTATTGTAGCAGCTCATACTTTTATTATATTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGACTTGTACCTTTAATATTAGGAGCCCCTGATATAGCTTTCCCACGTATAAATAATATAAGTTTTTTGACTTTTACCCCCATCTTTAACTTTATTAATTTCTAGTAGCATTGTAGAAAATGGAGCAGGAACTGGATGAACAGTTTACCCCCCTCTCCTCTAATATTGCTCATGGTGGTAGTTCAGTAGATTTAGCTATTTTCCCACTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCTATTAACTTTATTACTACTATTATTAATATACGATTAAATAATTTATCATTTGATCAAATACCTTTATTTATTTAGGCTGTAGGTATTACTGCATTCTTATTATTATTATCTTTACCTGTTTTAGCCGGAGCTATTACTATATTACTTACTGATCGAAATTTAAATACATCATTTTTCGATCCTGCAGGTGGAGGTGATCCTATTCTTTATCAACATTTATTTTGATTTTTTGGACATCCTGAAGTATATATTTTAATTTTACCAAGGATTTGGTATACATTCTCATATTATTTCCCAAGAAAGAGGTAAAAAGGAAACATTCGGGTGTTTAGGTATAATTTACGCTATACTAGCAATTGGTTTATTAGGATTTATTGTTTGAGCTCATCATATATTTACTGTAGGAATAGATATTGATACACGAGCATATTTTACATCAGCAACAATAATTATTACTGTACCAACAGGTATTAAAATTTTTAGTTGATTAGCTACTTTCCATGGAACTCAAATTAATTATTCCCCATCTATTTTATGAAGATTAGGATTTGTATTTTTATTTACTGTAGGAGGATTAACAGGTGTAATTTTATCTAATTCTTCTATTGATATTACTTTACATGATACTTACTATGTAGTTGCTCATTTCCATTATGTTTTATCAATAGGAGCTGTATTTGCTATTTTAGGGGGATTTATTCATTGATACCCATTATTTACTGGTTTATCTTCAAATCCTTATTTATTAAAAATTCAATTTTTTATTATATTTATCCGGAAGTAAATTTAACTTTCTTCCCACAACATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATGCATGCCTGGTGGTATTCTGATTATCCTGATTCTTATATTTCATGAAATATTATTTTATTATTGAATCATATATTTCATTATTAGCAATCATATTTATATTAATTATTATTTGAAATCTATAATTAATCAACGAATTTCTTTATTTACTTTAAATTTATCTTCTTCAATTGAATGATATCAAAATTTACCACCAGCTGAACATTCATATAATGAATTGCCTATTTTT * XA:Z: XC:Z:Insufficient_junction_coverage_unclassified NM:i:297

Could you help me out understand what's going on? How could I filter the sam file to get rid of problematic reads? Thank you!

ksahlin commented 7 months ago

Hi,

My bad. I am the developer of uLTRA. What you observe is an error in how uLTRA output the cigar (problematic cigar string as it ends with 112=1X7D2992N). It would take me some time to track down and fix this error. I do see that the read looks problematic.

In the meantime, perhaps it is possible to handle such errors gracefully in TAMA with a try except statemrnt and printing a warning if an error occurs in this part of the code?

Best, Kristoffer

mainciburu commented 7 months ago

Thanks a lot for the fast reply! I've been trying to make sense of this error for a few days and I would have never found the answer. When you say that read I showed looks problematic, is it a formatting issue or does it look bad quality / bad alignment...? In any case, I found and removed the reads with cigar string ending in N (it was just 3). I run TAMA again and the original error disappeared! Unfortunately, now I have a new error, probably not related to uLTRA anymore (?). I've seen it has been reported a couple of times here, but didn't get a clear idea of what to do about it in my case. It's the following:

Error with exon end earlier than exon start 738633 738633 ['105:518|689ca4ad-c28a-4bc4-aa8c-51cede2beb56', '99:514|04e01855-0675-4357-bac9-8cc36ad5497b', '133:524|25e08f18-7d09-4bb4-aab9-bfe7f4d66be7', '132:413|493618ca-d927-44c7-931e-c99a14ff3822', '105:523|9fbfd390-4964-454d-881d-3441cb994cfa', '119:412|dc5a1936-d9f9-4944-af15-ade882739d93', '121:325|54554b0c-84d6-4611-bdbf-c0fc12a887d3', '131:562|8a715b1c-eb42-4789-b79e-787b500e2b8c', '138:488|9b56f899-461f-4407-a590-8fa04b572ad4', '128:537|6655c133-3d0b-45d3-9e7d-3f1c09128914', '105:383|5c6fe978-7fba-46bb-8a79-b29c1f020d63', '108:395|79a13c40-b842-4f9f-b895-4a5fb21f61e0', '126:383|7cd9ba40-065a-4ee3-b6be-cbc70c24f81d', '139:519|20ff1d7f-5853-41bc-95ba-67141321180f', '126:415|09635947-bb94-4922-a3cf-acaff07eeac9', '130:541|45d993fe-091a-4a45-8944-adee3ab2e009', '136:549|9b0186ba-411a-4699-b259-552a1b257141', '98:311|250ea703-de96-40e3-9724-2084768bb525', '127:509|4ebcafbb-83af-4fdc-ba1b-b5523f43d734', '128:374|6026b742-a4d9-4d27-b33b-36956afd61fd', '129:379|8dea2747-d088-46f6-99a0-861dcd40c336', '129:384|2af9aaea-c4b1-477f-89e3-5b44629e65d1', '121:591|700b6c2b-320c-4a22-8c54-3327954e1be6', '126:507|de27e02f-378c-42c6-8174-4efa6d594b58', '104:411|f2bc7335-a24b-4a0d-9feb-c0835e17be5c', '136:545|e12af295-37af-4ef3-a6aa-2e895b22a41b', '129:417|68ad3d6c-e02a-4819-8f42-ca404a023041', '98:309|80ef9789-6713-4838-afd3-b33a35c91918', '136:574|f2dc187d-0b30-467b-8fbd-c0cf66bff3e5', '137:398|1c647ab7-c18a-4373-8d28-e8adb668983f', '131:532|e2e37bc5-1f95-4ecb-8ba8-b5d90f22e17e', '121:531|8bc49d2e-ac63-41c4-ab4f-77f9532d78d3']

So I still have a bunch of problematic reads... I attach the sam file with the reads mentioned in the error message here: error_subset_sam.txt I'll appreciate any hint on how to proceed. Thank you!!

ksahlin commented 7 months ago

When you say that read I showed looks problematic, is it a formatting issue or does it look bad quality / bad alignment...?

I was referring to the long homopolymer/STR. However, a mature aligner should handle these things seamlessly (diss to my own aligner). Although I now see that also other established aligners are struggling with the cigar output as reported in this repo.

I've seen it has been reported a couple of times here, but didn't get a clear idea of what to do about it in my case.

Not sure. Perhaps you can try to remove also those reads (as done here: https://github.com/GenomeRIK/tama/issues/113). The 30-40 reads affected is only a tiny subset of your data I assume(?), so shouldn't change your results notably unless they are all from some gene(s) that you are interested in.

mainciburu commented 6 months ago

Thanks for the clarification! And yeh, I ended up removing those reads too.