GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
128 stars 25 forks source link

tama_flnc_polya_cleanup.py fails for "A"-only sequences #52

Closed ebioman closed 3 years ago

ebioman commented 3 years ago

Hi The script tama_flnc_polya_cleanup.py fails in my hand for some IsoSeq derived sequences. Trying to find the source I had sequences which contained only PolyA. E.g.

>transcript/113571 full_length_coverage=9;length=189
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAA

If such a sequence is part of the FASTA file given to the script it fails with the message:

Traceback (most recent call last):
  File "/tmp/tama/tama_go/sequence_cleanup/tama_flnc_polya_cleanup.py", line 153, in <module>
    this_block_obj = a_block_dict[this_block_index]
KeyError: 0

If I do modify the sequence and add e.g. a single C at the position 1 everything is fine. This is obviously pretty rare but made the script fail for some of my IsoSeq data. A more biology related issue is maybe the fact that if we add instead e.g. another nucleotide at any different spot in the sequence it works and takes everything 5' as sequence and everything 3' as tail. I guess I would argue that in this case it is much more likely that we have a wrong base-call instead.

GenomeRIK commented 3 years ago

Hello,

Apologies foe the delayed response. I don't seem to be getting the email notifications for issues.

Can you tell me how you processed this data to get to that fasta file? There really shouldn't be any sequences at that point which are pure A's.

I can think of a few ways to deal with this but I just want to understand the problem better.

Thank you, Richard