bcgsc / pavfinder

Post Assembly Variants Finder

GTF problem with transcript_id record - KeyError: 'transcript_id' #7

Closed by af8 4 years ago

af8 commented 4 years ago

Hi,

I would like to try the fusion-bloom tools to search for fusions in human RNA-seq data in a clinical setting.

I ran into an error in the last step (pavfinder fusion):

pavfinder 1.6
Traceback (most recent call last):
  File "/home/anthony/sw/miniconda3/envs/fusion-bloom-env/bin/find_sv_transcriptome.py", line 260, in <module>
    main()
  File "/home/anthony/sw/miniconda3/envs/fusion-bloom-env/bin/find_sv_transcriptome.py", line 208, in main
    only_fusions=args.only_fusions
  File "/home/anthony/sw/miniconda3/envs/fusion-bloom-env/lib/python2.7/site-packages/pavfinder/transcriptome/sv_finder.py", line 267, in find_events
    block_matches = self.exon_mapper.map_align(aligns[0])
  File "/home/anthony/sw/miniconda3/envs/fusion-bloom-env/lib/python2.7/site-packages/pavfinder/transcriptome/exon_mapper.py", line 256, in map_align
    if not self.transcripts_dict.has_key(record.transcript_id):
  File "pysam/libctabixproxies.pyx", line 635, in pysam.libctabixproxies.GTFProxy.__getattr__
KeyError: 'transcript_id'

I am using a GTF annotation file from GENCODE, and it does contain some lines (gene features) without a transcript_id attribute.

According to the GTF format description, however, the transcript_id attribute must be present in every GTF record.

What would you suggest?
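To see which feature types lack the attribute, a quick check along these lines can help. This is only a minimal sketch: the function name `count_missing_transcript_id` and the file name in the usage note are hypothetical, and it assumes an uncompressed, tab-delimited GTF with the attributes in the ninth column.

```python
import re
from collections import Counter

# Matches an attribute such as: transcript_id "ENST00000456328.2";
TRANSCRIPT_ID = re.compile(r'\btranscript_id\s+"[^"]+"')


def count_missing_transcript_id(lines):
    """Count, per feature type, the GTF records that have no transcript_id."""
    missing = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        if not TRANSCRIPT_ID.search(fields[8]):
            missing[fields[2]] += 1  # column 3 holds the feature type
    return missing
```

For example, `with open("annotation.gtf") as fh: print(count_missing_transcript_id(fh))` would print a counter keyed by feature type; for the file described here, the gene lines are the ones expected to show up.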

Thanks, Anthony

af8 commented 4 years ago

After reading up on what a proper GTF should contain and looking at the code in pavfinder/transcriptome/transcript.py, where only exon and CDS features are loaded into the object, I filtered the GTF file, and it works fine now.
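The exact filter is not shown in this thread. One way it could be done is sketched below, assuming the goal is to keep only exon and CDS records that carry a transcript_id attribute. The function names `keep_gtf_line` and `filter_gtf` and the file paths are hypothetical, and the input is assumed to be an uncompressed GTF.

```python
import re

# Assumption: keep the two feature types that transcript.py is said to load.
KEPT_FEATURES = {"exon", "CDS"}
TRANSCRIPT_ID = re.compile(r'\btranscript_id\s+"[^"]+"')


def keep_gtf_line(line, kept_features=KEPT_FEATURES):
    """Return True if a GTF line should be written to the filtered file."""
    if line.startswith("#"):
        return True  # keep header/comment lines unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # drop blank or malformed lines
    has_transcript_id = TRANSCRIPT_ID.search(fields[8]) is not None
    return fields[2] in kept_features and has_transcript_id


def filter_gtf(in_path, out_path):
    """Copy the kept lines from in_path to out_path; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if keep_gtf_line(line):
                dst.write(line)
                kept += 1
    return kept
```

Usage would be something like `filter_gtf("gencode.gtf", "gencode.filtered.gtf")`. Since the traceback goes through pysam's tabix GTF proxy, the filtered file presumably has to be compressed and indexed again in the same way as the original before pavfinder can read it; that step is not covered here.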