KeyError on transcript ID

mcmero commented 5 years ago

I'm trying to run pavfinder's find_sv_transcriptome.py like so:

find_sv_transcriptome.py \
    --gbam rnabloom_1.0.0/c2g.bam \
    --tbam rnabloom_1.0.0/c2t.bam \
    --transcripts_fasta $tx_fasta \
    --genome_index $gmap_dir hg38 \
    --r2c rnabloom_1.0.0/r2c.bam \
    --nproc 4 \
    --probe_len 100 \
    --max_novel_len 20 \
    --subseq_len 50 \
    --min_indel_flanking 10 \
    --min_indel_size 3 \
    --max_homol_len 5 \
    --min_support 4 \
    --include_non_exon_bound_fusion \
    rnabloom_1.0.0/allvars.fa \
    $gtf \
    $gn_fasta \
    pv_1.5.0 \
    --debug

But am running into this issue:

processing allvars.E1.L.4702
processing allvars.E1.L.4703
chimera1 allvars.E1.L.4703 NR_001434 804 1704 1 901 -
chimera2 allvars.E1.L.4703 NR_001434 8 528 901 1421 -
Traceback (most recent call last):
  File "/home/marek.cmero/.local/bin/find_sv_transcriptome.py", line 4, in <module>
    __import__('pkg_resources').run_script('pavfinder==1.5.0', 'find_sv_transcriptome.py')
  File "/group/bioi1/marekc/apps/conda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/group/bioi1/marekc/apps/conda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/home/marek.cmero/.local/lib/python2.7/site-packages/pavfinder-1.5.0-py2.7.egg/EGG-INFO/scripts/find_sv_transcriptome.py", line 260, in <module>
    main()
  File "/home/marek.cmero/.local/lib/python2.7/site-packages/pavfinder-1.5.0-py2.7.egg/EGG-INFO/scripts/find_sv_transcriptome.py", line 227, in main
    only_fusions=args.only_fusions
  File "/home/marek.cmero/.local/lib/python2.7/site-packages/pavfinder-1.5.0-py2.7.egg/pavfinder/transcriptome/sv_finder.py", line 314, in find_events
    events.extend(process_split_aligns([align1, align2], query_seq, genes))
  File "/home/marek.cmero/.local/lib/python2.7/site-packages/pavfinder-1.5.0-py2.7.egg/pavfinder/transcriptome/sv_finder.py", line 164, in process_split_aligns
    self.update_adj(adj, aligns, query_seq, target_type, block_matches=block_matches)
  File "/home/marek.cmero/.local/lib/python2.7/site-packages/pavfinder-1.5.0-py2.7.egg/pavfinder/transcriptome/sv_finder.py", line 731, in update_adj
    transcripts = [self.transcripts_dict[align.target] for align in adj_aligns]
KeyError: 'NR_001434'

I've checked NR_001434 and it's in both the GTF reference and transcriptome file. I also checked whether that transcript ID is duplicated in my transcriptome fasta file (it isn't). My transcriptome fasta looks something like this (with the longest transcript per gene):

>NR_075077
CGAGTAACCCGCGGAGCCAGAAGAGGGAGGAAAGGAGATGAGATTTCATCATGTTGGCCA
GCCTGGTCTCAAACTCCTGACCTCAAGTGACCCGCCTGCCTCAGCCTCCCAAAGTGCTGG

And my GTF file looks like:

chr1    hg38.refGene.ucsc       transcript      11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
chr1    hg38.refGene.ucsc       exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "1"; exon_id "NR_046018.1"; gene_name "DDX11L1";
chr1    hg38.refGene.ucsc       exon    12613   12721   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "2"; exon_id "NR_046018.2"; gene_name "DDX11L1";

readmanchiu commented 5 years ago

Hi Marek,

Did you index (tabix) the gtf file?

Readman

mcmero commented 5 years ago

Hi Readman,

Yes, the gtf file was tabix indexed. I tried regenerating the index and rerunning, and run into the same issue.

readmanchiu commented 5 years ago

Seems like the chimera is found from the c2t alignment. If you run it with just --gbam c2g.bam, do you get the same error? Looks like your gtf is generated from UCSC refgene, I'll do some testing off that and see if it's related to it.

mcmero commented 5 years ago

If I run with just --gbam c2g.bam I don't get the error. Does this limit the variants pavfinder reports?

If it helps, I generated my gtf file like so:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg38 | cut -f2- | genePredToGtf -source=hg38.refGene.ucsc file stdin hg38_refseq.gtf

Thanks

readmanchiu commented 5 years ago

The c2t alignment is mainly used to catch events missed by c2g; in most circumstances Gmap does a pretty good job of detecting chimeras. Could you try running with just —tbam c2t.bam to see if you get the error? I am really puzzled why you don’t see the error with just c2g alone

mcmero commented 5 years ago

I get the error if I run with --tbam c2t.bam only (without --gbam c2g.bam).

readmanchiu commented 5 years ago

I just generated the gtf file the way you did, and used the NUP98-NSD1 fusion sequence in th PAVFinder test dataset and has no problem:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg38 | awk '$3!~/[_M]/' | cut -f2- | genePredToGtf -source=hg38.refGene.ucsc file stdin hg38_refseq.gtf

cat hg38_refseq.gtf | sort -k1,1 -k4,4n | bgzip > hg38_refseq.gtf.gz

tabix -p gff hg38_refseq.gtf.gz

extract_transcript_sequence.py hg38_refseq.gtf.gz hg38_refseq.gtf.fa /projects/btl/rchiu/hg38.fa --index --only_longest

pavfinder fusion --tbam c2t.bam --transcripts_fasta hg38_refseq.gtf.fa --genome_index /path/to/gmapdb_sarray/hg38 hg38 test.fa hg38_refseq.gtf.gz hg38.fa out --only_fusions

The only thing I noticed from the problematic alignment is they both align to the same HLA gene, but I failed to see how it leads to the error.

Can you check if NR_001434 is present in both the gtf and transcripts fasta? Can you post or send me the sequence of "allvars.E1.L.4703" for me to debug?

mcmero commented 5 years ago

I tried regenerating my transcriptome reference using the extract_transcript_sequence.py script, and it's now working. Would be good to highlight this script in the documentation for future users.

Thanks for your help.

readmanchiu commented 5 years ago

Good, I'll update the Usage page later.

bcgsc / pavfinder

KeyError on transcript ID #3