mcmero closed 5 years ago
Hi Marek,
Did you index (tabix) the gtf file?
Readman
Hi Readman,
Yes, the gtf file was tabix indexed. I tried regenerating the index and rerunning, and ran into the same issue.
It seems the chimera is found from the c2t alignment. If you run it with just --gbam c2g.bam, do you get the same error? It looks like your gtf was generated from UCSC refGene; I'll do some testing with that and see whether the error is related to it.
If I run with just --gbam c2g.bam, I don't get the error. Does this limit the variants pavfinder reports?
If it helps, I generated my gtf file like so:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg38 | cut -f2- | genePredToGtf -source=hg38.refGene.ucsc file stdin hg38_refseq.gtf
Thanks
The c2t alignment is mainly used to catch events missed by c2g; in most circumstances GMAP does a pretty good job of detecting chimeras. Could you try running with just --tbam c2t.bam to see if you get the error? I am really puzzled why you don't see the error with c2g alone.
I get the error if I run with only --tbam c2t.bam (without --gbam c2g.bam).
I just generated the gtf file the way you did and ran the NUP98-NSD1 fusion sequence from the PAVFinder test dataset, and had no problem:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg38 | awk '$3!~/[_M]/' | cut -f2- | genePredToGtf -source=hg38.refGene.ucsc file stdin hg38_refseq.gtf
cat hg38_refseq.gtf | sort -k1,1 -k4,4n | bgzip > hg38_refseq.gtf.gz
tabix -p gff hg38_refseq.gtf.gz
extract_transcript_sequence.py hg38_refseq.gtf.gz hg38_refseq.gtf.fa /projects/btl/rchiu/hg38.fa --index --only_longest
pavfinder fusion --tbam c2t.bam --transcripts_fasta hg38_refseq.gtf.fa --genome_index /path/to/gmapdb_sarray/hg38 hg38 test.fa hg38_refseq.gtf.gz hg38.fa out --only_fusions
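tabix expects its input to be grouped by sequence name and sorted by start position, which is what the sort -k1,1 -k4,4n step above provides. As a rough sketch (not part of PAVFinder; the function name first_unsorted_line is made up for illustration), the ordering of a plain or bgzipped GTF could be checked like this before indexing:

```python
import gzip


def first_unsorted_line(gtf_path):
    """Return the 1-based line number of the first record that breaks the
    ordering tabix relies on (each chromosome in one contiguous block,
    start positions non-decreasing within it), or None if the file is fine.
    Hypothetical helper for illustration; not a PAVFinder function."""
    opener = gzip.open if str(gtf_path).endswith(".gz") else open
    seen = set()
    prev_chrom, prev_start = None, -1
    with opener(gtf_path, "rt") as fh:
        for lineno, line in enumerate(fh, 1):
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            fields = line.split("\t")
            chrom, start = fields[0], int(fields[3])  # GTF column 4 = start
            if chrom != prev_chrom:
                if chrom in seen:
                    return lineno  # chromosome block is split
                seen.add(chrom)
                prev_chrom, prev_start = chrom, start
            elif start < prev_start:
                return lineno  # start position went backwards
            else:
                prev_start = start
    return None
```

Calling first_unsorted_line("hg38_refseq.gtf.gz") should return None for a file produced by the sort and bgzip commands above.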
The only thing I noticed about the problematic alignments is that they both align to the same HLA gene, but I fail to see how that leads to the error.
Can you check if NR_001434 is present in both the gtf and transcripts fasta? Can you post or send me the sequence of "allvars.E1.L.4703" for me to debug?
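For anyone wanting to check this for every transcript rather than one ID, here is a minimal sketch of a consistency check between the gtf and the transcripts fasta. It is not part of PAVFinder: the function names are made up for illustration, and it assumes each fasta header starts with the transcript ID as its first whitespace-delimited token, which may not match the header format of every transcriptome file.

```python
import gzip
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def gtf_transcript_ids(gtf_path):
    """Collect every transcript_id attribute value found in the GTF."""
    ids = set()
    with open_text(gtf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            match = TRANSCRIPT_ID.search(line)
            if match:
                ids.add(match.group(1))
    return ids


def fasta_id_counts(fasta_path):
    """Count how often each ID appears as the first token of a header.
    Assumption: the transcript ID is the first token after '>'."""
    counts = Counter()
    with open_text(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    counts[fields[0]] += 1
    return counts


def compare_ids(gtf_path, fasta_path):
    """Return (IDs only in GTF, IDs only in FASTA, IDs duplicated in FASTA)."""
    gtf_ids = gtf_transcript_ids(gtf_path)
    counts = fasta_id_counts(fasta_path)
    fasta_ids = set(counts)
    duplicated = {seq_id for seq_id, n in counts.items() if n > 1}
    return gtf_ids - fasta_ids, fasta_ids - gtf_ids, duplicated
```

With the file names used above, compare_ids("hg38_refseq.gtf.gz", "hg38_refseq.gtf.fa") returns three sets. IDs that appear only in the fasta, or more than once in it, point to a transcriptome file that was not built from the same annotation. IDs that appear only in the gtf are expected when the fasta holds just the longest transcript per gene.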
I tried regenerating my transcriptome reference using the extract_transcript_sequence.py script, and it's now working. It would be good to highlight this script in the documentation for future users.
Thanks for your help.
Good, I'll update the Usage page later.
I'm trying to run pavfinder's find_sv_transcriptome.py like so:
But am running into this issue:
I've checked NR_001434 and it's in both the GTF reference and transcriptome file. I also checked whether that transcript ID is duplicated in my transcriptome fasta file (it isn't). My transcriptome fasta looks something like this (with the longest transcript per gene):
And my GTF file looks like: