FusionBloom KeyError - while running pavfinder

think-o commented 3 years ago

I get a KeyError while running pavfinder. Is it something to do with the gtf file format. The file I used was sorted using gff3sort and has "gene_id" and "transcript_id" as it's attributes. I am quite new to Perl and not able to interpret the error. Would be great if anyone could help me with this. The command I have used is this

fusion-bloom profile=/mnt/storage/sklab202/genome_files/fusion-bloom_at_profile left=ERR1876169_1_trim.fastq right=ERR1876169_2_trim.fastq readlen=101 outdir=Fusion_bloom-out name=at_fusion

`processing 42 processing 43 processing 44 processing 45 processing 46 processing 47 Traceback (most recent call last): File "/mnt/storage/sklab202/software/pavfinder_env/bin/find_sv_transcriptome.py", line 259, in main() File "/mnt/storage/sklab202/software/pavfinder_env/bin/find_sv_transcriptome.py", line 226, in main only_fusions=args.only_fusions File "/mnt/storage/sklab202/software/pavfinder_env/lib/python3.6/site-packages/pavfinder/transcriptome/sv_finder.py", line 297, in find_events events.extend(process_split_aligns([align1, align2], query_seq, genes)) File "/mnt/storage/sklab202/software/pavfinder_env/lib/python3.6/site-packages/pavfinder/transcriptome/sv_finder.py", line 158, in process_split_aligns self.update_adj(adj, aligns, query_seq, target_type, block_matches=block_matches) File "/mnt/storage/sklab202/software/pavfinder_env/lib/python3.6/site-packages/pavfinder/transcriptome/sv_finder.py", line 707, in update_adj transcripts = [self.transcripts_dict[align.target] for align in adj_aligns] File "/mnt/storage/sklab202/software/pavfinder_env/lib/python3.6/site-packages/pavfinder/transcriptome/sv_finder.py", line 707, in transcripts = [self.transcripts_dict[align.target] for align in adj_aligns] KeyError: 'AT3G52590.1'

real 15m6.186s user 15m0.582s sys 0m3.113s make: *** [FusionBloom-out/pavfinder/sv.bedpe] Error 1`

kmnip commented 3 years ago

Hi,

Can you grep for AT3G52590.1 and AT3G52590 on your GTF and report your results? e.g. grep AT3G52590.1 annotation.gtf grep AT3G52590 annotation.gtf

Ka Ming

readmanchiu commented 3 years ago

what @kmnip suggested is true, I suspect that in the contig alignment to the transcripts (c2t bam), AT3G52590.1 is in it (i.e it's one of the transcripts you used to align the contigs to) but it is not in the gtf you provided to pavfinder. The transcripts you used for c2t alignment must come from the gtf file you used for pavfinder analysis

to generate the transcripts fasta for alignment in the c2t step, you can run the following script (provided in the package) as pointed out in Usage:

extract_transcript_sequence.py <tabix-indexed gtf> <output transcripts fasta> <reference genome> --index --only_longest

think-o commented 3 years ago

Hi,

Can you grep for AT3G52590.1 and AT3G52590 on your GTF and report your results? e.g. grep AT3G52590.1 annotation.gtf grep AT3G52590 annotation.gtf

Ka Ming

Thanks for the reply. The transcript_id is present in the GTF file.

[sklab202@localhost ERR1876169]$ grep "AT3G52590.1" ../../genome_files/FusionBloom_lib/sorted_at_genome.gtf

[sklab202@localhost ERR1876169]$ grep "AT3G52590.1" ../../genome_files/FusionBloom_lib/sorted_at_genome.gtf 3 araport11 transcript 19505379 19506932 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 exon 19505379 19505770 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 CDS 19505668 19505770 . + 0 transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 exon 19505858 19505944 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 CDS 19505858 19505944 . + 2 transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 exon 19506030 19506132 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 CDS 19506030 19506132 . + 2 transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 exon 19506512 19506578 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 CDS 19506512 19506578 . + 1 transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 exon 19506655 19506932 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3 araport11 CDS 19506655 19506681 . + 0 transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; I have provided the gtf file in a compressed format to the pavfinder. As @readmanchiu was suggesting the transcripts must not have come from the gtf file used for pavfinder analysis. But, I do find the transcript_id in the GTF file. I do not understand where the problem is. This is the fusion bloom profile I have used. export NUM_THREADS=50 export SAMTOOLS_SORT_MEM=200G export GENOME=Arabidopsis_1 export GMAPDB=/mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb export GTF=/mnt/storage/sklab202/genome_files/FusionBloom_lib/sorted_at_genome.gtf.gz export TRANSCRIPTS_FASTA=/mnt/storage/sklab202/genome_files/Arabidopsis_thaliana.TAIR10.cdna.all.fa export GENOME_FASTA=/mnt/storage/sklab202/genome_files/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa export RNABLOOM_PARAMS='-ntcard -fpr 0.005 -chimera -extend' export PAVFINDER_PARAMS='--only_fusions --include_non_exon_bound_fusion --min_support 2'

kmnip commented 3 years ago

OK, how about grep for the same IDs on the reference transcripts fasta file? e.g.

grep -m1 'AT3G52590.1' /mnt/storage/sklab202/genome_files/Arabidopsis_thaliana.TAIR10.cdna.all.fa grep -m1 'AT3G52590' /mnt/storage/sklab202/genome_files/Arabidopsis_thaliana.TAIR10.cdna.all.fa

readmanchiu commented 3 years ago

hmmm...this is strange because the error message says that AT3G52590.1 is not in a dictionary generated from sorted_at_genome.gtf.gz did you use bgzip and tabix to compress and index the gtf?

think-o commented 3 years ago

grep -m1 'AT3G52590.1' /mnt/storage/sklab202/genome_files/Arabidopsis_thaliana.TAIR10.cdna.all.fa

@kmnip Yeah I can find it in transcripts fasta file too AT3G52590.1 cdna chromosome:TAIR10:3:19505379:19506932:1 gene:AT3G52590 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:RPL40B description:Ubiquitin-60S ribosomal protein L40-1 [Source:UniProtKB/Swiss-Prot;Acc:B9DHA6]

think-o commented 3 years ago

hmmm...this is strange because the error message says that AT3G52590.1 is not in a dictionary generated from sorted_at_genome.gtf.gz did you use bgzip and tabix to compress and index the gtf?

@readmanchiu

I have taken a gff3 file available for Arabidopsis thaliana from Ensembl Plants, converted it to gtf using gffread tool (https://github.com/gpertea/gffread) (gffread -T file.gff3 > file.gtf) . Then I have used gff3sort (https://github.com/billzt/gff3sort) to sort the file (gff3sort.pl --precise file.gtf > sorted_file.gtf) . Compressed the sorted file using bgzip. This file was indexed using tabix.

Do you think there is some problem with the generated gtf.gz file. Let me know if there are any mistakes with this

readmanchiu commented 3 years ago

ahh, I see what's the problem. In your gtf file, I saw this:

araport11 transcript 19505379 19506932 . + . transcript_id "transcript:AT3G52590.1"; gene_id "gene:AT3G52590"; 3

but the transcript_id should be AT3G52590.1, not transcript:AT3G52590.1 same issue for gene_id so yeah, I think there is a problem in converting the gff to gtf. Get rid of the "transcript:" in transcript_id and "gene:" in gene_id it should be fine

think-o commented 3 years ago

Thank you so much @readmanchiu . It's working! Although it gives some warnings and some errors, the fusions are generated. I would like to know if these can be neglected.

`GMAP version 2017-11-15 called with args: gmap.avx2 -D /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb -d Arabidopsis_1 FusionBloom-out/pavfinder/subseqs.fa -f samse -t 12
Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17 
Finished checking compiler assumptions
Pre-loading compressed genome (oligos)......done (44,875,416 bytes, 10956 pages, 0.01 sec)
Pre-loading compressed genome (bits)......done (44,875,440 bytes, 10956 pages, 0.00 sec)
Looking for index files in directory /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1
  Pointers file is Arabidopsis_1.ref153offsets64meta
  Offsets file is Arabidopsis_1.ref153offsets64strm
  Positions file is Arabidopsis_1.ref153positions
Offsets compression type: bitpack64
Allocating memory for ref offset pointers, kmer 15, interval 3...Attached new memory for /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1/Arabidopsis_1.ref153offsets64meta...done (134,217,744 bytes, 0.12 sec)
Allocating memory for ref offsets, kmer 15, interval 3...Attached new memory for /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1/Arabidopsis_1.ref153offsets64strm...done (181,880,672 bytes, 0.12 sec)
Pre-loading ref positions, kmer 15, interval 3......done (159,304,604 bytes, 0.01 sec)
Starting alignment
Processed 294 queries in 0.75 seconds (392.00 queries/sec)
Removed existing memory for shmid 262187
Removed existing memory for shmid 262186
**[E::idx_find_and_load] Could not retrieve index file for 'FusionBloom-out/pavfinder/subseqs.bam'**
25465: remove read_through-4-11777607-L-4-11782578-R - subseq 0 multimap
7636: remove read_through-5-5775722-L-5-5778456-R - subseq 0 multimap
Note: gmap.avx512 does not exist.  For faster speed, may want to compile package on an AVX512 machine
GMAP version 2017-11-15 called with args: gmap.avx2 -D /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb -d Arabidopsis_1 FusionBloom-out/pavfinder/probes.fa -n0 -f samse -t 12
Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17 
Finished checking compiler assumptions
Pre-loading compressed genome (oligos)......done (44,875,416 bytes, 10956 pages, 0.01 sec)
Pre-loading compressed genome (bits)......done (44,875,440 bytes, 10956 pages, 0.01 sec)
Looking for index files in directory /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1
  Pointers file is Arabidopsis_1.ref153offsets64meta
  Offsets file is Arabidopsis_1.ref153offsets64strm
  Positions file is Arabidopsis_1.ref153positions
Offsets compression type: bitpack64
Allocating memory for ref offset pointers, kmer 15, interval 3...Attached new memory for /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1/Arabidopsis_1.ref153offsets64meta...done (134,217,744 bytes, 0.11 sec)
Allocating memory for ref offsets, kmer 15, interval 3...Attached new memory for /mnt/storage/sklab202/genome_files/FusionBloom_lib/gmapdb/Arabidopsis_1/Arabidopsis_1.ref153offsets64strm...done (181,880,672 bytes, 0.12 sec)
Pre-loading ref positions, kmer 15, interval 3......done (159,304,604 bytes, 0.01 sec)
Starting alignment
Processed 218 queries in 0.26 seconds (838.46 queries/sec)
Removed existing memory for shmid 262189
Removed existing memory for shmid 262188
[E::idx_find_and_load] Could not retrieve index file for 'FusionBloom-out/pavfinder/probes.bam'
5850:read_through-3-4925900-L-3-4938443-R:na probe failed: fusion subseq not mapped to either of the 2 genes
47355:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
16739:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
54446:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
40418:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
2424:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
26116:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
55756:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
33731:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
6689:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
44925:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
6116:fusion-2-8599317-R-2-8610884-L:na probe failed: probe aligned exclusively 100/100=1.00
3386:read_through-4-9535355-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
6731:fusion-4-13574949-R-4-13577652-L:na probe failed: probe aligned exclusively 100/100=1.00
174:read_through-5-15381717-L-5-15384985-R:na probe failed: fusion subseq not mapped to either of the 2 genes
40908:read_through-1-26870796-L-1-26897867-R:na probe failed: fusion subseq not mapped to either of the 2 genes
26645:read_through-1-22661091-L-1-22665289-R:na probe failed: fusion subseq not mapped to either of the 2 genes
21022:read_through-4-11804455-L-4-11818227-R:na probe failed: fusion subseq not mapped to either of the 2 genes
13137:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
30147:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
14552:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
19405:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
20660:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
14090:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
44454:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
1663:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
19755:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
27347:read_through-4-17168808-L-4-17172793-R:na probe failed: fusion subseq not mapped to either of the 2 genes
9294:read_through-1-21793016-L-1-21817787-R:na probe failed: fusion subseq not mapped to either of the 2 genes
2883:read_through-4-9522090-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
1573:read_through-4-9522090-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
19095:read_through-4-9522090-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
12635:read_through-4-9522090-L-4-9542182-R:na probe failed: fusion subseq not mapped to either of the 2 genes
15545:read_through-4-17942549-L-4-17952208-R:na probe failed: fusion subseq not mapped to either of the 2 genes
3185:read_through-5-15377740-L-5-15381583-R:na probe failed: fusion subseq not mapped to either of the 2 genes
13313:read_through-5-16496185-L-5-16500116-R:na probe failed: fusion subseq not mapped to either of the 2 genes
6876:read_through-2-9774999-L-2-9779239-R:na probe failed: fusion subseq not mapped to either of the 2 genes
21860:fusion-2-10785709-R-2-10789150-L:na probe failed: probe aligned exclusively 100/100=1.00
8585:fusion-4-13858422-R-4-13862206-L:na probe failed: probe aligned exclusively 100/100=1.00
11021:fusion-2-19063628-R-2-19067365-L:na probe failed: probe aligned exclusively 100/100=1.00
61158:read_through-5-5973052-L-5-5979541-R:na probe failed: fusion subseq not mapped to either of the 2 genes
20441:read_through-5-5579580-L-5-5582460-R:na probe failed: fusion subseq not mapped to either of the 2 genes
2639:fusion-5-2824570-R-5-2828153-L:na probe failed: probe aligned exclusively 100/100=1.00
47886:fusion-1-8446147-R-1-8466799-L:na probe failed: probe aligned exclusively 100/100=1.00
62442:read_through-5-16394740-R-5-16400279-L:na probe failed: fusion subseq not mapped to either of the 2 genes
49069:read_through-5-15378045-L-5-15381808-R:na probe failed: fusion subseq not mapped to either of the 2 genes
5346:read_through-2-9771021-L-2-9775074-R:na probe failed: fusion subseq not mapped to either of the 2 genes
**/mnt/storage/sklab202/software/pavfinder_env/lib64/python3.6/site-packages/Bio/Seq.py:2983: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning,**

real    16m6.724s
user    58m7.835s
sys 0m33.945s
`

think-o commented 3 years ago

Also, I have made some changes to the fusion-bloom. The script file does not generate index files for c2g.bam and c2t.bam. This gives an error that the index files could not be loaded but pavfinder still runs. Only after a while it returns a TypeError. I have made changes such that the index files for c2g.bam and c2t.bam are also generated. This does not give the error anymore. Well, I am not sure if that is what can be interpreted from it. These are the changes I made.

# c2g
$(outdir)/c2g.bam: $(assembly_outdir)/$(name).transcripts.filtered.fa
        source $(profile) && \
        time gmap -d $(GENOME) -D $(GMAPDB) $(assembly_outdir)/$(name).transcripts.filtered.fa -t $(NUM_THREADS) -f samse -n 0 | samtools view -bhS - -o $(outdir)/c2g_unsorted.bam
        samtools sort -m $(SAMTOOLS_SORT_MEM) $(outdir)/c2g_unsorted.bam > $(outdir)/c2g.bam
        time samtools index $(outdir)/c2g.bam

# c2t
$(outdir)/c2t.bam: $(assembly_outdir)/$(name).transcripts.filtered.fa
        source $(profile) && \
        time bwa mem -t $(NUM_THREADS) $(TRANSCRIPTS_FASTA) $(assembly_outdir)/$(name).transcripts.filtered.fa | samtools view -bhS -o $(outdir)/c2t_unsorted.bam
        samtools sort -m $(SAMTOOLS_SORT_MEM) $(outdir)/c2t_unsorted.bam > $(outdir)/c2t.bam
        time samtools index $(outdir)/c2t.bam

Also, while pavfinder runs and the transcripts get processed, some transcripts do not get mapped. It says

processing "transcript_number" 
"transcript_number" : cannot map to transcript

Sometimes

processing "transcript_number"
"transcript_number" : fusion not sense

Let me know if that can be neglected For the read length I have taken the highest of the first 100 reads of both the fastq files(PE Data) and then the highest of those. Do you think that would be apt?

readmanchiu commented 3 years ago

@think-o, Glad it's working now. Regarding the warnings on bam indices that coming from parsing subseqs.bam, and also c2g.bam and c2t.bam as you mentioned, I'm a bit surprised to see the complaints because indices are not required for accessing these bams, the script is iterating through them line-by-line to extract information. Actually I'm surprised you could generate indices for c2g and c2t because they are not sorted (by reference)? Wonder which software generated the warnings like [E::idx_find_and_load], was it samtools/pysam/gmap? Will need to do some investigation but as far as I can tell, for now, you can ignore these warnings

BioPython was used to discern whether the fusion is in-frame or not. Not every contig sequence is guaranteed to be a full reconstruction of the coding sequence, so I think you can ignore that warning as well.

You can also ignore the "cannot map" and "fusion not sense" errors, as not every contig can be mapped to a transcript, and sometimes fusion can be non-sense (the two transcripts are in opposite orientation)

The read length parameter is mainly for filtering out contigs too short (shorter than a read length), so it should be fine with what you chose (you don't trust fusions capture in short contigs anyway).

I would suggest taking the contig sequence or the probe sequence of some interesting fusions reported and blat in on UCSC to see if they make sense. That's something I normally do because there is always some false positives that escape the filtering done in PAVFinder.

think-o commented 3 years ago

Thank you so much fo clearing my doubts. Yeah I will take the contig sequence and see what I get from it. Thanks for resolving my issue. Been on it from the last week.

readmanchiu commented 3 years ago

yeah, issues usually come from the discrepancy between gtf and transcripts fasta. Hope you can discover some interesting events. Will close the issue now.

bcgsc / pavfinder

FusionBloom KeyError - while running pavfinder #16