bcgsc / pavfinder

:mag: Post Assembly Variants Finder
Other
17 stars 5 forks source link

AttributeError: 'NoneType' object has no attribute 'get_sequence' #10

Closed yanlina0205 closed 3 years ago

yanlina0205 commented 3 years ago

I'm running PV to detect transcriptome structural variants, I have generated the , , and files corresponding to the samples. And the command is show below: find_sv_transcriptome.py --gbam c2g.bam --tbam c2t.bam --transcripts_fasta GRCh38_rna.fna --genome_index tools/share/hg38 hg38 --r2c r2c.bam contig.fa ref/GRCh38.gtf.gz hg38.fa outdir

But got the wrong message, the detail is:

Traceback (most recent call last):
  File "/xtdisk/wangqf_group/yanln/ENTER/lib/python3.8/site-packages/virtualenv/bin/find_sv_transcriptome.py", line 259, in <module>
    main()
  File "/xtdisk/wangqf_group/yanln/ENTER/lib/python3.8/site-packages/virtualenv/bin/find_sv_transcriptome.py", line 211, in main
    events['via_transcripts'], mappings['via_transcripts'] = sf.find_events(tbam,
  File "/xtdisk/wangqf_group/yanln/ENTER/lib/python3.8/site-packages/virtualenv/lib/python3.8/site-packages/pavfinder/transcriptome/sv_finder.py", line 340, in find_events
    new_aligns = self.map_partial_aligns(partial_aligns, query_fasta, target_type, max_unaligned_len=max_novel_len)
  File "/xtdisk/wangqf_group/yanln/ENTER/lib/python3.8/site-packages/virtualenv/lib/python3.8/site-packages/pavfinder/transcriptome/sv_finder.py", line 1113, in map_partial_aligns
    target_seq = transcript.get_sequence(self.genome_fasta)
AttributeError: 'NoneType' object has no attribute 'get_sequence'

However,I didn't find where discribe the attribute 'get_sequence'.

readmanchiu commented 3 years ago

I think the error happened because the transcripts_fasta and the gtf file disagree Please try generate the transcripts_fasta from the gtf using the extract_transcript_sequence.py script provided and use it for the --transcripts_fasta parameter:

extract_transcript_sequence.py ref/GRCh38.gtf.gz transcripts.fa hg38.fa --index --only_longest

Hope this solves the problem, Thanks!

yanlina0205 commented 3 years ago

Thanks for your reply. I generated the transcripts_fasta from the gtf using the extract_transcript_sequence.py script and rerun the PV to detect transcriptome structural variants. It seems that there is no wrong with the input and process:

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_target.fa
[main] Real time: 0.813 sec; CPU: 0.024 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 3 sequences (293 bp)...
[M::mem_process_seqs] Processed 3 reads in 0.001 CPU sec, 0.001 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_target.fa /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_query.fa
[main] Real time: 0.005 sec; CPU: 0.006 sec
[E::idx_find_and_load] Could not retrieve index file for '/asnas/wangqf_group/yanln/TAP/09_call_variants/partial_aln.bam'

But I'm not sure about the last sentence, maybe because it didn't build index of partial_aln.bam? Besides, maybe because I only detect transcriptome structural variants of one gene, it run so fast.

readmanchiu commented 3 years ago

So does the output sv.bedpe get generated? Do you see it with at least the header in it? Hopefully you see some events, but if it's just one gene, there may not be any.

If you are only running TAP on one target gene, that will be very fast.

Do you see the partial_target.fa, partial_query.fa and partial_aln.bam in your output directory /asnas/wangqf_group/yanln/TAP/09_call_variants/?

I'm suspecting they are all empty which is why it can't build an index. If that's the case, you can ignore the message.

yanlina0205 commented 3 years ago

Yes. The output sv.bedpe get generated and has the header in it. I find some other message when it run:

processing S302
S302:cannot map to transcript

Almost all of my contigs can't map to transcript, I guess that if it because I'm form the beginning? The target gene sequence has problem? I extract target gene by this command: bedtools getfasta -fi transcript.fasta -bed gene_mRNA.gff3 -name -fo gene_mRNA.fa Then when I create bloomfilter, I only use the gene sequence, removing the CDS and exon, maybe I'm wrong? Or I should extract target gene sequence from genome.fasta, not transcript.fasta?

readmanchiu commented 3 years ago

The cannot map to transcript message indicates the gapped alignment (c2g) of that contig cannot be mapped to a known transcript model in the supplied annotation. So if all your contigs have this message, it's likely something went wrong. How do you generate your c2g alignments? Do you use gmap to align against the reference? You can blat your contig, S302 for example, to hg38 in UCSC genome browser to see what transcript it aligns to, check if it agrees with the alignment in c2g, and see if the corresponding transcript model exists in your gtf file?

yanlina0205 commented 3 years ago

Yes, I generate my c2g alignments by this command:

gmap_build -d ensembl_hg38 genome.fasta      # build index
gsnap -d ensembl_hg38 --use-localdb=1 -t 10 -O --nofails --format=sam $contig > $out/${sample}_c2g.sam
samtools view -bS $out/${sample}_c2g.sam > $out/${sample}_c2g.bam
yanlina0205 commented 3 years ago

I blat my contig to hg38 in UCSC genome browser and it agrees with the alignment in c2g. The details are shown below: image image image But it's strange that my target gene is in chr13, but the contig is in chr6.

readmanchiu commented 3 years ago

gsnap is for read alignment, you should use gmap:

gmap -d <GENOME> -D <GMAPDB> <contigs.fa> -t <NUM_THREADS> -f samse -n 0 | samtools view -bhS - -o c2g.bam

the command is also coded in the fusion-bloom script

yanlina0205 commented 3 years ago

Thanks you. I will try it again as you say.

readmanchiu commented 3 years ago

The contig is short, i.e. not very high credibility. You can check some of your larger contigs, hopefully they should align to your target gene. If not, there may be something wrong with the read extraction step too.

yanlina0205 commented 3 years ago

I blat larger contigs to hg38 in UCSC genome browser and it's on chr13! Thank you very much for your patiently reply.

yanlina0205 commented 3 years ago

I found the problems. I didn't see the code of fusion-bloom before and run the process according to the paper. For instance, I assemble the fq with Tran-abyss, so the name of my contigs are S*. With the code of fusion-bloom, I succeeded found the mutations of target gene. At last, thank you very much, for developing the package and anwsering my questions!

readmanchiu commented 3 years ago

Great, fusion-bloom is designed for you to specifically find fusions. Since you ran TAP, you should also get the results of map_splice.py (e.g. novel_splicing.bedpe) If not, and you are also interested in novel splice variants for your target gene, you can run map_splice.py:

map_splice.py <contigs_to_genome_bam> <contigs_fasta> <gtf> <genome_fasta> <outdir> --r2c <reads_to_contigs_bam> 

you already have all the inputs for the script.

Also, tap2.py uses RNABloom instead of Trans-ABySS for the assembly step, this may be the one to use if you want to use TAP again in future.

Thanks for your interest in our software!