Closed yanlina0205 closed 3 years ago
I think the error happened because the transcripts_fasta
and the gtf file disagree
Please try generate the transcripts_fasta
from the gtf using the extract_transcript_sequence.py
script provided and use it for the --transcripts_fasta
parameter:
extract_transcript_sequence.py ref/GRCh38.gtf.gz transcripts.fa hg38.fa --index --only_longest
Hope this solves the problem, Thanks!
Thanks for your reply. I generated the transcripts_fasta from the gtf using the extract_transcript_sequence.py script and rerun the PV to detect transcriptome structural variants. It seems that there is no wrong with the input and process:
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_target.fa
[main] Real time: 0.813 sec; CPU: 0.024 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 3 sequences (293 bp)...
[M::mem_process_seqs] Processed 3 reads in 0.001 CPU sec, 0.001 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_target.fa /asnas/wangqf_group/yanln/TAP/09_call_variants/partial_query.fa
[main] Real time: 0.005 sec; CPU: 0.006 sec
[E::idx_find_and_load] Could not retrieve index file for '/asnas/wangqf_group/yanln/TAP/09_call_variants/partial_aln.bam'
But I'm not sure about the last sentence, maybe because it didn't build index of partial_aln.bam? Besides, maybe because I only detect transcriptome structural variants of one gene, it run so fast.
So does the output sv.bedpe
get generated? Do you see it with at least the header in it? Hopefully you see some events, but if it's just one gene, there may not be any.
If you are only running TAP on one target gene, that will be very fast.
Do you see the partial_target.fa
, partial_query.fa
and partial_aln.bam
in your output directory /asnas/wangqf_group/yanln/TAP/09_call_variants/
?
I'm suspecting they are all empty which is why it can't build an index. If that's the case, you can ignore the message.
Yes. The output sv.bedpe get generated and has the header in it. I find some other message when it run:
processing S302
S302:cannot map to transcript
Almost all of my contigs can't map to transcript, I guess that if it because I'm form the beginning? The target gene sequence has problem?
I extract target gene by this command:
bedtools getfasta -fi transcript.fasta -bed gene_mRNA.gff3 -name -fo gene_mRNA.fa
Then when I create bloomfilter, I only use the gene sequence, removing the CDS and exon, maybe I'm wrong?
Or I should extract target gene sequence from genome.fasta, not transcript.fasta?
The cannot map to transcript
message indicates the gapped alignment (c2g) of that contig cannot be mapped to a known transcript model in the supplied annotation.
So if all your contigs have this message, it's likely something went wrong.
How do you generate your c2g alignments? Do you use gmap to align against the reference?
You can blat your contig, S302 for example, to hg38 in UCSC genome browser to see what transcript it aligns to, check if it agrees with the alignment in c2g, and see if the corresponding transcript model exists in your gtf file?
Yes, I generate my c2g alignments by this command:
gmap_build -d ensembl_hg38 genome.fasta # build index
gsnap -d ensembl_hg38 --use-localdb=1 -t 10 -O --nofails --format=sam $contig > $out/${sample}_c2g.sam
samtools view -bS $out/${sample}_c2g.sam > $out/${sample}_c2g.bam
I blat my contig to hg38 in UCSC genome browser and it agrees with the alignment in c2g. The details are shown below: But it's strange that my target gene is in chr13, but the contig is in chr6.
gsnap
is for read alignment, you should use gmap
:
gmap -d <GENOME> -D <GMAPDB> <contigs.fa> -t <NUM_THREADS> -f samse -n 0 | samtools view -bhS - -o c2g.bam
the command is also coded in the fusion-bloom
script
Thanks you. I will try it again as you say.
The contig is short, i.e. not very high credibility. You can check some of your larger contigs, hopefully they should align to your target gene. If not, there may be something wrong with the read extraction step too.
I blat larger contigs to hg38 in UCSC genome browser and it's on chr13! Thank you very much for your patiently reply.
I found the problems. I didn't see the code of fusion-bloom before and run the process according to the paper. For instance, I assemble the fq with Tran-abyss, so the name of my contigs are S*. With the code of fusion-bloom, I succeeded found the mutations of target gene. At last, thank you very much, for developing the package and anwsering my questions!
Great, fusion-bloom is designed for you to specifically find fusions. Since you ran TAP, you should also get the results of map_splice.py
(e.g. novel_splicing.bedpe) If not, and you are also interested in novel splice variants for your target gene, you can run map_splice.py
:
map_splice.py <contigs_to_genome_bam> <contigs_fasta> <gtf> <genome_fasta> <outdir> --r2c <reads_to_contigs_bam>
you already have all the inputs for the script.
Also, tap2.py
uses RNABloom instead of Trans-ABySS for the assembly step, this may be the one to use if you want to use TAP again in future.
Thanks for your interest in our software!
I'm running PV to detect transcriptome structural variants, I have generated the, , and files corresponding to the samples.
And the command is show below:
find_sv_transcriptome.py --gbam c2g.bam --tbam c2t.bam --transcripts_fasta GRCh38_rna.fna --genome_index tools/share/hg38 hg38 --r2c r2c.bam contig.fa ref/GRCh38.gtf.gz hg38.fa outdir
But got the wrong message, the detail is:
However,I didn't find where discribe the attribute 'get_sequence'.