Run PV to detect transcriptome structural variants

yanlina0205 commented 3 years ago

I'm running PV to detect transcriptome structural variants, but I'm not sure about some parameters of this command: find_sv_transcriptome.py --gbam <contigs_to_genome_bam> --tbam <contigs_to_transcripts_bam> --transcripts_fasta <indexed_transcripts_fasta> --genome_index <GMAP index genome directory and name> --r2c <reads_to_contigs_bam> <contigs_fasta> <gtf> <genome_fasta> <outdir>

I have generated the <contigs_to_genome_bam>, <contigs_to_transcripts_bam>, <contigs_fasta> and <reads_to_contigs_bam> files corresponding to the samples.
But I don't know how to supply the paired reads, such as A_1.fa and A_2.fa, for <indexed_transcripts_fasta>. Must it be indexed by samtools or some other software?
Or does <indexed_transcripts_fasta> mean the files produced by aligning the raw data to the reference?
Does <gtf> mean the genome.gtf?
Can <outdir> be outdir/prefix?

Thank you.

kmnip commented 3 years ago

Hi @yanlina0205 ,

If you have paired-end RNA-seq read files, then I recommend using the fusion-bloom make script, which runs the whole pipeline, from transcriptome assembly through alignments to find_sv_transcriptome.py.

To answer your questions: The FASTA index can be generated with samtools faidx. You will see a *.fai file generated for your FASTA file.
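For reference, a .fai file is a small tab-separated table with one row per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below computes those five columns in pure Python for a tiny made-up FASTA, only to show what the index holds. fai_records is a hypothetical helper, not part of samtools or pavfinder; for real data, samtools faidx is the tool to use.

```python
def fai_records(data: bytes):
    """Return (name, length, offset, linebases, linewidth) for each FASTA record."""
    records = []
    current = None  # mutable record being filled: [name, length, offset, linebases, linewidth]
    position = 0    # byte offset of the line being read
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            # The sequence name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0].decode()
            # The first base starts right after the header line.
            current = [name, 0, position + len(line), 0, 0]
        elif current is not None and line.strip():
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                # Line geometry is taken from the first sequence line.
                current[3], current[4] = bases, len(line)
            current[1] += bases
        position += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


# Made-up example data, for illustration only.
example = b">tx1 first transcript\nACGT\nAC\n>tx2\nGGG\n"
for record in fai_records(example):
    print("\t".join(str(field) for field in record))
```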

The find_sv_transcriptome.py script expects the following:

query_fasta (<contigs_fasta> in the command above) - de novo transcriptome assembly FASTA of your reads (such as those from RNA-Bloom)
gtf - transcript annotation GTF (such as those from Ensembl, UCSC, etc.)
genome_fasta - reference genome FASTA file
outdir - directory path where output files will be generated (i.e. not a prefix)
--tbam - BAM file of query sequences aligned to reference transcripts
--gbam - BAM file of query sequences aligned to reference genome
--r2c - BAM file of reads aligned to transcriptome assembly
--transcripts_fasta - reference transcript sequences
--genome_index - GMAP index directory and name for reference genome
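
Putting the arguments above together, the call could be assembled roughly as in this sketch. Every file name is a made-up placeholder, and build_command is a hypothetical helper, not part of pavfinder. Passing the GMAP directory and name as two separate values is an assumption based on the "directory and name" wording; the script's own usage message is the authority on that.

```python
import shlex


def build_command(contigs_fasta, gtf, genome_fasta, outdir, *,
                  gbam, tbam, transcripts_fasta, genome_index, r2c):
    """Assemble the argument list using only the options quoted in this thread."""
    index_dir, index_name = genome_index  # assumption: passed as two values
    return [
        "find_sv_transcriptome.py",
        "--gbam", gbam,                            # contigs aligned to genome
        "--tbam", tbam,                            # contigs aligned to transcripts
        "--transcripts_fasta", transcripts_fasta,  # indexed with samtools faidx
        "--genome_index", index_dir, index_name,
        "--r2c", r2c,                              # reads aligned to contigs
        contigs_fasta, gtf, genome_fasta, outdir,  # positional arguments, in order
    ]


# Placeholder paths, for illustration only.
cmd = build_command(
    "contigs.fa", "annotation.gtf", "genome.fa", "pv_out",
    gbam="contigs_to_genome.bam",
    tbam="contigs_to_transcripts.bam",
    transcripts_fasta="transcripts.fa",
    genome_index=("/path/to/gmapdb", "genome_name"),
    r2c="reads_to_contigs.bam",
)
print(shlex.join(cmd))
```

Note that outdir is passed as a plain directory, not as a directory plus file prefix.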

@readmanchiu can correct if I am wrong. Hope that helps!

yanlina0205 commented 3 years ago

Thank you! I will try it as you said.

readmanchiu commented 3 years ago

Thanks @kmnip for answering for me; somehow this slipped through my emails. Yes, the descriptions are all correct. I think the common problem is a mismatch between the gtf file and the transcripts fasta file, which I think is also the cause of the next issue you reported. The sv events detected are referenced by the transcript ids provided in the gtf file, and in order to extract the transcript sequences, the transcripts fasta has to have the same ids. So I provided the extract_transcript_sequence.py script in the package to generate the transcripts fasta from the gtf, to make sure this is the case. Hopefully this will solve the problem you encountered.
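
A quick way to spot such a mismatch before running the script is to compare the two sets of ids directly. The sketch below is a minimal check and not part of pavfinder: the helper names and the sample records are made up, and it assumes transcript ids appear as transcript_id "..." in the GTF attribute column and as the first token of each FASTA header.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect transcript_id values from the 9th (attribute) column of a GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


# Made-up records, for illustration only.
gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX2";\n',
]
fasta = [">TX1 some description\n", "ACGT\n", ">TX2.1\n", "GGCC\n"]

missing_from_fasta = sorted(gtf_transcript_ids(gtf) - fasta_ids(fasta))
missing_from_gtf = sorted(fasta_ids(fasta) - gtf_transcript_ids(gtf))
print("in GTF but not in FASTA:", missing_from_fasta)
print("in FASTA but not in GTF:", missing_from_gtf)
```

In this toy input the FASTA id carries a version suffix (TX2.1) that the GTF id lacks, which is one way the two files can fall out of sync.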

Thanks for reporting the issue

bcgsc / pavfinder

Run PV to detect transcriptome structural variants #9