arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery
MIT License

RG handling #312

Open JJBio opened 4 years ago

JJBio commented 4 years ago

If I understood correctly, I cannot use lumpyexpress on a merged bam file with different RGs.

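So, I have to have separate bam files (aligned with bwa mem -M -R ) and do this on each:

1. Sort & index with Picard tools
2. Mark duplicates with Picard tools
3. Sort & index with Picard tools
4. Extract the discordant paired-end alignments
5. Extract the split-read alignments
6. Sort discordants & splitters with samtools sort
7. Generate empirical insert size statistics for each bam file

Then run lumpy like this: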

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample,read_group:rg1,bam_file:sample.discordants.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -pe id:sample,read_group:rg2,bam_file:sample.discordants.bam,histo_file:sample.lib2.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > sample.vcf

What I am confused about is this line:

    -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20

Do I merge the splitters from the different RGs into one bam file? And are back_distance:10,weight:1,min_mapping_threshold:20 default parameters that should be kept regardless of the RG?

And my last question (sorry for the many questions): what do I do when I have a complex design, i.e. multiple libraries with different insert sizes across multiple lanes? Should I feed all of them in separately, or do I only treat different libraries as different RGs in this case? Thanks so much!!!

ryanlayer commented 4 years ago

Each “-pe” will get its own histogram, so separating out the different libraries with different properties is a good move. Do not merge things back together. Each sample should have its own “-pe” and “-sr”.
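To make the "one -pe per library" advice concrete, here is a minimal sketch that assembles the lumpy command line from a small table of libraries. It only prints the command for review and does not run lumpy. The read-group IDs, file names, and numeric values are placeholders copied from the command earlier in this thread, and the `pe_opt` helper and `libs` table are hypothetical names introduced for illustration. Keeping a single `-sr` entry per sample mirrors the original command and is an assumption here, not a confirmed recommendation.

```shell
#!/bin/sh
# Sketch: build one -pe option per library/read group, each with its own
# histogram file. All values below are placeholders from the thread.

sample=sample

# pe_opt RG HISTO MEAN STDEV: print one -pe option string for lumpy.
pe_opt() {
    printf '%s' "id:${sample},read_group:$1,bam_file:${sample}.discordants.bam,histo_file:$2,mean:$3,stdev:$4,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20"
}

# One "read_group histo_file mean stdev" record per library (hypothetical table).
libs='rg1 sample.lib1.histo 500 50
rg2 sample.lib2.histo 500 50'

cmd="lumpy -mw 4 -tt 0"

# The loop reads from a here-document, so it runs in the current shell
# and the updates to $cmd are kept.
while read -r rg histo mean stdev; do
    cmd="$cmd -pe $(pe_opt "$rg" "$histo" "$mean" "$stdev")"
done <<EOF
$libs
EOF

# A single -sr entry for the sample, as in the original command (assumption).
cmd="$cmd -sr id:${sample},bam_file:${sample}.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20"

# Print the assembled command instead of executing it.
printf '%s\n' "$cmd > ${sample}.vcf"
```

In practice the mean and stdev for each library would come from that library's own insert size statistics rather than being shared, which is the point of keeping the libraries separate.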

That said, you should use our new lumpy wrapper, smoove:

https://github.com/brentp/smoove

Then

Smoove makes all of those steps easy.
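As a rough illustration of a single-sample smoove run, the sketch below follows the `smoove call` form shown in the smoove README. The sample name, BAM, reference, and output directory are hypothetical placeholders. The script prints the command unless smoove and the input files are actually present. It assumes smoove is given the full alignment file for the sample, and it says nothing about how smoove treats multiple read groups within that file.

```shell
#!/bin/sh
# Sketch: call SVs for one sample with smoove. All paths are placeholders.

sample=sample
bam="${sample}.bam"        # full BAM for the sample (assumed input)
fasta=reference.fa         # reference FASTA used for alignment (placeholder)
outdir=results-smoove      # output directory (placeholder)

smoove_cmd="smoove call --outdir $outdir --name $sample --fasta $fasta -p 1 --genotype $bam"

if command -v smoove >/dev/null 2>&1 && [ -f "$bam" ] && [ -f "$fasta" ]; then
    # Unquoted on purpose so the string splits into arguments;
    # this breaks on paths containing spaces.
    $smoove_cmd
else
    printf 'would run: %s\n' "$smoove_cmd"
fi
```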


JJBio commented 4 years ago

Great tool - thanks so much! I will try it out.

I still have some questions about how smoove handles the RGs, but I have posted those as a separate issue here, on how to treat an individual sample that has data from more than one library and lane.

Again, thanks so much!