RG handling

JJBio commented 4 years ago

If I understood correctly, I cannot use lumpy express on a merged bam file with different RGs.

So, I have to have separate bam files (aligned with bwa mem -M -R ) and do this on each:

Sort & Index with Picard tools
Mark duplicates with Picard tools
Sort & Index with Picard tools
Extract the discordant paired-end alignments
Extract the split-read alignments
Sort discordants & splitters with samtools sort
Generate empirical insert size statistics for each bam file
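For reference, the extraction and histogram steps in the list above could be sketched roughly as follows. This follows the commands shown in the lumpy README as far as I remember them; the LUMPY_SCRIPTS location, the function names, and all file names are assumptions for illustration, and the flags should be checked against your installed versions of samtools and lumpy.

```shell
# Sketch only: LUMPY_SCRIPTS, the function names, and the file names are assumptions.
LUMPY_SCRIPTS="${LUMPY_SCRIPTS:-scripts}"

# Extract and sort the discordant pairs and split reads from one BAM.
# $1 = input BAM, $2 = output prefix
extract_sv_reads() {
    local bam="$1" prefix="$2"
    # -F 1294 drops proper pairs, unmapped reads or mates,
    # secondary alignments, and marked duplicates.
    samtools view -b -F 1294 "$bam" \
        | samtools sort -o "$prefix.discordants.bam" -
    # Split reads are pulled out by the helper script shipped with lumpy.
    samtools view -h "$bam" \
        | "$LUMPY_SCRIPTS/extractSplitReads_BwaMem" -i stdin \
        | samtools view -b - \
        | samtools sort -o "$prefix.splitters.bam" -
}

# Build the empirical insert size histogram for one read group.
# $1 = input BAM, $2 = read group ID, $3 = read length, $4 = output histogram
insert_size_histo() {
    local bam="$1" rg="$2" read_len="$3" out="$4"
    # Skip the first alignments, then sample pairs from this read group only.
    samtools view -r "$rg" "$bam" \
        | tail -n +100000 \
        | "$LUMPY_SCRIPTS/pairend_distro.py" -r "$read_len" -X 4 -N 10000 -o "$out"
}

# Example (not run here):
#   extract_sv_reads sample.bam sample
#   insert_size_histo sample.bam rg1 101 sample.lib1.histo
```

Nothing is executed until the functions are called, so the sketch can be sourced and adapted safely.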

Then run lumpy like this:

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample,read_group:rg1,bam_file:sample.discordants.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -pe id:sample,read_group:rg2,bam_file:sample.discordants.bam,histo_file:sample.lib2.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > sample.vcf

What I am confused about is this line:

-sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20

Do I merge the splitters from the different RGs into one bam file? And are back_distance:10,weight:1,min_mapping_threshold:20 default parameters that should be kept regardless of the RG?

And my last question (sorry for the many questions): What do I do when I have a complex design, that is, multiple libraries with different insert sizes sequenced on multiple lanes? Should I feed all of them separately, or do I only treat different libraries as different RGs in this case? Thanks so much!!!
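When sorting out a design like this, it can help to first see which read groups belong to which library. The sketch below assumes the BAM header carries standard @RG lines with ID and LB tags (as written by bwa mem -R); the function name rg_libraries is made up for illustration.

```shell
# Print "read_group_id<TAB>library" for every @RG line of a SAM header on stdin.
# Assumes tab-separated @RG lines with ID: and LB: tags.
rg_libraries() {
    awk -F '\t' '$1 == "@RG" {
        id = ""; lb = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^ID:/) id = substr($i, 4)
            if ($i ~ /^LB:/) lb = substr($i, 4)
        }
        print id "\t" lb
    }'
}

# Example (not run here):
#   samtools view -H sample.bam | rg_libraries
```

Read groups that report the same library were prepared together, while different libraries may have different insert size distributions.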

ryanlayer commented 4 years ago

Each “-pe” will get its own histogram, so separating out the different libraries with different properties is a good move. Do not merge things back together. Each sample should have its own “-pe” and “-sr”.
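As an illustration of one “-pe” per library, a small helper can assemble each option string from the per-library values. This is only a sketch: pe_opt is a hypothetical helper, the fixed values (discordant_z:5, back_distance:10, weight:1, min_mapping_threshold:20) are copied from the command in the question, and mean and stdev should come from each library's own histogram step.

```shell
# pe_opt: print one lumpy -pe option string for a single read group / library.
# $1 = sample id, $2 = read group, $3 = discordant BAM, $4 = histogram file,
# $5 = mean, $6 = stdev, $7 = read length (also used for min_non_overlap)
pe_opt() {
    printf 'id:%s,read_group:%s,bam_file:%s,histo_file:%s,mean:%s,stdev:%s,read_length:%s,min_non_overlap:%s,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$7"
}

# Example (not run here), one -pe per library:
#   lumpy -mw 4 -tt 0 \
#       -pe "$(pe_opt sample rg1 sample.discordants.bam sample.lib1.histo 500 50 101)" \
#       -pe "$(pe_opt sample rg2 sample.discordants.bam sample.lib2.histo 500 50 101)" \
#       -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
#       > sample.vcf
```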

That said, you should use our new lumpy wrapper, smoove:

https://github.com/brentp/smoove

Then:

call each sample separately
merge the calls into one set of SV sites
genotype those sites for each sample

Smoove makes all of those steps easy.
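The three steps above might be sketched as follows. The subcommands and flags are taken from the smoove README as I recall it and should be checked against smoove --help. The run helper, the function names, the output directories, and the DRY_RUN switch are assumptions added for illustration, and optional flags such as an exclude BED are left out.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: call each sample separately. $1 = sample name, $2 = BAM, $3 = reference FASTA
call_sample() {
    run smoove call --outdir results-smoove/ --name "$1" --fasta "$3" -p 1 --genotype "$2"
}

# Step 2: merge the per-sample calls into one set of SV sites. $1 = reference FASTA
merge_sites() {
    run smoove merge --name merged -f "$1" --outdir ./ results-smoove/*.genotyped.vcf.gz
}

# Step 3: genotype the merged sites in each sample. Arguments as in call_sample.
genotype_sample() {
    run smoove genotype -d -x -p 1 --name "$1-joint" --outdir results-genotyped/ \
        --fasta "$3" --vcf merged.sites.vcf.gz "$2"
}

# Example (not run here):
#   for s in s1 s2; do call_sample "$s" "$s.bam" ref.fa; done
#   merge_sites ref.fa
#   for s in s1 s2; do genotype_sample "$s" "$s.bam" ref.fa; done
```

Setting DRY_RUN=1 prints the commands instead of running them, which is a cheap way to review them before starting a long job.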


JJBio commented 4 years ago

Great tool - thanks so much! I will try it out.

I still have some questions about how smoove handles RGs, but I have now posted them as a separate issue here, on how to treat an individual sample that has data from more than one library and lane.

Again, thanks so much!

arq5x / lumpy-sv

RG handling #312