arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery
MIT License
309 stars 118 forks source link

Processing multiple samples #283

Open Jordi-V opened 5 years ago

Jordi-V commented 5 years ago

Hi Ryan,

I have a new question about processing multiple samples by LUMPY... I 've 808 wgs samples at 30X (human and ref genome which I used is hs37d5 like 1000G). I generate the discordant and split bams for each sample, and now I want to run lumpyexpress for all samples together. When I try to ran, it falls and I dont know why because dont give me any error... this is the command which I use: /apps/LUMPY/0.2.13/bin/lumpyexpress -B sample1.bam,sample2.bam,sample3.bam...to 808 -S sample1.splitters.sorted.bam,sample2.splitters.sorted.bam,sample3.splitters.sorted.bam... to 808 -D sample1.discordants.sorted.bam,sample2.discordants.sorted.bam,sample3.discordants.sorted.bam... to 808 -K /apps/LUMPY/0.2.13/bin/lumpyexpress.config -T tmp -o all_samples_Lumpy.vcf

There are any error?? all samples are sorted and "," separated. My second question is: do you consider if I run all samples together is too much for LUMPY? I see in post #66 you recommend run sample by sample and after use the scripts "scripts/l_sort.py" and "scripts/l_merge.py"... but I dont know If for my case will be ok... I want to do a reference panel like 1000G with my samples and I need to find recurrences between all 808 samples... So which is the best strategy?? and how I have to run these scripts??

Thanks for your time and happy new year!

Jordi

ryanlayer commented 5 years ago

I would suggest running lumpy on each sample individually, then merging all of them together with lsort and lmerge from svtools https://github.com/hall-lab/svtools

The process is: call and genotype (with svtyper https://github.com/hall-lab/svtyper) SVs on each sample. Make sure to pass the -P option to lumpyexpress sort all those VCFs into a single intermediate file with lsort merge into one SV callset with lmerge genotype those SVs against all of your samples with svtype.

You can also use smoove https://github.com/brentp/smoove, which handles all of the prior steps very well

On Thu, Jan 3, 2019 at 4:15 AM Jordi Valls Margarit < notifications@github.com> wrote:

Hi Ryan,

I have a new question about processing multiple samples by LUMPY... I 've 808 wgs samples at 30X (human and ref genome which I used is hs37d5 like 1000G). I generate the discordant and split bams for each sample, and now I want to run lumpyexpress for all samples together. When I try to ran, it falls and I dont know why because dont give me any error... this is the command which I use: /apps/LUMPY/0.2.13/bin/lumpyexpress -B sample1.bam,sample2.bam,sample3.bam...to 808 -S sample1.splitters.sorted.bam,sample2.splitters.sorted.bam,sample3.splitters.sorted.bam... to 808 -D sample1.discordants.sorted.bam,sample2.discordants.sorted.bam,sample3.discordants.sorted.bam... to 808 -K /apps/LUMPY/0.2.13/bin/lumpyexpress.config -T tmp -o all_samples_Lumpy.vcf

There are any error?? all samples are sorted and "," separated. My second question is: do you consider if I run all samples together is too much for LUMPY? I see in post #66 https://github.com/arq5x/lumpy-sv/issues/66 you recommend run sample by sample and after use the scripts "scripts/l_sort.py" and "scripts/l_merge.py"... but I dont know If for my case will be ok... I want to do a reference panel like 1000G with my samples and I need to find recurrences between all 808 samples... So which is the best strategy?? and how I have to run these scripts??

Thanks for your time and happy new year!

Jordi

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/arq5x/lumpy-sv/issues/283, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlDUS-qbbFPjzRKgcT-nn96LvTHYTjXks5u_eZhgaJpZM4Zn62e .

Jordi-V commented 5 years ago

Hi Ryan,

I have some problem to run lsort... I run lumpy sample by sample with -P option as you recommended me. And I have the output of all my 808 samples without any error. When I try to run svtools, with the command:

svtools lsort -f all_vcfs_lumpygz | bgzip -c > GCAT_all_samples_lumpy_sort.vcf.gz

I get the header with all my samples but without any call... So I dont know If I'm doing something wrong... the -f is the option to put a file with all my samples and paths respectively... So do you know what happen?? Any recommendation??

Thanks for your help

Jordi

ryanlayer commented 5 years ago

Do not bgzip lsort.

On Mon, Jan 7, 2019 at 9:26 AM Jordi Valls Margarit < notifications@github.com> wrote:

Hi Ryan,

I have some problem to run lsort... I run lumpy sample by sample with -P option as you recommended me. And I have the output of all my 808 samples without any error. When I try to run svtools, with the command:

svtools lsort -f all_vcfs_lumpygz | bgzip -c > GCAT_all_samples_lumpy_sort.vcf.gz

I get the header with all my samples but without any call... So I dont know If I'm doing something wrong... the -f is the option to put a file with all my samples and paths respectively... So do you know what happen?? Any recommendation??

Thanks for your help

Jordi

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arq5x/lumpy-sv/issues/283#issuecomment-451992307, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlDUbeCpCi0FcOHTQLSpnE6UEJ8Mu-Fks5vA3UrgaJpZM4Zn62e .

Jordi-V commented 5 years ago

Hi Ryan,

first of all thanks for your reply, help me a lot! Now I dont understand the output of SVTOOLS...

I run lumpyexpress with -P option. After that I genotype each sample with svtyper (if not the svtools lsort doesn't works) I ran svtools lsort, I supose that svtools lsort, sort the file by kind of variant, because in the first lines we found BND only. Afther that apperar others SV. But I dont understand the header of lsort, because the column format disappear, and the sample names are in the header, like you can see in above...

fileformat=VCFv4.2

reference=

source=LUMPY

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

ALT=

ALT=

ALT=

ALT=

ALT=

ALT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

SAMPLE=

SAMPLE=

This is correct??? The lsort step, sort the vcf by kind of variant?? because if I have to sort by chromosome and position, independently of kind of variant, this doesn't works...

Finally I ran the lmerge step, and I dont understand the output...

the header is exactly the same as lsort

fileformat=VCFv4.2

reference=

source=LUMPY

SAMPLE=

SAMPLE=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

ALT=

ALT=

ALT=

ALT=

ALT=

ALT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO

change a few things, like the position of samples names... In addition, there arent the format column... I suppose that in SNAME= flag, I will have to find if the two samples share the same variant or not, because if not I dont understand what do exactly the lmerge...I only found one sample, which comes from sample1 or 2 respectively... On the other hand, I dont know if the merge step works correctly, because in the example posted above, two samples share the same variant (output get it from lumpyexpress) and the sample2 are not in the lmerge output: sample1: 1 829170 1 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4 sample2: 1 829170 2 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4

The position found in my output after run lmerge is: 1 829170 3548 N 295.44 . SVTYPE=DEL;SVLEN=-35;END=829205;STRANDS=+-:4;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;SNAME=sample1:2;ALG=PROD

so how works lmerge??? Now I have to run svtyper with all my samples together in a vcf, but i dont know if I will get a vcf with all my samples in the header and know which samples share a same SV...

All steps are correct executed? lmerge works well?

Sorry for too much questions....

Thanks for your time

Jordi

ryanlayer commented 5 years ago

If you didn't genotype things then you need to use the -r option.

On Tue, Jan 8, 2019 at 4:51 AM Jordi Valls Margarit < notifications@github.com> wrote:

Hi Ryan,

first of all thanks for your reply, help me a lot! Now I dont understand the output of SVTOOLS...

I run lumpyexpress with -P option. After that I genotype each sample with svtyper (if not the svtools lsort doesn't works) I ran svtools lsort, I supose that svtools lsort, sort the file by kind of variant, because in the first lines we found BND only. Afther that apperar others SV. But I dont understand the header of lsort, because the column format disappear, and the sample names are in the header, like you can see in above...

fileformat=VCFv4.2

reference=

source=LUMPY

INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural

variant">

INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length

between REF and ALT alleles">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the

variant described in this record">

INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of

the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">

INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural

variation">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval

around POS for imprecise variants">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval

around END for imprecise variants">

INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval

(95%) around POS for imprecise variants">

INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval

(95%) around END for imprecise variants">

INFO=

INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated

to breakend">

INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in

a multi-line variants">

INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of

evidence supporting the variant across all samples">

INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end

reads supporting the variant across all samples">

INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads

supporting the variant across all samples">

INFO=<ID=BD,Number=.,Type=Integer,Description="Amount of BED evidence

supporting the variant across all samples">

INFO=<ID=EV,Number=.,Type=String,Description="Type of LUMPY evidence

contributing to the variant call">

INFO=<ID=PRPOS,Number=.,Type=String,Description="LUMPY probability curve

of the POS breakend">

INFO=<ID=PREND,Number=.,Type=String,Description="LUMPY probability curve

of the END breakend">

INFO=

INFO=<ID=ALG,Number=1,Type=String,Description="Evidence PDF aggregation

algorithm">

ALT=

ALT=

ALT=

ALT=

ALT=

ALT=

FORMAT=

FORMAT=<ID=SU,Number=1,Type=Integer,Description="Number of pieces of

evidence supporting the variant">

FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end

reads supporting the variant">

FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads

supporting the variant">

FORMAT=<ID=BD,Number=1,Type=Integer,Description="Amount of BED evidence

supporting the variant">

FORMAT=

FORMAT=<ID=SQ,Number=1,Type=Float,Description="Phred-scaled probability

that this site is variant (non-reference in this sample">

FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood,

log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">

FORMAT=

FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele

observation count, with partial observations recorded fractionally">

FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele

observations, with partial observations recorded fractionally">

FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of

reference observations">

FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of

alternate observations">

FORMAT=<ID=RS,Number=1,Type=Integer,Description="Reference allele

split-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=AS,Number=A,Type=Integer,Description="Alternate allele

split-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=ASC,Number=A,Type=Integer,Description="Alternate allele

clipped-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=RP,Number=1,Type=Integer,Description="Reference allele

paired-end observation count, with partial observations recorded fractionally">

FORMAT=<ID=AP,Number=A,Type=Integer,Description="Alternate allele

paired-end observation count, with partial observations recorded fractionally">

FORMAT=<ID=AB,Number=A,Type=Float,Description="Allele balance, fraction

of observations from alternate allele, QA/(QR+QA)">

SAMPLE=

SAMPLE=

This is correct??? The lsort step, sort the vcf by kind of variant?? because if I have to sort by chromosome and position, independently of kind of variant, this doesn't works...

Finally I ran the lmerge step, and I dont understand the output...

the header is exactly the same as lsort

fileformat=VCFv4.2

reference=

source=LUMPY

SAMPLE=

SAMPLE=

INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural

variant">

INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length

between REF and ALT alleles">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the

variant described in this record">

INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of

the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">

INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural

variation">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval

around POS for imprecise variants">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval

around END for imprecise variants">

INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval

(95%) around POS for imprecise variants">

INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval

(95%) around END for imprecise variants">

INFO=

INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated

to breakend">

INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in

a multi-line variants">

INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of

evidence supporting the variant across all samples">

INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end

reads supporting the variant across all samples">

INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads

supporting the variant across all samples">

INFO=<ID=BD,Number=.,Type=Integer,Description="Amount of BED evidence

supporting the variant across all samples">

INFO=<ID=EV,Number=.,Type=String,Description="Type of LUMPY evidence

contributing to the variant call">

INFO=<ID=PRPOS,Number=.,Type=String,Description="LUMPY probability curve

of the POS breakend">

INFO=<ID=PREND,Number=.,Type=String,Description="LUMPY probability curve

of the END breakend">

INFO=

INFO=<ID=ALG,Number=1,Type=String,Description="Evidence PDF aggregation

algorithm">

ALT=

ALT=

ALT=

ALT=

ALT=

ALT=

FORMAT=

FORMAT=<ID=SU,Number=1,Type=Integer,Description="Number of pieces of

evidence supporting the variant">

FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end

reads supporting the variant">

FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads

supporting the variant">

FORMAT=<ID=BD,Number=1,Type=Integer,Description="Amount of BED evidence

supporting the variant">

FORMAT=

FORMAT=<ID=SQ,Number=1,Type=Float,Description="Phred-scaled probability

that this site is variant (non-reference in this sample">

FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood,

log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">

FORMAT=

FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele

observation count, with partial observations recorded fractionally">

FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele

observations, with partial observations recorded fractionally">

FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of

reference observations">

FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of

alternate observations">

FORMAT=<ID=RS,Number=1,Type=Integer,Description="Reference allele

split-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=AS,Number=A,Type=Integer,Description="Alternate allele

split-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=ASC,Number=A,Type=Integer,Description="Alternate allele

clipped-read observation count, with partial observations recorded fractionally">

FORMAT=<ID=RP,Number=1,Type=Integer,Description="Reference allele

paired-end observation count, with partial observations recorded fractionally">

FORMAT=<ID=AP,Number=A,Type=Integer,Description="Alternate allele

paired-end observation count, with partial observations recorded fractionally">

FORMAT=<ID=AB,Number=A,Type=Float,Description="Allele balance, fraction

of observations from alternate allele, QA/(QR+QA)">

CHROM POS ID REF ALT QUAL FILTER INFO

change a few things, like the position of samples names... In addition, there arent the format column... I suppose that in SNAME= flag, I will have to find if the two samples share the same variant or not, because if not I dont understand what do exactly the lmerge...I only found one sample, which comes from sample1 or 2 respectively... On the other hand, I dont know if the merge step works correctly, because in the example posted above, two samples share the same variant (output get it from lumpyexpress) and the sample2 are not in the lmerge output: sample1: 1 829170 1 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4 sample2: 1 829170 2 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4

The position found in my output after run lmerge is: 1 829170 3548 N 295.44 . SVTYPE=DEL;SVLEN=-35;END=829205;STRANDS=+-:4;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;SNAME=sample1:2;ALG=PROD

so how works lmerge??? Now I have to run svtyper with all my samples together in a vcf, but i dont know if I will get a vcf with all my samples in the header and know which samples share a same SV...

All steps are correct executed? lmerge works well?

Sorry for too much questions....

Thanks for your time

Jordi

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arq5x/lumpy-sv/issues/283#issuecomment-452272012, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlDUQlah6pUTBRh7zNbizXp_AOF7Im5ks5vBIZTgaJpZM4Zn62e .

Jordi-V commented 5 years ago

Dear Ryan,

I have a new question about svtyper... I have 808 samples at 30X, I ran Lumpy express for each sample individually, after that I use svtyper in order to genotype all samples individually, without merging. On the other hand, after ran lumpyexpress, I merge all vcfs using svtools the lsort and merge, and I use svtyper to genotype all samples together, the command is like this:

svtyper -B z.bam,z.bam,l.bam .... -i svtoolls_merge.vcf -l all_bam.json > all.vcf

I dont know If I ran corretly the svtyper or is better ran svtyper for sample individually using the consensus vcf from lumpyexpress...

Finally I compare the results of two vcfs, one using svtyper for each sample indiviually, without merge all vcf together and other as I show above.

My surprise comes when I found more variants in the vcf from all samples together for a sample X than ran individually.

Could you explain me why?? The lumpyexpress dont detect all variants from a sample? and svtyper genotype these regions and found an heterozygous? I know when use a vcf from all samples the svtyper try to genotype all positions from this vcf from each sample, but I hope to find the places where lumpyexpress dont find anything a genotype 0/0 no 0/1....

Thanks for your help

Jordi

Jordi-V commented 5 years ago

Dear Ryan,

Could you tell me the objective of regenotype all samples separately when we have all vcf merged?? Why when I regenotype I found more variants than without merge??? If I dont merge all samples and re-genotype with svtyper is a wrong practice??

Thanks for your help

Jordi

ryanlayer commented 5 years ago

On Feb 15, 2019, at 3:37 AM, Jordi Valls Margarit notifications@github.com wrote:

Dear Ryan,

I have a new question about svtyper... I have 808 samples at 30X, I ran Lumpy express for each sample individually, after that I use svtyper in order to genotype all samples individually, without merging. On the other hand, after ran lumpyexpress, I merge all vcfs using svtools the lsort and merge, and I use svtyper to genotype all samples together, the command is like this:

svtyper -B z.bam,z.bam,l.bam .... -i svtoolls_merge.vcf -l all_bam.json > all.vcf

I dont know If I ran corretly the svtyper or is better ran svtyper for sample individually using the consensus vcf from lumpyexpress...

That looks right.

Finally I compare the results of two vcfs, one using svtyper for each sample indiviually, without merge all vcf together and other as I show above.

My surprise comes when I found more variants in the vcf from all samples together for a sample X than ran individually.

Could you explain me why??

This is my we joint call variants. The SV may have been missed in one sample due to a drop in coverage or some other sample-specific noise. But since it was found in anther sample and joint genotypes it was recovered. Discovery is harder than genotyping because the tolerance for noise is so much lower. The lumpyexpress dont detect all variants from a sample? and svtyper genotype these regions and found an heterozygous? I know when use a vcf from all samples the svtyper try to genotype all positions from this vcf from each sample, but I hope to find the places where lumpyexpress dont find anything a genotype 0/0 no 0/1....

Can you explain your experimental goal again? I don’t quite understand. Thanks for your help

Jordi

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

ryanlayer commented 5 years ago

I hope I answers this in my reply to your earlier email.

On Feb 21, 2019, at 2:43 AM, Jordi Valls Margarit notifications@github.com wrote:

Why

Jordi-V commented 5 years ago

Hi Ryan,

thanks a lot for your reply. My last question was, the lumpyexpress dont detect all variants for each sample? so I dont understand why svtyper genotype these regions and found heterozygous, because in the first step dont find anything and finally in the re-genotyping step find new variants...

Yes you answer my question, so If I'm correctly, to re-genotyping all my samples with a merge vcf is "only" for checking if we miss some variants in the first step for each sample, even more we get the genotype in regions where other samples get a variant... If we dont make this final step, (merging and re-genotype), we will loose sensitivity but the precision will be better, because in Discovery step the filters are harder...

I'm right?

Thanks a lot for your time

Jordi

ryanlayer commented 5 years ago

On Thu, Feb 21, 2019 at 7:49 AM Jordi Valls Margarit < notifications@github.com> wrote:

Hi Ryan,

thanks a lot for your reply. My last question was, the lumpyexpress dont detect all variants for each sample? so I dont understand why svtyper genotype these regions and found heterozygous, because in the first step dont find anything and finally in the re-genotyping step find new variants...

Yes you answer my question, so If I'm correctly, to re-genotyping all my samples with a merge vcf is "only" for checking if we miss some variants in the first step for each sample, even more we get the genotype in regions where other samples get a variant... If we dont make this final step, (merging and re-genotype), we will loose sensitivity but the precision will be better, because in Discovery step the filters are harder...

I'm right?

Correct!

Thanks a lot for your time

Jordi

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arq5x/lumpy-sv/issues/283#issuecomment-466027376, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlDUdYohP6I80UdpW-k75dzbqV9G4l-ks5vPrH7gaJpZM4Zn62e .