arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery
MIT License
307 stars 119 forks source link

Use of soft clipped reads in Lumpy #328

Open lee039 opened 4 years ago

lee039 commented 4 years ago

Dear,

I used Lumpy via Smoove to call CNVs in a population of ~600 samples. First, I used 200 high coverage samples to discover CNVs, and then genotyped the sites in the entire population.

In the CNV data set, I have one frequent duplication. This duplication is a tandem duplication of multiple copies, so the start and end positions are very evident in IGV, and realized that the ending position provided by Lumpy is not precise.

The coordinate from Lumpy is chr17:10,049,651-10,061,320 (This is a pseudo position.) The actual ending position in IGV is 10,061,434. (114bp difference). In the true ending position shown in IGV, there are many soft clipped reads, which will enable Lumpy to find the coordinates at a base-pair resolution, but it seems like Lumpy did not use them somehow...

I checked some output files. The initial CNV discovery in ~200 high coverage samples generated vcf files for each sample. In these files, this duplication INFO is as follow (example from one sample): CIPOS=-523,29;CIEND=-30,638;CIPOS95=-109,1;CIEND95=-4,75 So, in this stage, the true end position was within CIEND range.

Afterwards, I re-genotyped the 600 samples, and in the final vcf file, the INFO field is as follow: CIPOS=-140,2;CIEND=-17,99;CIPOS95=0,0;CIEND95=-1,1;SU=6362;PE=6362;SR=0; Lumpy CIEND does not include the true end position. CIEND95 clearly eliminated the true end position.

I am curious about a couple of things.

  1. how could Limpy give CIEND95=-1,1 (to me looks like a very high resolution), relying solely on PE evidence? I thought either soft clipped reads or split reads evidence is crucial for ascertaining the st/end coordinates.
  2. would there be a possibility to enable Lumpy to understand the true ending position?

image It is the IGV screenshot. In the true ending position, many soft clipped reads are available. The position marked with red is where Lumpy thinks the duplication ends.

ryanlayer commented 4 years ago

On Mar 11, 2020, at 2:52 AM, Limitto notifications@github.com wrote:

 Dear,

I used Lumpy via Smoove to call CNVs in a population of ~600 samples. First, I used 200 high coverage samples to discover CNVs, and then genotyped the sites in the entire population.

In the CNV data set, I have one frequent duplication. This duplication is a tandem duplication of multiple copies, so the start and end positions are very evident in IGV, and realized that the ending position provided by Lumpy is not precise.

The coordinate from Lumpy is chr6:10,049,651-10,061,320 (This is a pseudo position.) The actual ending position in IGV is 10,061,434. (114bp difference). In the true ending position shown in IGV, there are many soft clipped reads, which will enable Lumpy to find the coordinates at a base-pair resolution, but it seems like Lumpy did not use them somehow...

I checked some output files. The initial CNV discovery in ~200 high coverage samples generated vcf files for each sample. In these files, this duplication INFO is as follow (example from one sample): CIPOS=-523,29;CIEND=-30,638;CIPOS95=-109,1;CIEND95=-4,75 So, in this stage, the true end position was within CIEND range.

Afterwards, I re-genotyped the 600 samples, and in the final vcf file, the INFO field is as follow: CIPOS=-140,2;CIEND=-17,99;CIPOS95=0,0;CIEND95=-1,1;SU=6362;PE=6362;SR=0; Lumpy CIEND does not include the true end position. CIEND95 clearly eliminated the true end position.

From this, it looks like there were 6362 paired end reads covering this site, which is VERY high and zero split reads. I would have probably classified this as a false positive given the depth and no sr.

Ask the size of the confidence interval is very large. Something is funny here.

Can you describe the sequencing you used?

What did you use to align the reads?

Is this region in a repeat?

Lumpy will use a soft clip only if the clipped region also maps to the genome (a split read). The fact that none or those clips aligned is highly suspect.

I am curious about a couple of things.

how could Limpy give CIEND95=-1,1 (to me looks like a very high resolution), relying solely on PE evidence? I thought either soft clipped reads or split reads evidence is crucial for ascertaining the st/end coordinates would there be a possibility to enable Lumpy to understand the true ending position?

It is the IGV screenshot. In the true ending position, many soft clipped reads are available. The position marked with red is where Lumpy thinks the duplication ends.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

lee039 commented 4 years ago

A small piece of extra information about the duplication: As I mentioned above, I am using Smoove, and thus start CNV calling by running samblaster to collect split & discordant reads. Many of discordant reads at the end of the duplication, are soft clipped (the screenshot above show many colourful reads, indicating soft clip). I checked the disc.bam & split.bam from the samblast and realized that discordant&soft clipped reads are not in those files. I am guessing something might have gone wrong in the first step..

I will answer your questions soon!

lee039 commented 4 years ago

Hi again,

I used samblaster with the following command: python bamgroupreads.py -f -M -i $bam | samblaster --ignoreUnmated -M -a -e -d results-smoove/$sname.disc.sam -s results-smoove/$sname.split.sam -o /dev/null

I realized that, in the original.bam file there are a large number of reads (I am talking about 30X sample, having 5 copy duplication) that are either discordant or discordant + SC. In the samblaster output disc.bam file, there is a gap (chr1:10,061,199-10,067,109) with no reads. In the original bam file there are a lot of disc/disc+SC reads. So I think something went wrong with samblaster.

Can you describe the sequencing you used? ==> it is Illumina short-read sequencing. read length of 100bp, fragment length ~500bp.

What did you use to align the reads? ==> BWA-MEM

Is this region in a repeat? ==> there are some repeats inside the duplication. In the duplication ending region, no repeats. in the duplication starting region, there is repeat spanning over the breakpoint. could this be a problem?

Please feel free to advise me on my use of Samblaster, or any other things I can do, if you see something odd.... Yet, something is strange... because my deletion calls look fine. This peculiar duplication is very strange.

Thanks a lot !

ryanlayer commented 4 years ago

If you are using smoove, then you shouldn't do those pre-processing steps yourself. Just run

smoove call -x --name my-cohort --exclude $bed --fasta $fasta -p $threads --genotype /path/to/*.bam

On Thu, Mar 12, 2020 at 11:54 AM Limitto notifications@github.com wrote:

Hi again,

I used samblaster with the following command: python bamgroupreads.py -f -M -i $bam | samblaster --ignoreUnmated -M -a -e -d results-smoove/$sname.disc.sam -s results-smoove/$sname.split.sam -o /dev/null

I realized that, in the original.bam file there are a large number of reads (I am talking about 30X sample, having 5 copy duplication) that are either discordant or discordant + SC. In the samblaster output disc.bam file, there is a gap (chr1:10,061,199-10,067,109) with no reads. In the original bam file there are a lot of disc/disc+SC reads. So I think something went wrong with samblaster.

Can you describe the sequencing you used? ==> it is Illumina short-read sequencing. read length of 100bp, fragment length ~500bp.

What did you use to align the reads? ==> BWA-MEM

Is this region in a repeat? ==> there are some repeats inside the duplication. In the duplication ending region, no repeats. in the duplication starting region, there is repeat spanning over the breakpoint. could this be a problem?

Please feel free to advise me on my use of Samblaster, or any other things I can do, if you see something odd.... Yet, something is strange... because my deletion calls look fine. This peculiar duplication is very strange.

Thanks a lot !

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arq5x/lumpy-sv/issues/328#issuecomment-598335220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEUGUPKK6H6CCYH7E4YSF3RHEOU5ANCNFSM4LFQUBCQ .

lee039 commented 4 years ago

Thanks for the recommendation. I will run is as you suggested and let you know the outcome.