GreeshmaThulasi opened this issue 6 years ago
Hi Greeshma,
I'm not sure exactly what size the microdeletions you are looking for would be; WISECONDOR was originally written to target fairly lengthy but barely deviating CNVs. I have found the latest version was able to find a CNV between 3 and 4 Mb, but I'd be hesitant to take such short CNV results as truth without further testing.
To answer your questions directly:
Does a z-score above the 5.08 threshold mean a duplication, and below the threshold a deletion?
A z-score above 5.08 means a duplication; a z-score below -5.08 means a deletion. Anything between -5.08 and 5.08 is considered unaffected.
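To illustrate what the thresholding boils down to, here is a minimal sketch of the decision rule (my own illustration, not WISECONDOR's actual code; the example z-scores are taken from your output):

```python
# Minimal sketch of the thresholding described above (illustration only).
Z_THRESHOLD = 5.08

def classify_call(z_score, threshold=Z_THRESHOLD):
    """Label a region by its z-score."""
    if z_score > threshold:
        return "duplication"
    if z_score < -threshold:
        return "deletion"
    return "unaffected"

print(classify_call(6.19))   # duplication
print(classify_call(-6.26))  # deletion
print(classify_call(3.00))   # unaffected
```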
What is the effect field for?
That's the effect size: the determined percentage of copy number change for that particular region. If it says 100, it found twice as many DNA fragments as expected; 5 means it found 5% more DNA fragments.
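In other words, the effect size maps directly to the ratio of observed to expected fragment counts; a quick sketch of that relationship (illustration only, not WISECONDOR code):

```python
# Relate an effect size (in %) to the observed/expected fragment ratio.
def fragment_ratio(effect_pct):
    """100 -> 2.0x the expected fragments; 5 -> 1.05x; -6.52 -> ~0.93x."""
    return 1.0 + effect_pct / 100.0

print(fragment_ratio(100))    # 2.0
print(fragment_ratio(5))      # 1.05
print(fragment_ratio(-6.52))  # 0.9348
```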
Does the location indicate the chromosome and start-end base positions?
That is correct, it is chr:start-end
I read that there are different methods, like: single bin, bin test; single bin, aneuploidy test; windowed, bin test; windowed, aneuploidy test; chromosome wide, aneuploidy test. How do I choose between these methods?
Those were implemented in an older version of WISECONDOR (as described in the paper). If you wish to use that version you can find it in the legacy branch: https://github.com/VUmcCGP/wisecondor/tree/legacy The master branch at this point has no such test differences; instead it uses a segmentation algorithm to find the optimal CNV cut-offs.
If you really aim to find small CNVs, perhaps the input data is a bit limiting: it seems you have ~4 million reads. I'd suggest trying something over ~10 million, and using a fairly large set of training samples, if you are unable to find known short CNVs.
Additionally, I believe this fork of WISECONDOR could be of interest to you, as it should contain several improvements over my work: https://github.com/leraman/wisecondorX Perhaps that can help you find microdeletions better. It should be faster, and it is actively being developed right now.
Let me know if something is still unclear.
Thank you so much Roy Straver, your reply was very clear and informative for me. I will use samples with > 8 million reads. A few more doubts: while creating the reference set, should the samples be normal, i.e. without any microdeletions or microduplications? If we add reference samples with read counts like 8 million, 10 million, 12 million etc., will it affect the efficacy of this tool? Do we need to fix the read counts within a stringent range like 11-12 million only (by not including samples with low or high coverage)? Does this tool use a sliding-window approach? Is it better to reduce or increase the bin size? I will try the extended version WisecondorX too.
Thank you Greeshma.
While creating the reference set, should the samples be normal, i.e. without any microdeletions or microduplications?
Training samples should preferably be without any CNVs. However, it's pretty much impossible to ensure that is true, and if you use many reference samples (i.e. hundreds) and a few (one or two) have the same CNV, I highly doubt it's going to influence your sensitivity much (if at all), as it's not really systematic behaviour.
If we add reference samples with read counts like 8 million, 10 million, 12 million etc., will it affect the efficacy of this tool? Do we need to fix the read counts within a stringent range like 11-12 million only (by not including samples with low or high coverage)?
Anything from 5 to 20 million reads should be fine; there is no need to be very stringent unless you go to very small bin sizes, in which case you may want to ensure you have enough coverage per bin.
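As a back-of-the-envelope check (my own sketch, not part of WISECONDOR), you can estimate the average number of reads per bin from the total read count, a rough mappable genome size and the bin size:

```python
# Rough average reads per bin, assuming reads spread evenly over the genome.
def reads_per_bin(total_reads, bin_size, genome_size=3.1e9):
    n_bins = genome_size / bin_size
    return total_reads / n_bins

# ~5 million reads with 1 Mb bins versus 50 kb bins:
print(reads_per_bin(5e6, 1_000_000))  # ~1600 reads per bin
print(reads_per_bin(5e6, 50_000))     # ~80 reads per bin, so much noisier counts
```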
Does this tool use a sliding-window approach?
The master branch does not use the sliding-window approach; it has been replaced by a segmentation step. That step gives a Stouffer's z-score for a region of any possible length, making sure the z-score is the (absolute) maximum possible for that region.
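For reference, Stouffer's method combines the per-bin z-scores in a region into a single region-wide z-score; this is the general formula, not the exact segmentation code:

```python
import math

def stouffer_z(bin_z_scores):
    """Combine per-bin z-scores into one region z-score (Stouffer's method)."""
    return sum(bin_z_scores) / math.sqrt(len(bin_z_scores))

# Four mildly deviating adjacent bins combine into a stronger region score:
print(stouffer_z([2.0, 2.5, 1.8, 2.2]))  # ~4.25
```

The segmentation step then searches for the region boundaries that make the absolute value of this combined score as large as possible.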
Is it better to reduce or increase the bin size?
It's a trade-off: smaller bins mean less data per bin, but more bins to use as reference bins. It surely needs more time per sample and may increase erratic behaviour at low coverage, but if enough training data is available it may also give good results on small CNVs. Larger bins would be the exact opposite, with the upside that the read coverage per bin is likely a bit more stable, allowing analysis of lower-coverage samples. Seeing there are few bins left to use as a reference with bin sizes > 2 Mb, I'd advise staying with smaller bin sizes. I'd guess about 100 kb, or maybe 50 kb, is the smallest you could try; beyond that the reliability and time per sample may not be worth it, but that may be solved in WisecondorX anyway.
Hi @rstraver, is the effect of a microdeletion indicated as a negative value? If the effect of a microdeletion is -8.54, what does it indicate? For a reliable microdeletion, should the effect be a large negative value?
Assuming you are talking about the effect size, that value would mean it measured 8.54% fewer fragments than it expected to find, which could indicate a microdeletion that is much smaller than the bin, or one that is only found in a subset of the cells analysed (mosaicism, in our case of cell-free DNA).
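To make the arithmetic concrete: under the simplifying assumption of a heterozygous deletion (which removes 50% of the fragments wherever it is present), the observed effect size scales with the fraction of fragments actually carrying the deletion. This is my own simplification, not something WISECONDOR computes:

```python
# Illustrative arithmetic only: assuming a heterozygous deletion removes
# 50% of fragments where present, estimate the fraction of the bin (or of
# the cells / cfDNA) that would need to be affected for a given effect size.
def implied_affected_fraction(effect_pct, per_copy_effect=-50.0):
    return effect_pct / per_copy_effect

# An effect of -8.54% would correspond to roughly 17% of the fragments
# coming from affected material:
print(implied_affected_fraction(-8.54))  # ~0.17
```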
For a significant microdeletion, how large should the effect size be?
I'm afraid that is not within my knowledge; I never aimed to find microdeletions and never tested for them. I suggest you set up some experiments to test the reliability of various thresholds for that.
WISECONDOR mostly uses a z-score threshold instead of an effect size based one, as the effect size may be quite high simply because of a not-so-reliable reference set, which the z-score does take into account. Also, you may find spikes in a few or single bins that often turn out to be meaningless, so be careful with those...
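To illustrate why the z-score is the safer criterion (again just a sketch of mine): the z-score divides the deviation by the spread seen across the reference samples for that bin, so the same apparent effect size is convincing over a stable reference but not over a noisy one:

```python
import statistics

def bin_z_score(observed, reference_values):
    """Deviation of an observed bin value from its reference bins,
    expressed in reference standard deviations."""
    mean = statistics.mean(reference_values)
    stdev = statistics.stdev(reference_values)
    return (observed - mean) / stdev

stable_ref = [1.00, 1.01, 0.99, 1.00, 1.02, 0.98]
noisy_ref  = [1.00, 1.20, 0.80, 1.10, 0.90, 1.05]

# The same ~10% deviation gives a very different z-score:
print(bin_z_score(1.10, stable_ref))  # ~7.1, convincing
print(bin_z_score(1.10, noisy_ref))   # ~0.6, likely noise
```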
Yes, I think that's why I am getting some microdeletions and microduplications (maybe of no real effect) even for the normal samples.
Hi, I have single-end sequenced maternal blood data. I executed WISECONDOR and got a result like this:

```
# BAM information:
# Reads mapped:    4021312
Reads unmapped:    24027
Reads nocoord:     24027
Reads rmdup:       546798
Reads lowqual:     132340

RETRO filtering:
Reads in:       4676255
Reads removed:  1356445
Reads out:      3319810

Z-Score checks:
Z-Score used:   5.08
AvgStdDev:      6.23%
AvgAllStdDev:   31.68%

Test results:
z-score   effect   mbsize   location
   6.19     2.07    80.75   2:34250000-115000000
  -6.26    -6.52    11.25   7:52500000-63750000
   5.14     6.60     7.00   7:75250000-82250000
  -6.86    -3.11    49.50   19:9750000-59250000
```

Does a z-score above the 5.08 threshold mean a duplication, and below the threshold a deletion? What is the effect field for? Does the location indicate the chromosome and start-end base positions? I read that there are different methods, like: single bin, bin test; single bin, aneuploidy test; windowed, bin test; windowed, aneuploidy test; chromosome wide, aneuploidy test. How do I choose between these methods? I executed the steps described in https://github.com/VUmcCGP/wisecondor Please describe the steps to find microdeletions along the chromosomes.
Looking forward to hearing from you. Thanks in advance, Greeshma