berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License

Incorrect cutoff detection in allele ratios #53

Open vladimirg opened 7 years ago

vladimirg commented 7 years ago

The current algorithm for SNP ratio cutoffs sometimes gives incorrect results. In the attached image, compare Chr1 (which is correctly analyzed) with Chr2, 3, 4 and 7: all appear to have about the same allele ratio histogram, but their ratio assignments are wrong, and inconsistently so. This was observed even for datasets with high peaks (as the peaks for the heterozygous allele group are low).

Presumably, the current algorithm tries to adjust to small possible shifts in allele ratios (the peaks could move a little from their theoretical values). However, this is not very helpful, as it depends strongly on the user-provided ploidy, which can often be incorrect.

We currently propose to divide the allele ratio histogram into ploidy+1 equally-spaced bins, and to assign ratios to SNPs based on the bins they fall into. While that would introduce some noise, it seems the noise would be small and would not interfere much with visual assessment of the segment.
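The proposal above can be sketched roughly as follows. This is only one possible reading of "ploidy+1 equally-spaced bins", not YMAP's actual code: it assumes each SNP is summarized by an allele fraction in [0, 1], splits that range into equal-width bins, and labels bin k as `k:(ploidy-k)`. The function name `assign_allele_ratio` and the label format are hypothetical.

```python
def assign_allele_ratio(allele_fraction, ploidy):
    """Assign a SNP to one of ploidy+1 equal-width bins over [0, 1].

    Hypothetical sketch: allele_fraction is the fraction of reads (or
    signal) supporting one allele; the returned label "k:(ploidy-k)"
    is the copy-number ratio associated with the bin it falls into.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must lie in [0, 1]")

    n_bins = ploidy + 1
    # Equal-width bins; a fraction of exactly 1.0 belongs to the last bin.
    k = min(int(allele_fraction * n_bins), ploidy)
    return f"{k}:{ploidy - k}"


# Example: a tetraploid segment with SNPs near the 1:3 peak.
for fraction in (0.02, 0.25, 0.5, 0.97):
    print(fraction, assign_allele_ratio(fraction, ploidy=4))
```

Note that with equal-width bins the bin centers do not coincide exactly with the theoretical peaks at k/ploidy; assigning each SNP to its nearest theoretical peak would be an alternative reading of the same idea.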

This should be coupled with auto-detection of the baseline ploidy (#52) for maximum effect.

@darrenabbey, what do you think? We're currently diving deep into the algorithm that does the ratio assignments based on the ratio histogram, and it just seems too sensitive, even when the ploidy is correctly set. We even saw behavior where a peak of 1:3 would be flanked to the right by 0:4 (as expected), but would also be flanked to the left by 0:4 (which is clearly a bug). So the question is, what scenarios does the current algorithm cover that the suggestion above won't be able to handle?

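[image: screen shot 2016-09-27 at 17 33 15] https://cloud.githubusercontent.com/assets/1148887/18878623/017ece6a-84da-11e6-8a89-3322ce944f39.png
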
darrenabbey commented 7 years ago

I think it is worthwhile to have a cutoff-determination system with fewer built-in assumptions, like the one you describe, for high-noise datasets like this. I don't think it should always be used, however: an earlier version of the system worked just like your idea, and it was counter-cases that led to it being discarded in favor of the more detailed system in place now.

darrenabbey commented 7 years ago

Perhaps a toggle option during dataset setup? I'm not sure I'd want to try and automatically determine when the data quality would best fit one or the other scheme.

darrenabbey commented 7 years ago

Another thought. This looks like a figure A. Selmecki had asked me about, in that the SNP histograms had the same shape.

After looking at the rest of the figure, it looks like the strain is unrelated to the hapmapped lineage. In such a case, the vast majority of the hapmap loci will be completely, but randomly, homozygous. This would result in the strong spikes at the left and right ends of the subfigures.

darrenabbey commented 7 years ago

Oi. Disregard the comparison to Selmecki's figure. Hers has side peaks at only one end of the subfigure, though it does show the same overall high-noise pattern.

vladimirg commented 7 years ago

Do you perhaps remember the counter-cases that led to the development of the algorithm currently in place? That would greatly aid our understanding of the problem and possible solutions.

Regarding Anna's strains, the fact that they do not come from the hapmapped lineage is an interesting point. In general, analyzing clinical strains is an open question I haven't quite figured out yet (a short while ago we looked at clinical glabrata strains with seemingly random colors, and I had no idea how to interpret that). For example, what happens if we see a SNP not in the hapmap? Do we color it grey (if it's 1:1)? Red? Would it be worthwhile to create a SNP-only analysis of such strains, to show just the heterozygous regions? Would it be worthwhile to create a "phylogenetic closeness" metric of some sort, to alert users that the hapmap may not be informative?
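As a rough illustration of what such a "phylogenetic closeness" metric might look like, here is a minimal sketch that measures how many hapmap loci are also heterozygous in the analyzed strain. Everything in it is an assumption made for illustration: the function name `hapmap_het_overlap`, the representation of loci as `(chromosome, position)` tuples, and the choice of a simple fraction as the score. Nothing like this is confirmed to exist in YMAP.

```python
def hapmap_het_overlap(hapmap_loci, sample_het_loci):
    """Fraction of hapmap loci that are heterozygous in the sample.

    Hypothetical sketch: both arguments are iterables of
    (chromosome, position) tuples. A value near 0 suggests the sample
    shares few heterozygous loci with the hapmapped lineage, so the
    hapmap coloring may not be informative for it.
    """
    hapmap = set(hapmap_loci)
    if not hapmap:
        raise ValueError("hapmap contains no loci")
    shared = hapmap & set(sample_het_loci)
    return len(shared) / len(hapmap)


# Example with made-up loci: only one of four hapmap loci is shared.
hapmap = [("Chr1", 100), ("Chr1", 200), ("Chr2", 50), ("Chr2", 75)]
sample = [("Chr1", 100), ("Chr3", 10)]
print(hapmap_het_overlap(hapmap, sample))
```

Where to put a warning threshold on such a score is left open here; a strain from the hapmapped lineage that has undergone extensive LOH would also score low, so the number alone would not separate the two situations.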

darrenabbey commented 7 years ago

The counter-cases all had SNP ratio profiles with a small het peak and a large hom peak. With two adjacent Gaussians of different heights, the crossover point of equal likelihood is shifted away from the large peak. Defining the ratio cutoffs by the simple method ends up scoring a chunk of the hom data as het, which results in a figure where a hom region looks somewhere between hom and het.
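The shift of the crossover point can be worked out explicitly under a simplifying assumption that is not stated in the thread: both peaks are Gaussians with the same width sigma. Setting the two weighted densities equal and solving for x gives x = (mu_a + mu_b)/2 + sigma^2 * ln(w_a / w_b) / (mu_b - mu_a). The sketch below uses hypothetical names and made-up numbers purely to illustrate the effect.

```python
import math


def gaussian_crossover(mu_a, w_a, mu_b, w_b, sigma):
    """Point where two equal-width weighted Gaussians have equal density.

    Assumption for this sketch: both peaks share the same sigma.
    w_a and w_b are the positive weights (areas) of the peaks centered
    at mu_a and mu_b.
    """
    if mu_a == mu_b:
        raise ValueError("peak centers must differ")
    if w_a <= 0 or w_b <= 0 or sigma <= 0:
        raise ValueError("weights and sigma must be positive")
    midpoint = (mu_a + mu_b) / 2.0
    # A heavier peak A pushes the crossover toward B, away from A.
    shift = sigma ** 2 * math.log(w_a / w_b) / (mu_b - mu_a)
    return midpoint + shift


# Illustrative numbers: a large hom peak at 0.0 and a het peak at 0.5
# carrying one tenth of the weight.
equal = gaussian_crossover(0.0, 1.0, 0.5, 1.0, sigma=0.05)
skewed = gaussian_crossover(0.0, 10.0, 0.5, 1.0, sigma=0.05)
print(equal, skewed)
```

With equal weights the crossover sits at the midpoint between the peaks; with the heavier hom peak it moves toward the het peak. A fixed cutoff that sits closer to the hom peak than this crossover would therefore label part of the hom tail as het, which matches the behavior described above.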

Generally, the hapmap derived for SC5314 is only useful for strains derived from SC5314. Any random isolate of the same species is unlikely to be a close enough relation to have significant overlap of heterozygous loci. At least, I've never seen one.

For such strains, it definitely makes sense to do a SNP only analysis if there isn't a reasonable parent dataset for comparison.

I think being clear that a hapmap is only meaningful with descendants in the same lineage is important. Automatically determining whether the hapmap is useful in the analysis of a dataset might be difficult. Even if we had a good way to call it, I'd prefer it left to user choice.

You have a C. glabrata hapmap now? Cool!

vladimirg commented 7 years ago

Regarding the counter-cases, would a strain which has mostly undergone LOH but still has a few het regions left match that description?

darrenabbey commented 7 years ago

That was basically the scenario where I first noticed it. I don't recall which strain, but it had one chromosome mostly homozygosed, with a residual small region still heterozygous.
