anpefi / msap

The package msap provides an easy way to analyse MSAP (Methylation-Sensitive Amplified Polymorphism) data in order to assess epigenetic and genetic differences between groups of samples.

Error in nj(DM_copy) : missing values are not allowed in the distance matrix Consider using njs() #2

Open anpefi opened 8 years ago

anpefi commented 8 years ago

This is issue #6262 from the R-Forge support tracker.

Original comment in R-Forge:

Anonymous message posted by evaughn@email.arizona.edu

I had been running msap successfully with my dataset. I then reduced the number of loci in the dataset (after some data clean up) and now I am getting the following error:

Error in nj(DM_copy) : missing values are not allowed in the distance matrix Consider using njs()

I edited the source code to use njs() but I'm now left with an error that causes my PCoA plotting to fail:

Error in cmdscale(DM, k = length(inds) - 1, eig = T) : NA values not allowed in 'd'

I am assuming that there are values in my distance matrix that don't exist possibly due to the reduction in the size of my data set? Do you have some advice that might help me remedy this problem? I've attached the data file I am running.

Thanks!

anpefi commented 8 years ago

This bug has been reported again by mail. It happens when the dataset yields a very low number of MSL with a high proportion of NAs. In those cases, when the distance matrix is built, some pairs of individuals cannot be compared because at every locus at least one of the two has an NA, which yields an NA in the matrix.
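To make the mechanism concrete, here is a minimal base-R sketch with made-up data. The distance used (proportion of mismatches over loci scored in both individuals) and the helper name `pair_dist` are assumptions for illustration, not necessarily what msap computes internally.

```r
# Toy matrix: 3 individuals (rows) x 4 loci (columns); NA = uninformative state.
# Hypothetical data, not taken from the reporter's file.
m <- rbind(
  ind1 = c(1, NA, 0, NA),
  ind2 = c(NA, 1, NA, 0),
  ind3 = c(1, 0, 1, 0)
)

# Proportion of mismatches over the loci scored in BOTH individuals.
pair_dist <- function(a, b) {
  shared <- !is.na(a) & !is.na(b)
  if (!any(shared)) return(NA_real_)  # no comparable locus: distance undefined
  mean(a[shared] != b[shared])
}

n <- nrow(m)
DM <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    DM[i, j] <- pair_dist(m[i, ], m[j, ])
  }
}

print(DM)  # the ind1/ind2 cell is NA because they share no scored locus
```

With few loci and many NAs, pairs like ind1 and ind2 become more likely, which is why the error shows up after reducing the number of loci.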

By using njs() instead of nj() you can still do the clustering, because it is an algorithm designed for incomplete distance matrices. However, the PCoA cannot be done because, as far as I know, there is no algorithm that can work with missing data.
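A rough sketch of that difference, using a hypothetical 4 x 4 distance matrix with one missing pair. It only relies on `stats::cmdscale()` and, if the ape package is installed, `ape::nj()` / `ape::njs()`; the calls are wrapped in `tryCatch()` so the script runs to the end whatever each function decides to do with this toy input.

```r
# Hypothetical distance matrix with one undefined pair (a, b).
DM <- matrix(
  c(0,   NA,  0.4, 0.6,
    NA,  0,   0.5, 0.3,
    0.4, 0.5, 0,   0.2,
    0.6, 0.3, 0.2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(letters[1:4], letters[1:4])
)

has_na <- anyNA(DM)

# PCoA: cmdscale() refuses any NA in the distances, so capture the error text.
pcoa_result <- tryCatch(
  cmdscale(as.dist(DM), k = 2, eig = TRUE),
  error = function(e) conditionMessage(e)
)

# Clustering: pick njs() when the matrix is incomplete, nj() otherwise.
tree <- NULL
if (requireNamespace("ape", quietly = TRUE)) {
  tree <- tryCatch(
    if (has_na) ape::njs(DM) else ape::nj(DM),
    error = function(e) conditionMessage(e)
  )
}

print(has_na)
print(pcoa_result)
```

Checking `anyNA(DM)` before the analysis makes it possible to skip or replace the PCoA step instead of stopping with an error.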

I need to think about the best way to address this issue and then implement it, which will take some time. It will probably involve using a different distance or some heuristic to assign a distance to uninformative states. Suggestions on this are welcome.

As an alternative workaround: if you can assume that there are no (large) genetic differences between individuals across the whole dataset, then 0/0 patterns are much more likely to be caused by hypermethylation of the target than by a mutation that removes the target, so you can treat them as methylated states (1) instead of missing (NA). In this case (no.bands="h") you can run the full analysis.
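A minimal sketch of that call. The file name and the run name are placeholders, and the guard only exists so the snippet does nothing when msap or the file is not available.

```r
# Assumption: "my_msap_data.csv" is a placeholder for your own data file.
datafile <- "my_msap_data.csv"

if (requireNamespace("msap", quietly = TRUE) && file.exists(datafile)) {
  # no.bands = "h": score 0/0 patterns as methylated (1) instead of NA
  msap::msap(datafile, name = "no_bands_h", no.bands = "h")
} else {
  message("msap or the data file is not available; nothing was run.")
}
```

Keep in mind that this changes the biological interpretation of 0/0 patterns, so it is only appropriate when the assumption above holds for your samples.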

anpefi commented 7 years ago

I've been reminded that another workaround that may work for some datasets is to reduce the probability of NAs in the distances by lowering the threshold used to classify a locus as MSL or NML when it shows discordant patterns (option: error.rate.primer=0). By default, error.rate.primer is set to 0.05 (the typical error rate in AFLPs), but it can be set to any other value, including 0. Loci with very few discordant patterns would then be considered MSL, increasing their number. In some datasets, setting the threshold to 0 (which assumes that there is no error in the banding) works!
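The effect of the threshold can be sketched with made-up counts. This is a simplified illustration of the idea, not msap's actual code: the helper `classify_loci`, the strict `>` comparison, and the numbers are all assumptions.

```r
# Hypothetical counts of samples showing a discordant pattern at each locus.
discordant <- c(locusA = 0, locusB = 1, locusC = 4)
n_samples <- 40

# A locus is called MSL when its share of discordant patterns exceeds the
# tolerated error rate; otherwise it is called NML.
classify_loci <- function(discordant, n, error_rate) {
  ifelse(discordant / n > error_rate, "MSL", "NML")
}

default_calls <- classify_loci(discordant, n_samples, error_rate = 0.05)
strict_calls  <- classify_loci(discordant, n_samples, error_rate = 0)

print(default_calls)  # locusB (1/40 = 0.025) is below 0.05 and stays NML
print(strict_calls)   # with a threshold of 0, locusB is called MSL
```

More loci classified as MSL means more loci available for each pairwise comparison, so fewer pairs end up with no comparable locus.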