Error when running conStruct analysis

cjenck commented 5 years ago

Hello Gideon,

I am running into an issue when I'm finally ready to begin the conStruct analysis. I successfully ran structure2conStruct to generate the allele frequency matrix and I have both matrices with matching row numbers for the geographic sampling coordinates as well as the geographic distances (calculated with rdist.earth). However, when I run the complete function, I receive the following error:

"Error in process.freq.data(freqs) :

After dropping invariant loci, one or more pairs of samples have no genotyped loci in common, so relatedness between them cannot be assessed."

Have you come across this before? Any tips?

Thank you so much!

Clara

gbradburd commented 5 years ago

Hi Clara,

What's happening is that conStruct is reading in the allele frequency data you've generated with structure2conStruct and trying to generate the allelic covariance matrix, which describes relatedness between all pairs of samples. To get there, the function process.freq.data first drops all loci that are invariant in the set of samples (i.e., not polymorphic). Then it drops all loci that have missing data across all samples (this is uncommon, but could happen if you're running an analysis on a subset of a larger dataset, and the subset happens to have missing data at the locus). After those two steps, process.freq.data tries to calculate the allelic covariance (relatedness) between each pair of samples. If there's a pair of samples that have no genotyped loci in common, e.g.,

geno1 <- c(0.5,0.5,0.5,0.5, NA, NA, NA, NA)
geno2 <- c(NA, NA, NA, NA, 0.5,0.5,0.5,0.5)

the allelic covariance can't be assessed. In that case, the function throws an error with the message you saw.

W/r/t how to proceed, I'd suggest going back to your genetic data. Are there lots of invariant sites? Are there lots of missing data? Or a few individuals who have really high amounts of missing data? Hopefully this is a problem that is restricted to a single pair of samples, and you can get rid of it by dropping the samples with the most missing data. If you're having trouble diagnosing the problem, please reply on this thread and attach an R object (.Robj or .Rdata) of the allele frequency data generated by structure2conStruct and I'll take a look.

Hope that helps! -Gideon

cjenck commented 5 years ago

Hi Gideon!

Thank you for your quick response. I went back upstream and re-did some VCF filtering more stringently to remove individuals with a lot of missing data to see if that might help, and now I run into a new error running the conStruct analysis:

"Error in pos.def.check(obsCov) :

The sample covariance is not positive definite. Check to make sure that none of your samples are identical (after dropping missing data). If that does not fix the problem, try dropping the loci or samples with the most missing data."

I tried attaching an R object but I don't think that GitHub allows for the file type? However, I did try converting the matrix to a .txt file and I saw a lot of missing data, which is likely the issue, correct?

Thank you for your help!

Clara

gbradburd commented 5 years ago

Ah yes sorry github doesn't support .RData or .Robj, but it does support compressed files (.zip/.gz), so you should be able to attach the compressed R object, or just attach it as a .txt file if that's easier.

Re: the positive definite issues - the allelic covariance matrix can be non-positive definite for a number of reasons. The most common is that you have a pair of individuals who are identical at all genotyped loci (this can be easily obscured by missing data). You can check to see whether that's the issue by dropping all missing data and checking to make sure you don't have identical individuals. You can also get this issue if there is linear dependency between your samples; e.g., if you have 1 individual homozygous for the wild type allele at every locus, 1 individual homozygous for the alternate allele at all loci, and 1 individual who is heterozygous at all loci, the heterozygote individual's allele frequencies will be a perfect linear combination of each of the 2 homozygous individuals, which can induce non-positive definiteness. That's just a cartoon example, and in reality, that linear dependency can be generated by combinations of more than 2 samples.

My recommendation would be to look for that kind of dependency in your data (possibly easy to spot with an PCA?), and try dropping one of the problem individuals until your covariance matrix is valid. Alternatively, you could start with the full dataset, and loop through individuals, dropping each one in turn, to see if you can spot the problem individual(s). I'd also try dropping loci for which you have lots of missing data.

cjenck commented 5 years ago

Hello again,

I retried filtering today, removing almost half of my samples and I still run into the same error. I attached this most recent allele freq matrix, compressed this time, if you're interested (I'm fairly new to R so let me know if you can't read it properly/I did it incorrectly!). I will try your other suggestions above with the help of one of my committee members, Erica Larson. She says hello :).

Thank you!

Clara

conStruct_data.rdata.zip

gbradburd commented 5 years ago

So, when I read in your data I see 199 samples. The first 198 samples have entries (either genotypes or missing data) for 17997 loci, while the 199th has entires for only 14017 loci. If I drop the 199th sample and try to run conStruct on the remaining 198 samples, it appears to run fine. Could that be your problem?

Also, just a note for dealing with data objects in R - it looks like you saved your data matrix file using cat or write.table or something, but named it as an .rdata file. In the future, you can save the .rdata file using the command save, and then load it back into R with load. But, if you try to use load on the data file you attached, it spits an error, because R thinks it's an .RData file, and it's actually a text file.

And hi to Larson!!!

gbradburd commented 5 years ago

Hi Clara,

Just doing some book-keeping - do you want me to keep this issue open, or should I mark it as resolved?

gbradburd commented 5 years ago

Hi Clara,

I'm marking this as resolved, but if the issue comes up again, please reopen!

davidcannatella commented 3 years ago

Gideon, I seem to be having the same issue. The error message is "Error in pos.def.check(obsCov) :

The sample covariance is not positive definite. Check to make sure that none of your samples are identical (after dropping missing data). If that does not fix the problem, try dropping the loci or samples with the most missing data."

The matrix has 70 rows (individuals) and 453 loci. There is no missing data, and there are no duplicate rows (checked with distinct()). There are no populations, and the conStruct docs says that a row can be a single sample.

What is unusual is that all allele frequencies (i.e., every cell in matrix) is either 1 or 0. In other words, the allele is fixed in every case. (I can explain why if that matters). In other words, the matrix is at the extreme end of possibilities.

Any suggestions as to how to get past this problem?

Best, David Cannatella

petrelharp commented 3 years ago

Hm, interesting. How many distinct columns are there?

davidcannatella commented 3 years ago

Peter, Distinct as in unique?

Dave

On Thu, Dec 3, 2020 at 10:21 AM Peter Ralph notifications@github.com wrote:

Hm, interesting. How many distinct columns are there?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/gbradburd/conStruct/issues/20#issuecomment-738115114, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAU36EG5PDDROCR6QDGYELSS63IZANCNFSM4HQHIETA .

This message is from an external sender. Learn more about why this matters. https://ut.service-now.com/sp?id=kb_article&number=KB0011401

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

davidcannatella commented 3 years ago

Assuming that meant unique columns, I found that there are 180 unique columns out of 453. Do you think that is the source of the problem? I could remove the dups, but I think that would affect calculation of distances and Fst, etc.

Best, Dave

On Thu, Dec 3, 2020 at 10:58 AM David Cannatella catfish@utexas.edu wrote:

Peter, Distinct as in unique?

Dave

On Thu, Dec 3, 2020 at 10:21 AM Peter Ralph notifications@github.com wrote:

Hm, interesting. How many distinct columns are there?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/gbradburd/conStruct/issues/20#issuecomment-738115114, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAU36EG5PDDROCR6QDGYELSS63IZANCNFSM4HQHIETA .

This message is from an external sender. Learn more about why this matters. https://ut.service-now.com/sp?id=kb_article&number=KB0011401

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

davidcannatella commented 3 years ago

Followup: I re-ran the matrix after removing duplicate columns, but the error message is the same: "Error in pos.def.check(obsCov) : The sample covariance is not positive definite. Check to make sure that none of your samples are identical (after dropping missing data). If that does not fix the problem, try dropping the loci or samples with the most missing data."

There is no missing data...

I guess I could try combining some individuals into populations, which should yield allele frequencies other than 0 and 1. But if you have any ideas about why a 0/1 matrix throws this error, please let me know.

Best, Dave

On Thu, Dec 3, 2020 at 11:11 AM David Cannatella catfish@utexas.edu wrote:

Assuming that meant unique columns, I found that there are 180 unique columns out of 453. Do you think that is the source of the problem? I could remove the dups, but I think that would affect calculation of distances and Fst, etc.

Best, Dave

On Thu, Dec 3, 2020 at 10:58 AM David Cannatella catfish@utexas.edu wrote:

Peter, Distinct as in unique?

Dave

On Thu, Dec 3, 2020 at 10:21 AM Peter Ralph notifications@github.com wrote:

Hm, interesting. How many distinct columns are there?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/gbradburd/conStruct/issues/20#issuecomment-738115114, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAU36EG5PDDROCR6QDGYELSS63IZANCNFSM4HQHIETA .

This message is from an external sender. Learn more about why this matters. https://ut.service-now.com/sp?id=kb_article&number=KB0011401

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

petrelharp commented 3 years ago

Right: 180 is still greater than 70, so it's not the source of the problem, and dropping duplicates does affect the calculations. (It does tell you that there's a fair bit of nonindependence between loci, though.) But still, 180 is a lot less than 453, so maybe we should investigate further: what's the rank of the genotype matrix? (use e.g. the rankMatrix( ) function in Matrixcalc) It seems like the rank is less than 70, in which case the next thing I'd do is to look into the svd( ) to figure out where the problem is coming from. If it is less than 70, then perhaps you could post the result of

s <- svd(X) # X is the genotype matrix
s$u[, abs(s$d) < 1e-6]  # linear combinations of genotypes with zero (very small) covariance
s$v[, abs(s$d) < 1e-6]  # the associated genotype patterns

It might be long; no worries.

davidcannatella commented 3 years ago

Peter, The rank of the matrix with or without the duplicate loci is 70. Any ideas on what I should try next?

As a workaround, I could combine some of the individuals into populations, but I'm intrinsically curious as to what a matrix of entirely 0s and 1s doesn't work.

Dave

On Thu, Dec 3, 2020 at 11:38 AM Peter Ralph notifications@github.com wrote:

Right: 180 is still greater than 70, so it's not the source of the problem, and dropping duplicates does affect the calculations. (It does tell you that there's a fair bit of nonindependence between loci, though.) But still, 180 is a lot less than 453, so maybe we should investigate further: what's the rank of the genotype matrix? (use e.g. the rankMatrix( ) function in Matrixcalc) It seems like the rank is less than 70, in which case the next thing I'd do is to look into the svd( ) to figure out where the problem is coming from. If it is less than 70, then perhaps you could post the result of

s <- svd(X) # X is the genotype matrix s$u[, abs(s$d) < 1e-6] # linear combinations of genotypes with zero (very small) covariance s$v[, abs(s$d) < 1e-6] # the associated genotype patterns

It might be long; no worries.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/gbradburd/conStruct/issues/20#issuecomment-738168358, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAU36B47MZUTWI6WPIRJKTSS7EIPANCNFSM4HQHIETA .

This message is from an external sender. Learn more about why this matters. https://ut.service-now.com/sp?id=kb_article&number=KB0011401

-- David Cannatella Assoc. Director, Biodiversity Center Department of Integrative Biology Patterson Labs, 2415 Speedway University of Texas, Austin, Texas 78712 www.cannatellalab.org 512.453.1620 cell

petrelharp commented 3 years ago

Hm. The problem shouldn't be coming from being purely 0/1. Could you tell me what min(abs(svd(X)$d)) is? And/or email me the dataset to have a look at?

davidcannatella commented 3 years ago

Peter, thanks so much for your help.

min(abs(svd(X)$d)) is 0.4406473 with all loci and 0.4368076 with duplicate loci removed.

I've attached the datafile. The first line should be ignored. I've also attached the script.

Thanks for the help.

Dave

petrelharp commented 3 years ago

Those files didn't go through (well, only the first 16 rows) - I think since you're emailing to a github issue. You could navigate to github and attach the files there, or else email me directly.

davidcannatella commented 3 years ago

Two zipped files attached!

Archive.zip

gbradburd / conStruct

Error when running conStruct analysis #20