bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

Memory issue when running pcadapt with a big data set #51

Closed: shaghayeghsoudi closed this issue 4 years ago

shaghayeghsoudi commented 4 years ago

Hi Florian, I recently decided to run pcadapt on my data. The genome I am working with is very large, so after applying appropriate filtering I still ended up with more than 3 million SNPs. I tried pcadapt, but it failed due to a lack of memory, and I do not have access to big-memory machines at the moment, unfortunately. Alongside pcadapt I was running BayPass, and I noticed it was going to take forever to process such a large data set; the BayPass developer suggested splitting the genome into smaller chunks of, let's say, 10K SNPs and running the analysis as a job array, which worked quite well for me. I wonder whether the same strategy is also appropriate for pcadapt, or if there is a better solution? Thanks, Shaghayegh

privefl commented 4 years ago

Are you using pcadapt v4?

What is the number of individuals you have? What is the size of the .bed file?

You should try using the LD.clumping parameter.
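
(For reference, a minimal sketch of what such a call might look like; the file path is a placeholder, and the size/thr values are just common starting points rather than a recommendation made in this thread.)

```r
library(pcadapt)

# Point pcadapt at the PLINK .bed file (the matching .bim/.fam must sit alongside it).
bedfile <- "mydata.bed"   # hypothetical path
geno <- read.pcadapt(bedfile, type = "bed")

# Run the scan with K principal components, thinning SNPs by LD on the fly.
res <- pcadapt(geno, K = 4, LD.clumping = list(size = 200, thr = 0.1))
```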

shaghayeghsoudi commented 4 years ago

Hey, yes, I am using v4 of pcadapt. I have 132 individuals and 3,123,110 SNPs; the .bed file is 104 MB. I thought about LD thinning, but for now I would prefer to keep all of the SNPs, including those in linkage. I also tried running on smaller 100K-SNP chunks, and the results look very strange. I also looked at the qq-plots and histograms of p-values: for some chunks the histogram is completely bimodal, while for others it looks fine. For BayPass I was estimating the covariance matrix of allele frequencies for each chunk separately and incorporating it into the covariate model. I am assuming that when a full data set is broken up into chunks, the population structure has to be estimated for each chunk, although it should not differ much between chunks. Not sure.
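
(A hedged sketch of how the diagnostics mentioned above, plus a scree plot, can be drawn with pcadapt's plot method; `res` is assumed to be the object returned by pcadapt(), as in the earlier sketch.)

```r
plot(res, option = "screeplot")          # how many PCs capture real structure
plot(res, option = "qqplot")             # inflation/deflation of p-values
hist(res$pvalues, breaks = 50,
     xlab = "p-values", main = "Histogram of pcadapt p-values")
plot(res, option = "stat.distribution")  # distribution of the test statistic
```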

privefl commented 4 years ago

104 MB should really be just fine. How many PCs (K) are you asking for?

privefl commented 4 years ago

Is it running forever? It might be that the decomposition fails, e.g. when there are variants with only missing values.
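
(A minimal base-R sketch for spotting such variants, assuming the genotypes are also available as a plain matrix G with individuals in rows, SNPs in columns, and NA for missing calls; G is a hypothetical object here, not something pcadapt returns.)

```r
# Per-SNP missing rate; G is a hypothetical genotype matrix (individuals x SNPs).
miss_rate <- colMeans(is.na(G))

# Variants with only missing values, which could make the decomposition fail.
all_missing <- which(miss_rate == 1)

# A 90% call-rate filter: keep SNPs with less than 10% missing calls.
G_filtered <- G[, miss_rate < 0.10, drop = FALSE]
```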

shaghayeghsoudi commented 4 years ago

K = 4. It fails and says there is not enough memory. I have filtered on call rate and removed any marker with a call rate below 90%.

privefl commented 4 years ago

Are you using K=4000?

shaghayeghsoudi commented 4 years ago

No, K = 4.

privefl commented 4 years ago

It should take 1 minute tops. Are you allowed to share your data?

shaghayeghsoudi commented 4 years ago

Not quite sure, I need to ask! I will let you know soon.

privefl commented 4 years ago

Then the problem might not be the PCA, but the Mahalanobis distance.

Hum, I've never pushed the latest version of bigutilsr to CRAN. Can you install it with remotes::install_github("privefl/bigutilsr") and try again?
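
(A short sketch of that installation step; install the remotes package first if it is not already available.)

```r
# install.packages("remotes")   # only needed if remotes is not installed yet

# Install the development version of bigutilsr from GitHub, as suggested above.
remotes::install_github("privefl/bigutilsr")
```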

privefl commented 4 years ago

Can you follow up on this issue please?

shaghayeghsoudi commented 4 years ago

Hello Florian, I am sorry for forgetting to follow up on this issue. I noticed there was a problem with the format of my input file, which was the reason for the memory error message. I fixed the input, ran it again, and it worked quite well and very fast. Thank you, Shaghayegh

