bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

PCADAPT error when higher K - can't compute SVD #84

Closed GabryS3 closed 7 months ago

GabryS3 commented 7 months ago

Dear Florian, I am just exploring now for the first time your package pcadapt.

It seemed to run perfectly until I wanted to retain up to K=150, but if I try to retain 200 PCs or above, it gives me the following error:

< Error: Can't compute SVD. Are there SNPs or individuals with missing values only? You should use PLINK for proper data quality control.

I don’t quite understand this message: my dataset was filtered with locus CR >0.97, Reproducibility >0.98, a good read depth cutoff eliminating the tail of the distribution and MAF>0.05. Individual CR is >0.98. So missingness is quite low in general. In addition, prior to converting the genlight dataset (package dartR) into a lfmm file format for pcadapt, I removed all NAs in the dataset (with a dartR function which deletes loci or individuals with all calls missing, that is all NAs). Could you help me understand what is the problem?

The reason why I wanted to run pcadapt retaining a higher number of PCs is that in my screeplot the curve of PCs keeps descending smoothly without a clear cutoff point/drop where the PCs start to seem to be explaining random noise (no horizontal line on the x axis). So I cannot use Cattel’s rule to chose the number of PCs to retain, and I thought that if I increase the number of PCs retained I may be able to find the cutoff point. Do you think that, given that the variance explained by the PCs decreases slowly without a clear drop in the curve, is it helpful if I try run pcadapt retaining way more PCs?

I would appreciate any feedback on this.

Thank you for your kind assistance. Best, Gabriella

privefl commented 7 months ago
GabryS3 commented 7 months ago

Dear Florian, thank you for your reply.

See my replies in red to your questions below:

Il giorno gio 4 gen 2024 alle ore 17:34 Florian Privé < @.***> ha scritto:

  • What is the size of your dataset?

n Loci = 9529, n Indiv = 182

  • Which packageVersion(pcadapt) do you have?

packageVersion("pcadapt") # ‘4.3.3’

  • Could you show the screeplot? (just drag&drop the figure here)

    PCADAPT # 1 - kept K=100 [image: image.png] PCADAPT #2 - kept K=50 --> plot K=50

[image: b8c9d61a-ebf6-4f0b-94c4-a2c9f6f29f28.png]

PCADAPT #2 - kept K=50 --> plot K=10 [image: image.png] Also, the Manhattan plot changes quite a bit if I retain 50, 10 or 2 K in pcadapt, showing a more homogeneous distribution of values/dots, with fewer dots standing far away from the rest. With K=50, I get 96 outliers with BH-FDR and 7 with Bonferroni, while I get 0 outliers with K=10 or K=2 (in both BH and Bonferroni). Plus, in the QQplot, the expected p-values get much closer to the observed p-values gradually from K=50 to K=10, K=2. And the distribution of the D statistics is similar in K=50 and K=10, resembling a normal distribution with a long right tail, while in K=2 the distribution looks like a chi-square distribution starting at 0 with a right skewed distribution of values (right tail only).

Does this mean I should be keeping K=2 only?

Thank you so much for your feedback. Best, Gabriella

— Reply to this email directly, view it on GitHub https://github.com/bcm-uga/pcadapt/issues/84#issuecomment-1876611775, or unsubscribe https://github.com/notifications/unsubscribe-auth/A2BDRZC2VVL5LIMTFTHERITYMZLRTAVCNFSM6AAAAABBMHWLMCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZWGYYTCNZXGU . You are receiving this because you authored the thread.Message ID: @.***>

privefl commented 7 months ago

I do not see the figures. I think you should reply on GitHub instead of in the email directly.

GabryS3 commented 7 months ago

Dear Florian, thank you for your reply.

See my replies in bold to your questions below:

What is the size of your dataset? n Loci = 9529, n Indiv = 182 Which packageVersion(pcadapt) do you have? packageVersion("pcadapt") # ‘4.3.3’ Could you show the screeplot? (just drag&drop the figure here) PCADAPT # 1 - kept K=100

image

PCADAPT #2 - kept K=50 --> plot K=50

image

PCADAPT #2 - kept K=50 --> plot K=10

image

Also, the Manhattan plot changes quite a bit if I retain 50, 10 or 2 K in pcadapt, showing a more homogeneous distribution of values/dots, with fewer dots standing far away from the rest. With K=50, I get 96 outliers with BH-FDR and 7 with Bonferroni, while I get 0 outliers with K=10 or K=2 (in both BH and Bonferroni). Plus, in the QQplot, the expected p-values get much closer to the observed p-values gradually from K=50 to K=10, K=2. And the distribution of the D statistics is similar in K=50 and K=10, resembling a normal distribution with a long right tail, while in K=2 the distribution looks like a chi-square distribution starting at 0 with a right skewed distribution of values (right tail only).

Does this mean I should be keeping K=2 only?

Thank you so much for your feedback. Please let me know if you can see the figures now.. Best, Gabriella

privefl commented 7 months ago

Ok, the main issue here is that you ask for K=200 PCs, but the minimum dimension is 182, so that's not possible. In the latest version of the package (which is on CRAN), you should get a more informative error about this.

From the screeplots you show, I would pick either K=1 or K=5, or something close. To chose K, I would visualize first PC scores (say k=1:8) and decide where there seems to be some population structure (it is even better if you know the different populations, and can color PC scores by them).

GabryS3 commented 7 months ago

Oh, ok thanks. I thought there should be as many PCs as there are SNPs. Yes, thank you. I did plot by various PCs and color by population and no clustering/pattern shows up in the data. So, I am assuming it's ok for me to chose K=2. Thank you again! That was really helpful! Best, Gabriella