NCI-CGR / plco-analysis

Primary workflow for the PLCO "Atlas" project

custom max number of principal components estimated/used #10

Closed lightning-auriga closed 3 years ago

lightning-auriga commented 3 years ago

I've started hearing rumors that the number of principal components used in association may be changed from the current 10 to some larger number. A nice gift to bequeath to my successor would be a configurable parameter for this somewhere in Makefile.config, plus support for that increased ceiling in the model matrix constructor, so that they don't have to deal with an immediate extension buried within makefiles.

lightning-auriga commented 3 years ago

this is probably done, but I'll leave it open until I can test it once.

lightning-auriga commented 3 years ago

Ooh, new problem. The component number is indeed increased, which is great. However, the smallest platform/ancestry combinations (by sample count) now cause smartpca to segfault. They're not beneath the theoretical limit of subjects needed for a 20 PC estimation, so this is probably some sort of quiet heuristic bound I've never run into before. The practical effect is that the Makefile.config parameter SAMPLE_SIZE_MIN must be set somewhat above the parameter SMARTPCA_N_PCS.
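The constraint above (SAMPLE_SIZE_MIN somewhat larger than SMARTPCA_N_PCS) could be checked before a run with something like the following. This is only a minimal sketch, not part of the pipeline: `parse_make_config`, `check_pc_settings`, and the `margin` argument are hypothetical names, and the required margin is an assumption, since the thread does not pin down where the smartpca bound actually sits.

```python
import re


def parse_make_config(text):
    """Collect simple `NAME := value` or `NAME = value` assignments.

    Only handles plain one-line assignments; other make syntax
    (`?=`, `+=`, includes, conditionals) is ignored in this sketch.
    """
    values = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:?=\s*(.*)$", line)
        if match:
            values[match.group(1)] = match.group(2).strip()
    return values


def check_pc_settings(config, margin=1):
    """Return True if SAMPLE_SIZE_MIN is at least SMARTPCA_N_PCS + margin.

    `margin` is a hypothetical safety gap; the value smartpca needs
    in practice is not known from this thread.
    """
    n_pcs = int(config["SMARTPCA_N_PCS"])
    min_samples = int(config["SAMPLE_SIZE_MIN"])
    return min_samples >= n_pcs + margin
```

Usage would be along the lines of `check_pc_settings(parse_make_config(open("Makefile.config").read()), margin=10)`, with the margin chosen by experiment.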

lightning-auriga commented 3 years ago

I'm going to try SAMPLE_SIZE_MIN := 30 for plco and see what happens.

lightning-auriga commented 3 years ago

I unceremoniously declare this handled. PC count is at 20, min sample size per platform/ancestry combo is 50. They are successfully estimated, and downstream methods (primarily construct.model.matrix) intercept them correctly and add them.
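For anyone picking this up later, the "add the first N PCs to the model" step can be pictured roughly as below. This is an illustrative sketch in Python, not the actual construct.model.matrix code: the function names, the `PC1`..`PCn` column naming, and the dictionary-based data layout are all assumptions made for the example.

```python
def pc_column_names(n_pcs, prefix="PC"):
    """Build column names PC1..PCn for a configurable PC count."""
    if n_pcs < 0:
        raise ValueError("n_pcs must be non-negative")
    return [f"{prefix}{i}" for i in range(1, n_pcs + 1)]


def add_pcs(covariates, pcs, n_pcs):
    """Attach the first n_pcs components to each subject's covariates.

    covariates: dict of sample id -> dict of covariate values
    pcs:        dict of sample id -> list of component values
    Raises ValueError if a subject has fewer than n_pcs components.
    """
    names = pc_column_names(n_pcs)
    merged = {}
    for sample_id, row in covariates.items():
        values = pcs.get(sample_id)
        if values is None or len(values) < n_pcs:
            raise ValueError(f"{sample_id}: fewer than {n_pcs} PCs available")
        merged[sample_id] = {**row, **dict(zip(names, values[:n_pcs]))}
    return merged
```

Failing loudly when fewer components are available than requested is a design choice of the sketch, so a raised PC count cannot silently fall back to a smaller model.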