NCI-CGR / plco-analysis

Primary workflow for the PLCO "Atlas" project

project agnostic nomenclature/inputs #9

Open lightning-auriga opened 3 years ago

lightning-auriga commented 3 years ago

this project was designed specifically for the PLCO "Atlas" project. later developments have indicated that this code may need to be used for other projects. some of the "PLCO" nomenclature is baked into the pipelines (removing it was initially planned for an early project milestone, but it was dropped from the project plan after those milestones were scrapped by superiors).

lightning-auriga commented 3 years ago

as I go through the pipeline documenting things, I'm going to flag areas where I see "PLCO" nomenclature baked into the code, and list some simple suggestions for fixing them. I may or may not be the one to go through and apply these changes.

lightning-auriga commented 3 years ago

relatedness pipeline: output files are prefixed "PLCO_{chip}", and the combined (cross-platform) output is prefixed simply "PLCO"

suggestions: the cross-platform file isn't actually used downstream, and can be removed (though obviously there are plenty of applications in which a cross-platform relatedness file would be useful). The "PLCO_{chip}" prefixes can be modified to "{chip}" alone, as that should still be unique. Downstream recipients of these output files (specifically in the ancestry pipeline) would need to be updated accordingly.

estimated severity/ease of fix: very straightforward

lightning-auriga commented 3 years ago

ancestry pipeline: receives input from relatedness pipeline that's tagged "PLCO_{chip}" (see above). Has hard-coded rules for handling ancestry exceptions on specific chips, which is really bad (don't know what I was thinking there, probably "go real fast"). As with relatedness pipeline, prefixes output with "PLCO" or "PLCO_{chip}"

suggestions: the cross-platform file isn't actually used downstream (but is used for make ancestry-check, so that needs fixing), and it also relies unconditionally on the corresponding relatedness file, which isn't guaranteed to be there; thus, the cross-platform file absolutely should be removed from this pipeline, or allowed to be conditionally present. The same solution re: prefixes as was suggested for relatedness can be applied here. The chip-specific rules for overriding subject ancestry calls per platform are really bad. The override calls are manually determined externally to the pipeline. As such, you'll probably need a special override file provided in Makefile.config with subject IDs, target platform, and desired ancestry override. Once this is done, the per-chip rules can be removed, and the general rule can handle all platforms again.

estimated severity/ease of fix: annoying. I may try to push this one through myself to save the next person some PITA
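
A minimal sketch of how such an override file could be consumed, not what the pipeline currently does. The three-column tab-delimited layout (subject ID, platform, ancestry), the function names, and the ancestry labels and the "OtherChip" platform are all assumptions made up for illustration; only "Omni5" comes from this thread.

```python
import csv
import io


def load_overrides(handle):
    """Read tab-delimited rows (subject_id, platform, ancestry) into a dict.

    The column layout is hypothetical; lines starting with '#' are skipped.
    """
    overrides = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        subject_id, platform, ancestry = row
        overrides[(subject_id, platform)] = ancestry
    return overrides


def apply_overrides(calls, platform, overrides):
    """Return a copy of calls (subject_id -> ancestry) with this platform's overrides applied."""
    return {
        subject: overrides.get((subject, platform), ancestry)
        for subject, ancestry in calls.items()
    }


# Illustrative data only: subject IDs and ancestry labels are invented.
example_file = io.StringIO(
    "#subject_id\tplatform\tancestry\n"
    "S001\tOmni5\tEuropean\n"
    "S002\tOtherChip\tEast_Asian\n"
)
overrides = load_overrides(example_file)
calls = {"S001": "African_American", "S002": "European"}
updated = apply_overrides(calls, "Omni5", overrides)
print(updated)
```

Keying the lookup on (subject, platform) means one general rule can serve every chip: an override written for one platform never leaks into another platform's calls.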

lightning-auriga commented 3 years ago

bgen pipeline: bgen/Makefile deals with the lack of non-redundant subjects in PLCO/Omni5 by manually excluding the Omni5 platform from the Makefile.config variable PLATFORMS. that's not good. I don't know off the top of my head (this is a really old pipeline) whether this exclusion is actually mandatory, or if the pipeline will just deal with the fact that there's nothing present for Omni5. That appears to be the only issue

suggestions: it's worth testing whether the pipeline will work ok if the Omni5 exclusion (in the definition of UNIQUE_PLATFORMS) is just removed. If it doesn't, then the next best option will be to have it do some sort of foreach operation that checks the platforms to be sure there are at least some files present in the input.

estimated severity/ease of fix: pretty simple in either case
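
The fallback check can be sketched roughly as below: keep a platform only if at least one input file exists for it. The one-directory-per-platform layout, the function name, and the "ChipA" platform and file name are assumptions for illustration, not the pipeline's actual structure.

```python
import tempfile
from pathlib import Path


def platforms_with_input(input_root, platforms, pattern="*"):
    """Keep only platforms whose directory under input_root holds at least one matching file.

    Assumes a hypothetical layout of input_root/<platform>/<files>.
    """
    kept = []
    for platform in platforms:
        platform_dir = Path(input_root) / platform
        if platform_dir.is_dir() and any(
            p.is_file() for p in platform_dir.glob(pattern)
        ):
            kept.append(platform)
    return kept


# Demo: "ChipA" has one input file, "Omni5" has an empty directory.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "ChipA").mkdir()
    (Path(root) / "ChipA" / "chr1.dat").touch()
    (Path(root) / "Omni5").mkdir()
    result = platforms_with_input(root, ["ChipA", "Omni5"])
    print(result)
```

Inside the Makefile itself, the same idea would presumably combine GNU Make's `foreach`, `wildcard`, and `if` functions, so that the platform list is derived from what is on disk rather than from a hard-coded exclusion.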

lightning-auriga commented 3 years ago

construct.model.matrix: this is tough. i'm just going to patch that manually now.

estimated severity/ease of fix: i got you fam

lightning-auriga commented 3 years ago

the last big problem is the very initial input: everything is expecting "PLCO_{chipname}.{bed,bim,fam}". This applies wherever the pipelines touch the inputs: relatedness/, ancestry/, cleaned-chips-by-ancestry/

the solution, and I'm just out of time but this is quite easy, is to have an initial process symlink inputs to standardized names, and then have the pipelines pull from those instead. easy, as I said, and also probably yaml-able, but just takes time i no longer have.
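
A rough sketch of that initial symlink step, under stated assumptions: the mapping of chip name to source file prefix would come from some config (e.g. yaml), and the "{chip}.{ext}" target naming and the function name are hypothetical choices, not something the pipeline defines.

```python
import tempfile
from pathlib import Path

PLINK_EXTENSIONS = ("bed", "bim", "fam")


def link_standardized_inputs(mapping, output_dir):
    """Symlink each dataset's .bed/.bim/.fam files to {chip}.{ext} in output_dir.

    mapping: chip name -> source path prefix (without extension); hypothetical config shape.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for chip, source_prefix in mapping.items():
        for ext in PLINK_EXTENSIONS:
            source = Path(f"{source_prefix}.{ext}")
            if not source.is_file():
                raise FileNotFoundError(source)
            link = output_dir / f"{chip}.{ext}"
            # Replace any stale link so the step can be re-run safely.
            if link.is_symlink() or link.exists():
                link.unlink()
            link.symlink_to(source.resolve())
            created.append(link)
    return created


# Demo: project-specific "PLCO_Omni5.*" inputs exposed under the neutral name "Omni5.*".
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for ext in PLINK_EXTENSIONS:
        (raw / f"PLCO_Omni5.{ext}").touch()
    links = link_standardized_inputs(
        {"Omni5": raw / "PLCO_Omni5"}, Path(tmp) / "standardized"
    )
    linked_names = sorted(link.name for link in links)
    targets_ok = all(link.resolve().name == f"PLCO_{link.name}" for link in links)
    print(linked_names)
```

With something like this in place, relatedness/, ancestry/, and cleaned-chips-by-ancestry/ would only ever read from the standardized directory, and the project-specific names would live in one config mapping.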