Rethink and simplify program design

Lipastomies commented 3 years ago

Currently the program is divided into approximately four steps:

1. Group variants, add CS information.
   a. But also add phenotype information, and also crash on invalid phenoinfo data. Yikes.
2. Add annotations, creating a large dataframe.
3. Compare against GWAS catalog, which is an annotation of sorts.
4. Create top report.
   a. This part has no knowledge of the annotations, but it selects some of the columns from some of them, so it has to know their names. That means orchestrating between report and top report creation. Currently this is hard-coded, which means a lot of fiddling, typos and thinking mistakes whenever either this step or the annotations change.

I recently thought about this, and there are at least a few things that could be done differently:

1. Separate phenotype information into its own kind of annotation and move it to the annotation step. Limit grouping as a step to the minimal set of operations needed, mostly so that debugging is clearer.
2. Abstract away the annotations, e.g. behind an AnnotationDAO or whatever. Then ensure that they are handled correctly. This would simplify the code quite a bit, AND make it easy to add different annotations - just write a DAO for it and add it to the args.
3. Use these abstracted annotations in creating the top report. For example, information on whether an annotation is included in the top report could be added to the DAO, e.g. by adding a get_top_report_annotation() method to the AnnotationDAO. Some parts do not generalise that well, like novelty reporting and functional variant aggregation, but that I can live with.
4. Having these annotations as an abstract DAO would also make it possible to stop caring about the implementation details - I imagine it would be relatively simple to just deal with a container of AnnotationDAOs, add them to the top report as their implementation defines, and maybe have a separate container/some sort of filter for the aggregated data. But we'll see.
5. Instead of eagerly joining all of the annotations to the grouped data as soon as possible, maybe let them live as annotation objects that can be merged into the grouped data when we want to create the report or top report. This could make data processing simpler.
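
To make the ideas above a bit more concrete, here is a minimal sketch of what such an abstraction might look like. Everything in it is an assumption made for illustration: only the names AnnotationDAO and get_top_report_annotation() come from the text above, while the load() method, the StaticAnnotation class, the column names and the dict-based rows are hypothetical (the actual program works on dataframes).

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]


class AnnotationDAO(ABC):
    """Hypothetical interface: one annotation source keyed by variant id."""

    @abstractmethod
    def load(self, variant_ids: List[str]) -> Dict[str, Row]:
        """Return annotation columns for each variant id that has data."""

    @abstractmethod
    def get_top_report_annotation(self) -> List[str]:
        """Return the names of the columns that belong in the top report."""


class StaticAnnotation(AnnotationDAO):
    """Toy DAO backed by an in-memory dict, standing in for a real resource."""

    def __init__(self, data: Dict[str, Row], top_columns: List[str]):
        self._data = data
        self._top_columns = top_columns

    def load(self, variant_ids):
        return {v: self._data[v] for v in variant_ids if v in self._data}

    def get_top_report_annotation(self):
        return list(self._top_columns)


def merge_annotations(grouped: List[Row], daos: List[AnnotationDAO]) -> List[Row]:
    """Join annotations onto grouped rows only when a report is built."""
    ids = [str(row["variant"]) for row in grouped]
    loaded = [dao.load(ids) for dao in daos]
    merged = []
    for row in grouped:
        out = dict(row)  # copy, so the grouped data itself stays untouched
        for annotation in loaded:
            out.update(annotation.get(str(row["variant"]), {}))
        merged.append(out)
    return merged


def top_report_columns(daos: List[AnnotationDAO]) -> List[str]:
    """Collect top report columns from the DAOs instead of hard-coding them."""
    return [col for dao in daos for col in dao.get_top_report_annotation()]


# Made-up example data: one grouped variant and two annotation sources.
grouped = [{"variant": "chr1_100_A_T", "locus_id": "chr1_100_A_T"}]
daos = [
    StaticAnnotation({"chr1_100_A_T": {"gene": "GENE1", "consequence": "missense"}}, ["gene"]),
    StaticAnnotation({"chr1_100_A_T": {"trait": "example trait"}}, ["trait"]),
]
report = merge_annotations(grouped, daos)
top_columns = top_report_columns(daos)
```

In this sketch the top report step never needs to know column names up front: it asks each DAO, so adding an annotation would not require touching the top report code.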

So the program structure would be three steps after the changes:

These steps IMO pave the way for some more ambitious changes:

1. Switch to a config file from args. This would integrate quite well with having most data resources as DAOs, I think. I imagine something like pheweb's DAO config would be the easiest. Using command-line args gets old fast when there are more than 20 flags and options that need to be set.
2. Use actual multithreading to load the annotations. Since they don't depend on each other, it should be trivial to load them concurrently using multiprocessing and a thread pool. This would be nice since e.g. the GWAS catalog annotation is 99% waiting for a request to go through. Other annotations are single-threaded and most likely processor-bound, so having many of them run at the same time should be no problem.
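
A rough sketch of the concurrent loading idea, using the standard library's concurrent.futures.ThreadPoolExecutor. The loader functions, their names and the returned data are hypothetical placeholders; the sleeps only stand in for waiting on a request or on file parsing.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def load_all(loaders: Dict[str, Callable[[], dict]], max_workers: int = 4) -> Dict[str, dict]:
    """Run independent annotation loaders concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every loader first so they all start before we wait on any of them.
        futures = {name: pool.submit(loader) for name, loader in loaders.items()}
        # result() blocks until the loader finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


def slow_catalog_lookup() -> dict:
    """Hypothetical loader: stands in for waiting on an HTTP request."""
    time.sleep(0.2)
    return {"chr1_100_A_T": {"trait": "example trait"}}


def local_annotation() -> dict:
    """Hypothetical loader: stands in for reading a local annotation file."""
    time.sleep(0.2)
    return {"chr1_100_A_T": {"gene": "GENE1"}}


annotations = load_all({"catalog": slow_catalog_lookup, "local": local_annotation})
```

Threads like these mainly help with loaders that spend their time waiting, such as the catalog request. For processor-bound loaders in CPython the global interpreter lock limits what threads can gain, so a process pool (e.g. concurrent.futures.ProcessPoolExecutor) might be the better fit there.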

Lipastomies commented 3 years ago

PhenoinfoDAO is implemented in branch phenoinfo_dao

Lipastomies commented 1 year ago

This was done in the new grouping PR.
