FINNGEN / autoreporting

MIT License

Rethink and simplify program design #163

Closed Lipastomies closed 1 year ago

Lipastomies commented 3 years ago

Currently the program is divided into approximately four steps:

  1. Group variants, add CS information.
     a. But also add phenotype information, and also crash on invalid phenoinfo data. Yikes.
  2. Add annotations, creating a large dataframe.
  3. Compare against the GWAS Catalog, which is an annotation of sorts.
  4. Create top report.
     a. This part has no knowledge of the annotations, but it selects some of the columns from some of the annotations, so it has to know their names. That requires orchestrating between report and top report creation. Currently the column names are hard-coded, which means a lot of fiddling, typos and logic mistakes whenever either this step or the annotations change.

I recently thought about this, and there are at least a few things that could be done differently:

  1. Separate phenotype information into its own kind of annotation and move it to the annotation step. Limit the grouping step to the minimal set of operations needed, mostly so debugging would be clearer.
  2. Abstract away the annotations, e.g. behind an AnnotationDAO or whatever. Then ensure that they are handled correctly. This would simplify the code quite a bit, AND make it easy to add new annotations - just write a DAO for it & add it to args.
  3. Use these abstracted annotations in creating the top report. For example, information on whether an annotation is included in top report could be added to the DAO, e.g. by adding a get_top_report_annotation() method to the AnnotationDAO. Some parts do not generalise that well, like novelty reporting and functional variant aggregation, but that I can live with.
    Having these annotations as an abstract DAO would also make it possible to stop caring about the implementation details - I imagine it would be relatively simple to just deal with a container of AnnotationDAOs, add them to the top report as their implementation defines, and maybe have a separate container/some sort of filter for the aggregated data. But we'll see.
  4. Instead of eagerly joining all of the annotations to the grouped data as soon as possible, maybe let them live as annotation objects that can be merged to the grouped data when we want to create the report or top report. This could make data processing simpler.
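
A rough sketch of what points 2 and 3 could look like. Only the names AnnotationDAO and get_top_report_annotation() come from the text above; the annotate() method, the DictAnnotationDAO toy class, the top_report_columns() helper and the column names are assumptions made up for illustration, and plain dicts stand in for dataframes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]  # annotation columns for one variant (stand-in for a dataframe row)


class AnnotationDAO(ABC):
    """Hypothetical interface; the method set is an assumption, not the real design."""

    @abstractmethod
    def annotate(self, variants: List[str]) -> Dict[str, Row]:
        """Return annotation columns keyed by variant id."""

    @abstractmethod
    def get_top_report_annotation(self) -> List[str]:
        """Return the names of the columns that belong in the top report."""


class DictAnnotationDAO(AnnotationDAO):
    """Toy DAO backed by an in-memory dict, for illustration only."""

    def __init__(self, data: Dict[str, Row], top_columns: List[str]):
        self._data = data
        self._top_columns = top_columns

    def annotate(self, variants: List[str]) -> Dict[str, Row]:
        # Variants without an annotation are simply left out.
        return {v: self._data[v] for v in variants if v in self._data}

    def get_top_report_annotation(self) -> List[str]:
        return list(self._top_columns)


def top_report_columns(daos: List[AnnotationDAO]) -> List[str]:
    """Ask each DAO which columns it contributes, instead of hard-coding the names."""
    columns: List[str] = []
    for dao in daos:
        columns.extend(dao.get_top_report_annotation())
    return columns
```

With something like this, the top report would only need a container of DAOs and would never have to spell out annotation column names itself.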

So the program structure would be three steps after the changes:

  1. Group, add cs info since it's used in grouping
  2. Add all annotations (the current annotations, phenoinfo and GWAS Catalog comparison)
  3. Output normal report, top report.
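
As a minimal sketch of that three-step flow, with annotations kept separate and merged only at output time: every function name, field name ("variant", "locus", "pval", "group") and the grouping and top-row rules below are placeholders assumed for illustration, not the program's actual logic, and plain dicts again stand in for dataframes.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Annotation = Dict[str, Row]  # variant id -> annotation columns
Source = Callable[[List[str]], Annotation]


def group(variants: List[Row]) -> List[Row]:
    """Step 1 stand-in: here a variant's group is just its locus (assumption)."""
    return [{**v, "group": v["locus"]} for v in variants]


def collect_annotations(grouped: List[Row], sources: List[Source]) -> List[Annotation]:
    """Step 2: fetch every annotation, but keep each one as a separate object."""
    ids = [str(row["variant"]) for row in grouped]
    return [source(ids) for source in sources]


def merge(grouped: List[Row], annotations: List[Annotation]) -> List[Row]:
    """Join annotations onto the grouped rows only when a report is needed."""
    merged = []
    for row in grouped:
        out = dict(row)
        for annotation in annotations:
            out.update(annotation.get(str(row["variant"]), {}))
        merged.append(out)
    return merged


def top_report(report: List[Row]) -> List[Row]:
    """Placeholder top report: the smallest-pval row of each group (assumption)."""
    best: Dict[object, Row] = {}
    for row in report:
        current = best.get(row["group"])
        if current is None or row["pval"] < current["pval"]:
            best[row["group"]] = row
    return list(best.values())


def run(variants: List[Row], sources: List[Source]) -> Tuple[List[Row], List[Row]]:
    grouped = group(variants)                           # 1. group
    annotations = collect_annotations(grouped, sources)  # 2. annotate
    report = merge(grouped, annotations)                 # 3. output
    return report, top_report(report)
```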

These steps IMO pave the way for some more ambitious changes:

Lipastomies commented 3 years ago

PhenoinfoDAO is implemented in branch phenoinfo_dao

Lipastomies commented 1 year ago

This was done in the new grouping PR.