FINNGEN / autoreporting

MIT License

Drop pandas from main data pipeline #202

Closed Lipastomies closed 1 year ago

Lipastomies commented 1 year ago

This PR is essentially a rewrite of the whole autoreporting pipeline. Some data sources (the GWAS Catalog API, etc.) are kept as they are, since they are not much of a problem for now. The grouping, annotation, and report creation are completely redone.

The main idea is that, instead of having a pandas table that gains more and more columns, we first form the data, which consists of groups of variants, and then fetch annotations for those variants from various sources. The annotations are not joined to the variants at the point of fetching. Instead, the variant and group reports are formed from these separate sources of data.

The basic unit of autoreporting is a Locus. A locus has a peak variant, and may contain LD/range partners and credible set variants. Currently these are divided into two locus types, but that does not affect the code much.
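To illustrate the shape of this data, here is a minimal sketch. The class names, field names, and the two `LocusType` members are assumptions made for illustration and are not taken from the PR's code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class LocusType(Enum):
    # Hypothetical names for the two locus types mentioned above.
    LD_OR_RANGE = "ld_or_range"
    CREDIBLE_SET = "credible_set"


@dataclass(frozen=True)
class Variant:
    # Frozen so that variants are hashable and can be used as dict keys.
    chrom: str
    pos: int
    ref: str
    alt: str


@dataclass
class Locus:
    peak: Variant
    locus_type: LocusType
    ld_partners: List[Variant] = field(default_factory=list)
    credible_set: List[Variant] = field(default_factory=list)

    def variants(self) -> List[Variant]:
        """Return every distinct variant in the locus, peak first."""
        ordered = dict.fromkeys([self.peak, *self.ld_partners, *self.credible_set])
        return list(ordered)
```

With a structure like this, the locus type is just a field on the locus, which matches the remark that the split into two types does not affect the code much.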

The pipeline currently forms the loci based on options, then uses all of the variants in the loci to fetch annotations, and finally uses the data to create the variant and group reports.
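The three stages can be sketched roughly as follows. This is a simplified illustration under assumed names (`collect_variants`, `fetch_annotations`, `variant_report`, `group_report`, and a plain-tuple `Variant`), not the actual code in this PR. The point it shows is that annotations stay in their own per-source tables and are only combined with the loci when a report is built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical simplified types: a variant is (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]
Annotation = Dict[str, str]
AnnotationSource = Callable[[List[Variant]], Dict[Variant, Annotation]]


@dataclass
class Locus:
    peak: Variant
    partners: List[Variant] = field(default_factory=list)


def collect_variants(loci: List[Locus]) -> List[Variant]:
    """Gather the distinct variants of all loci, preserving order."""
    seen: Dict[Variant, None] = {}
    for locus in loci:
        for variant in [locus.peak, *locus.partners]:
            seen[variant] = None
    return list(seen)


def fetch_annotations(
    variants: List[Variant], sources: Dict[str, AnnotationSource]
) -> Dict[str, Dict[Variant, Annotation]]:
    """Query each source once; results are kept per source, not joined."""
    return {name: source(variants) for name, source in sources.items()}


def variant_report(loci, annotations) -> List[dict]:
    """One row per variant, with annotations looked up at report time."""
    rows = []
    for locus in loci:
        for variant in [locus.peak, *locus.partners]:
            row = {"variant": variant, "locus_peak": locus.peak}
            for name, table in annotations.items():
                for key, value in table.get(variant, {}).items():
                    row[f"{name}_{key}"] = value
            rows.append(row)
    return rows


def group_report(loci, annotations) -> List[dict]:
    """One row per locus, summarising how many members each source annotated."""
    rows = []
    for locus in loci:
        members = [locus.peak, *locus.partners]
        row = {"locus_peak": locus.peak, "n_variants": len(members)}
        for name, table in annotations.items():
            row[f"{name}_annotated"] = sum(1 for v in members if v in table)
        rows.append(row)
    return rows
```

Because each source is just a function from variants to a lookup table, adding or removing an annotation source does not change the shape of the core data.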

The previous comparison step, where variants were compared against GWAS Catalog etc., is now a variant annotation step. This allows most of the code in the compare step to be thrown out completely.
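One way to picture comparison as annotation is sketched below. The `VariantAnnotator` protocol, `CatalogHit`, and `InMemoryCatalog` are hypothetical names used only for illustration, and the in-memory dictionary stands in for a real lookup such as the GWAS Catalog API; none of this is taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Tuple

# Hypothetical simplified variant key: (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


@dataclass(frozen=True)
class CatalogHit:
    trait: str
    pval: float


class VariantAnnotator(Protocol):
    """Anything that can map variants to previously reported associations."""

    def annotate(self, variants: Iterable[Variant]) -> Dict[Variant, List[CatalogHit]]:
        ...


class InMemoryCatalog:
    """Stand-in for an external association catalog, backed by a dict."""

    def __init__(self, records: Dict[Variant, List[CatalogHit]]) -> None:
        self._records = records

    def annotate(self, variants: Iterable[Variant]) -> Dict[Variant, List[CatalogHit]]:
        # Only variants with at least one known association get an entry.
        return {v: self._records[v] for v in variants if v in self._records}
```

Under this view, "comparing" a variant against a catalog is the same operation as any other annotation lookup, so no separate compare stage is needed.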

Another advantage of this rewrite is that the code now works with typed objects instead of a pandas table with ambiguous columns. This makes reasoning about the code much easier in most places.

Lipastomies commented 1 year ago

Note: the README is going to be outdated after this PR, but that's OK; I'll fix it next.