The APIs for the GWAS Catalog and related code are quite messy right now. Sometimes the input data for functions and methods is pandas dataframes, sometimes it is lists of dicts. Preferably no dicts would be used as ad-hoc classes, as those require the programmer to keep the whole data structure in mind.
Possibilities include:
1) Replacing everything with pandas dataframes.
Pros: Consistent. We get some operations (easy filtering per field, aggregation) that might be useful in some cases.
Cons: Might be too rigid for the data, so we might have to work around the dataframes to get things working. Might not be a good match for the work done in those functions. The data types are still implicit, and the fields are not enforced anywhere. It's much like the List[Dict[str, Any]] option, as the programmer still has to keep the dataframe structure in mind.
2) Replacing dicts of stuff with proper dataclasses, named tuples, classes etc.
Pros: Explicit definitions for the datatypes. Better organisation. Less strain on the programmer. Function signatures are more informative: a function that takes a List[Rsid] or a Dict[Variant] says more than one that takes a pd.DataFrame or, even worse, a List[Dict[str, Any]].
Cons: Some operations (easy filtering per field, aggregations, data manipulation) can be more verbose and less clear than with dataframes. Dataclasses are only available in Python 3.7+, and most of the machines I use have Python 3.6. This is not a problem for named tuples or plain classes.
In my opinion, it makes sense to define some specific classes/named tuples, for example for GWAS Catalog variants and Ensembl API output, and then flesh out the structure with those in mind.
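As a rough illustration of option 2, here is a minimal sketch using typing.NamedTuple, which works on Python 3.6. The Variant fields, the helper names (variants_from_records, significant) and the sample records are all made up for this example; they are assumptions, not the actual GWAS Catalog or Ensembl schema.

```python
from typing import Any, Dict, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical record for one association; field names are assumptions."""
    rsid: str
    chrom: str
    pos: int
    pval: float
    trait: str


def variants_from_records(records: List[Dict[str, Any]]) -> List[Variant]:
    """Convert loosely typed dicts into Variant tuples.

    A missing key raises KeyError here, at the boundary, instead of
    somewhere deep inside the code that consumes the data.
    """
    return [
        Variant(
            rsid=str(r["rsid"]),
            chrom=str(r["chrom"]),
            pos=int(r["pos"]),
            pval=float(r["pval"]),
            trait=str(r["trait"]),
        )
        for r in records
    ]


def significant(variants: List[Variant], threshold: float = 5e-8) -> List[Variant]:
    """Filter by p-value; more verbose than a dataframe mask, but explicit."""
    return [v for v in variants if v.pval < threshold]


# Made-up input in the current List[Dict[str, Any]] style.
raw = [
    {"rsid": "rs1", "chrom": "1", "pos": "1000", "pval": "1e-9", "trait": "trait A"},
    {"rsid": "rs2", "chrom": "2", "pos": 2000, "pval": 0.01, "trait": "trait B"},
]
variants = variants_from_records(raw)
hits = significant(variants)
print([v.rsid for v in hits])
```

The signature `significant(variants: List[Variant], ...)` tells the reader which fields exist without opening the calling code, which is the main point of option 2. The filtering is a list comprehension rather than a one-line dataframe mask, which is the verbosity trade-off mentioned in the cons.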