cbeauhilton commented 3 years ago

Model Example


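from typing import Optional

from sqlmodel import Field, SQLModel


class MetricBase(SQLModel):
    # Identifying fields, with defaults for the most common case
    organism: str = "homo sapiens"
    clinical_setting: str = "all"
    sample_source: str = "blood"
    metric_name: str
    units: str
    # Extremes and normal limits are optional: a source may report only one
    maximum_known: Optional[float] = None
    minimum_known: Optional[float] = None
    maximum_known_ref: Optional[str] = None
    minimum_known_ref: Optional[str] = None
    upper_limit_normal: Optional[float] = None
    lower_limit_normal: Optional[float] = None
    normals_ref: Optional[str] = None


class Metric(MetricBase, table=True):
    # Surrogate key, left as None so the database assigns it on insert
    id: Optional[int] = Field(default=None, primary_key=True)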

cbeauhilton commented 3 years ago

Data Dictionary

organism: defaults to humans, but it would be wise to make this more general (and allow comparative biology studies?).
clinical_setting: defaults to "all", to capture the absolute known max and min. If there are other clinical settings that would be useful to filter on, capture these as well (e.g. hyperferritinemia in HLH seems to cap somewhere south of 100k, but in heme malignancy it can reach >150k; sex differences; age differences). Might need to define combinations of unique fields/composite primary keys to allow for these (maybe organism+clinical_setting+sample_source+metric_name, where clinical_setting is itself a composite key of sex+age+clinical_scenario, maybe with options for geography/race/ethnicity, which might be important in something like benign ethnic neutropenia?).
sample_source: defaults to blood, but could be serum, CSF, something like "body" for measurements such as age/weight/height, a particular imaging study source (CT with certain cuts, TTE measurements, ...), etc.
metric_name: pick a generalized name.
units: for each metric, we would have to pick a canonical unit, then convert when people contribute metrics in other units (initially, probably just make contributors do their own conversions before contributing - but ‘Pint’ is great).
For the min/max, these have to be Optional[float], as there may be only a min OR a max reported, and some values don't make sense (Hgb 0 == dead).
Ditto for the upper/lower limits of normal.
*_ref: preferably from peer-reviewed literature. This whole project may be an interesting way to mobilize case studies. Could also have "in-house from xyz_institution" as an option for authorized committers (e.g. folks at VUMC included in the project with the ability to pull data from the Synthetic/Research Derivatives, or folks from other places with similar institutional access). For the normal ranges, will start with VUMC in-house reference ranges, but ranges from the literature would be good as well. On the megalomaniacal end, if we could include multiple reference ranges from different institutions, people could filter to their own locale.
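
A minimal sketch of the organism+clinical_setting+sample_source+metric_name uniqueness idea from the clinical_setting entry. It uses the standard-library sqlite3 module instead of SQLModel, purely to keep it dependency-free. The table layout, the add_metric helper, and the "ng/mL" unit string are assumptions for illustration, not a settled schema.

```python
import sqlite3

# Hypothetical schema: column names follow MetricBase, but the UNIQUE
# constraint is only one guess at how the composite key might look.
SCHEMA = """
CREATE TABLE metric (
    id INTEGER PRIMARY KEY,
    organism TEXT NOT NULL DEFAULT 'homo sapiens',
    clinical_setting TEXT NOT NULL DEFAULT 'all',
    sample_source TEXT NOT NULL DEFAULT 'blood',
    metric_name TEXT NOT NULL,
    units TEXT NOT NULL,
    maximum_known REAL,  -- nullable: only a min OR a max may be reported
    minimum_known REAL,
    UNIQUE (organism, clinical_setting, sample_source, metric_name)
)
"""


def add_metric(conn, metric_name, units, **fields):
    """Insert one row; columns not passed fall back to the table defaults."""
    row = {"metric_name": metric_name, "units": units, **fields}
    # Column names come from trusted code here, never from user input.
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO metric ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Same metric in two clinical settings: both rows are accepted.
add_metric(conn, "ferritin", "ng/mL")
add_metric(conn, "ferritin", "ng/mL", clinical_setting="HLH")

# A second "all"-setting ferritin row violates the composite constraint.
try:
    add_metric(conn, "ferritin", "ng/mL")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM metric").fetchone()[0]
```

Keeping a surrogate id and putting the four fields under a UNIQUE constraint, rather than making them the primary key itself, would leave room to add sex/age/geography columns to the constraint later without touching anything that references id.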

cbeauhilton commented 3 years ago

Data Infrastructure

Relational databases (SQLite/PostgreSQL) are probably the right answer, but a NoSQL approach may make it easier to adjust the schema on the fly without a bunch of migrations. Making migrations easy is probably the "real" answer. SQLModel is also very, very nice, and it would be a shame to lose it if we chose a NoSQL approach. Could also do what I’m doing for the ash-abstracts project and build the db from JSON with ‘alter=true’, which kind of accomplishes both goals.
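
For reference, a rough standard-library sketch of the "build the db from JSON and let the schema grow" idea. This is not the ash-abstracts code and not any particular library's API: insert_with_alter is a hypothetical helper that adds a column whenever a record carries a key the table has not seen yet, and the two records are made-up placeholders.

```python
import json
import sqlite3


def insert_with_alter(conn, table, record):
    """Insert a dict, first adding a column for any key the table lacks."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for key in record:
        if key not in existing:
            # No declared type: SQLite stores whatever value type it is given.
            conn.execute(f'ALTER TABLE {table} ADD COLUMN "{key}"')
    columns = ", ".join(f'"{k}"' for k in record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        list(record.values()),
    )


# Placeholder records; the second one introduces a new field.
records = json.loads("""
[
  {"metric_name": "ferritin", "units": "ng/mL"},
  {"metric_name": "hemoglobin", "units": "g/dL", "sample_source": "blood"}
]
""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metric (id INTEGER PRIMARY KEY)")
for record in records:
    insert_with_alter(conn, "metric", record)

column_names = [row[1] for row in conn.execute("PRAGMA table_info(metric)")]
```

Rows inserted before a column existed simply hold NULL for it, which is the trade-off of this approach: no migrations to write, but also no type checks or constraints beyond what is added by hand.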

I'm fairly certain I'm going to miss a bunch of possibly essential fields for the core database. It may also make sense to have a separate table for each metric? (I think I prefer the metric-based approach to a hierarchical approach based on e.g. organism.)

cbeauhilton / extremebiometry

Data Model #1
