ianfore opened this issue 3 years ago
See [this diagram]() for a flow of how the data dictionaries get from dbGaP to GA4GH Search. In the case of the tables I created this weekend, we haven't yet run the data dictionaries through the DNAStack step, so for the moment the schemas we see in GA4GH Search are autogenerated from the BigQuery table definitions. Those autogenerated schemas don't include the enumerated listings of codes. However, for the moment we can get the definitions from the dbGaP data dictionary itself. For example, here's a link to the data dictionary for the data in the organoid_profiling_pc_subject_phenotypes_gru table.
Note that if you open the link in a web browser it will display as an HTML table. For API purposes you can read the XML programmatically. Another approach is to read the dictionary visually and then translate the data by hand in Python or R.
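As a minimal sketch of the programmatic route, the snippet below pulls a data dictionary XML and builds a per-variable code-to-meaning lookup. The URL is a placeholder, and the element names (`variable`, `name`, `description`, `value` with a `code` attribute) assume the typical dbGaP data_dict layout; adjust them if the actual file differs.

```python
# Sketch: parse a dbGaP data dictionary XML into {variable: {code: meaning}} lookups.
# Assumes the usual dbGaP data_dict layout: <variable> elements containing
# <name>, <description>, and <value code="..."> children. Adjust if the file differs.
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL - substitute the actual data dictionary link for the
# organoid_profiling_pc_subject_phenotypes_gru table.
DICT_URL = "https://ftp.ncbi.nlm.nih.gov/dbgap/.../data_dict.xml"

with urllib.request.urlopen(DICT_URL) as resp:
    root = ET.fromstring(resp.read())

codings = {}
for var in root.iter("variable"):
    name = var.findtext("name")
    desc = var.findtext("description")
    # Enumerated codes, e.g. <value code="1">Male</value>
    values = {v.get("code"): (v.text or "").strip() for v in var.findall("value")}
    codings[name] = {"description": desc, "values": values}

# e.g. codings["SEX"]["values"] might then map {"1": "Male", "2": "Female"}
```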
There are some mapping tables that can help, which I will add to GA4GH Search.
Created these two tables to do the mapping:

- search_cloud.cshcodeathon.md_mapping
- search_cloud.cshcodeathon.md_mapping_term

Working on an example that uses them.
Mapping example added.
Now need to map more columns!
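To make the idea concrete, here is a hedged sketch of the kind of join such a mapping query might involve, issued through the GA4GH Search API from Python. It is illustration only, not the example referenced above: the md_mapping_term column names (table_name, column_name, code, meaning) and the Search endpoint URL are assumptions, so check the actual table schemas before using it.

```python
# Sketch only: decode the coded `sex` column by joining against the mapping term table.
# The md_mapping_term column names (table_name, column_name, code, meaning) and the
# Search endpoint URL are assumptions - verify against the real schema before use.
import requests

SEARCH_URL = "https://example-search-adapter.example.org/search"  # placeholder

query = """
SELECT p.subject_id,
       p.sex     AS sex_code,
       t.meaning AS sex
FROM search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru p
JOIN search_cloud.cshcodeathon.md_mapping_term t
  ON t.table_name  = 'organoid_profiling_pc_subject_phenotypes_gru'
 AND t.column_name = 'sex'
 AND t.code        = CAST(p.sex AS varchar)
"""

resp = requests.post(SEARCH_URL, json={"query": query})
resp.raise_for_status()
# First page only; a real client would follow pagination.next_page_url for the rest.
for row in resp.json().get("data", []):
    print(row)
```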
Added one table so far, with unique codings for sex and race. The column (variable) names are also unique. The table so far: search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru
Aiming for three or four such tables from dbGaP. The codings and column names will vary.
The question is: how can the machine-readable information (schema) provided about each table make things easier for a data scientist? We assume they are using tools such as Python or R and can transform the data in those tools quite easily, as long as they have the information to do so. /table/tablename/info provides that information.
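As a quick illustration of how a data scientist might consume that endpoint, the snippet below fetches the table info and prints each column with its type and description. The base URL is a placeholder, and the response shape assumed here (a data_model JSON Schema with a properties map) follows the GA4GH Search table-info convention.

```python
# Sketch: fetch the machine-readable schema for a table via /table/{name}/info
# and list its columns. BASE_URL is a placeholder for the actual Search endpoint.
import requests

BASE_URL = "https://example-search-adapter.example.org"  # placeholder
TABLE = "search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru"

info = requests.get(f"{BASE_URL}/table/{TABLE}/info").json()

# The data_model is a JSON Schema; its `properties` map describes each column.
for col, spec in info.get("data_model", {}).get("properties", {}).items():
    print(col, spec.get("type"), "-", spec.get("description", ""))
```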
Note that in dbGaP the data used in the table above is controlled-access. The dataset available through the GA4GH Search API uses values drawn from that dataset, but each record (row) is a simulated example, not a real record.