ianfore opened this issue 3 years ago
See [this diagram]() for a flow of how the data dictionaries get from dbGaP to GA4GH Search. In the case of the tables I created this weekend, we haven't yet run the data dictionaries through the DNAStack step, so for the moment the schemas we see in GA4GH Search are autogenerated from the BigQuery table definitions. Those autogenerated schemas don't include the enumerated listings of codes. However, for the moment we can get the definitions from the dbGaP data dictionary itself. For example, here's a link to the data dictionary for the data in the organoid_profiling_pc_subject_phenotypes_gru table.
Note that if you open the link in a web browser it will display as an HTML table. For API purposes you can read the XML programmatically. Another approach is to read the dictionary visually and then translate the data by hand in Python or R.
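As a minimal sketch of the programmatic route, the snippet below pulls a data dictionary XML and builds a per-variable code-to-meaning lookup. The URL is a placeholder, and the element names (`variable`, `name`, `description`, `value` with a `code` attribute) assume the typical dbGaP data_dict layout; adjust them if the actual file differs.

```python
# Sketch: parse a dbGaP data dictionary XML into {variable: {code: meaning}} lookups.
# Assumes the usual dbGaP data_dict layout: <variable> elements containing
# <name>, <description>, and <value code="..."> children. Adjust if the file differs.
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL - substitute the actual data dictionary link for the
# organoid_profiling_pc_subject_phenotypes_gru table.
DICT_URL = "https://ftp.ncbi.nlm.nih.gov/dbgap/.../data_dict.xml"

with urllib.request.urlopen(DICT_URL) as resp:
    root = ET.fromstring(resp.read())

codings = {}
for var in root.iter("variable"):
    name = var.findtext("name")
    desc = var.findtext("description")
    # Enumerated codes, e.g. <value code="1">Male</value>
    values = {v.get("code"): (v.text or "").strip() for v in var.findall("value")}
    codings[name] = {"description": desc, "values": values}

# e.g. codings["SEX"]["values"] might then map {"1": "Male", "2": "Female"}
```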
There are some mapping tables that can help, which I will add to GA4GH Search.
Created these two tables to do the mapping:

- search_cloud.cshcodeathon.md_mapping
- search_cloud.cshcodeathon.md_mapping_term

Working on an example that uses them.
Mapping example added.
Now need to map more columns!
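To make the idea concrete, here is a hedged sketch of the kind of join such a mapping query might involve, issued through the GA4GH Search API from Python. It is illustration only, not the example referenced above: the md_mapping_term column names (table_name, column_name, code, meaning) and the Search endpoint URL are assumptions, so check the actual table schemas before using it.

```python
# Sketch only: decode the coded `sex` column by joining against the mapping term table.
# The md_mapping_term column names (table_name, column_name, code, meaning) and the
# Search endpoint URL are assumptions - verify against the real schema before use.
import requests

SEARCH_URL = "https://example-search-adapter.example.org/search"  # placeholder

query = """
SELECT p.subject_id,
       p.sex     AS sex_code,
       t.meaning AS sex
FROM search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru p
JOIN search_cloud.cshcodeathon.md_mapping_term t
  ON t.table_name  = 'organoid_profiling_pc_subject_phenotypes_gru'
 AND t.column_name = 'sex'
 AND t.code        = CAST(p.sex AS varchar)
"""

resp = requests.post(SEARCH_URL, json={"query": query})
resp.raise_for_status()
# First page only; a real client would follow pagination.next_page_url for the rest.
for row in resp.json().get("data", []):
    print(row)
```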
Added one table so far, with unique codings for sex and race. The column (variable) names are also unique. The table so far: search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru
Aiming for three or four such tables from dbGaP. The codings and column names will vary.
The question is: how can the machine-readable information (schema) provided about each table make things easier for a data scientist? We assume they are using tools such as Python or R and can transform the data in those tools quite easily, as long as they have the information to do so. /table/tablename/info provides that information.
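As a quick illustration of how a data scientist might consume that endpoint, the snippet below fetches the table info and prints each column with its type and description. The base URL is a placeholder, and the response shape assumed here (a data_model JSON Schema with a properties map) follows the GA4GH Search table-info convention.

```python
# Sketch: fetch the machine-readable schema for a table via /table/{name}/info
# and list its columns. BASE_URL is a placeholder for the actual Search endpoint.
import requests

BASE_URL = "https://example-search-adapter.example.org"  # placeholder
TABLE = "search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru"

info = requests.get(f"{BASE_URL}/table/{TABLE}/info").json()

# The data_model is a JSON Schema; its `properties` map describes each column.
for col, spec in info.get("data_model", {}).get("properties", {}).items():
    print(col, spec.get("type"), "-", spec.get("description", ""))
```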
Note that in dbGaP the data used in the table above is controlled-access. The dataset available through the GA4GH Search API uses values drawn from that dataset, but each record (row) is a simulated example, not a real record.