Make smartBag Support Table Joins

ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development

BSD 3-Clause "New" or "Revised" License


Make smartBag Support Table Joins #120

Open stevencox opened 6 years ago

stevencox commented 6 years ago

smartBag can generate a smartAPI from a BDBag.

But it's very simple and does not support API endpoints that require joining tabular data from multiple files.

stevencox commented 6 years ago

@tubafrenzy, please assign a milestone date to this item and update status in the issue.

tubafrenzy commented 6 years ago

This will be finished by the end of next week.

tubafrenzy commented 6 years ago

@stevencox Do you have a sample join that I could use for testing and development purposes? Seems like the API will need to accept parameters that indicate which column on one data set to join to which column on another data set. For now I am playing around with a dummy metadata file keyed off of the bicluster "index_id" field.

stevencox commented 6 years ago

All you need to design the feature is any column shared by two input files.

CTD_chem_gene_ixns.csv header:

# Fields:
# ChemicalName,ChemicalID,CasRN,GeneSymbol,GeneID,GeneForms,Organism,OrganismID,Interaction,InteractionActions,PubMedIDs

CTD_chemicals.csv header:


# Fields:
# ChemicalName,ChemicalID,CasRN,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,DrugBankIDs

The generated service should allow a query by ChemicalID to return data joining CTD_chemicals and CTD_chem_gene_ixns data. Assume column names are the same.
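A minimal sketch of the kind of join described here, assuming the two files are loaded into an in-memory SQLite database and joined on the shared ChemicalID column. The helper names (`load_table`, `join_by_chemical_id`), the table names, and the sample rows are hypothetical illustrations, not part of smartBag or the CTD data; only a subset of the real columns is shown.

```python
import csv
import io
import sqlite3

# Made-up sample rows for illustration only (not real CTD records).
CHEMICALS_CSV = """ChemicalName,ChemicalID,CasRN
ExampleChemA,C000001,
ExampleChemB,C000002,
"""

IXNS_CSV = """ChemicalName,ChemicalID,GeneSymbol,GeneID
ExampleChemA,C000001,GENE1,101
ExampleChemA,C000001,GENE2,102
ExampleChemB,C000002,GENE3,103
"""


def load_table(conn, name, text):
    """Load CSV text into a SQLite table whose columns match the CSV header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({marks})', data)


def join_by_chemical_id(conn, chemical_id):
    """Return joined rows from both tables for a single ChemicalID."""
    sql = (
        'SELECT c."ChemicalID", c."ChemicalName", i."GeneSymbol", i."GeneID" '
        'FROM chemicals AS c '
        'JOIN ixns AS i ON c."ChemicalID" = i."ChemicalID" '
        'WHERE c."ChemicalID" = ? '
        'ORDER BY i."GeneSymbol"'
    )
    return conn.execute(sql, (chemical_id,)).fetchall()


conn = sqlite3.connect(":memory:")
load_table(conn, "chemicals", CHEMICALS_CSV)
load_table(conn, "ixns", IXNS_CSV)
result = join_by_chemical_id(conn, "C000001")
```

The join column is hard-coded here because both headers share the name ChemicalID. A generated service would presumably take the column name from the bag's metadata instead.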

tubafrenzy commented 6 years ago

Noticed that CTD_chem_gene_ixns.csv contains data of the form:

MESH:C533344

while CTD_chemicals.csv seems to have the prefix stripped off:

C025205

This discrepancy isn't completely germane to the development I am doing, but it would mean these tables don't join properly in a demo/example.
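One possible workaround for a demo, sketched under the assumption that the only difference between the two ID forms is a leading "MESH:" prefix: strip the prefix from both sides before comparing keys. The function name `normalize_chemical_id` is hypothetical.

```python
def normalize_chemical_id(value, prefix="MESH:"):
    """Strip an optional leading prefix so both ID forms compare equal."""
    value = value.strip()
    if value.startswith(prefix):
        return value[len(prefix):]
    return value


# The two forms quoted above reduce to the same bare-ID shape.
with_prefix = normalize_chemical_id("MESH:C533344")
without_prefix = normalize_chemical_id("C025205")
```

Applying this to the join column of each table at load time would let the sample files line up without changing the join logic itself.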

Also, as I've been going down this road: I assume (a) the API should be able to represent both one-to-one and many-to-one relationships from the perspective of queries against either table? Or (b) should they be merged into a single denormalized table result from the "many" perspective, with the "one" row duplicated on each line?

stevencox commented 6 years ago

(a), the normalized, relational approach, not the denormalized.
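To make the two options concrete, here is a rough sketch of the response shapes, assuming rows are plain dictionaries keyed by column name. The function names, the `chemical` and `interactions` keys, and the sample rows are hypothetical; the actual response format is not specified in this thread.

```python
def normalized_result(chemical, interactions, key="ChemicalID"):
    """Option (a): the 'one' row appears once, with related 'many' rows nested."""
    related = [row for row in interactions if row[key] == chemical[key]]
    return {"chemical": chemical, "interactions": related}


def denormalized_result(chemical, interactions, key="ChemicalID"):
    """Option (b): one flat row per 'many' row, repeating the 'one' columns."""
    return [
        {**chemical, **row}
        for row in interactions
        if row[key] == chemical[key]
    ]


# Made-up sample rows for illustration only.
chemical = {"ChemicalID": "C000001", "ChemicalName": "ExampleChemA"}
interactions = [
    {"ChemicalID": "C000001", "GeneSymbol": "GENE1"},
    {"ChemicalID": "C000001", "GeneSymbol": "GENE2"},
    {"ChemicalID": "C000002", "GeneSymbol": "GENE3"},
]

nested = normalized_result(chemical, interactions)
flat = denormalized_result(chemical, interactions)
```

In the normalized shape the chemical's columns are returned once regardless of how many interactions match, whereas the flat shape repeats them on every row.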