cancerDHC / tools

A repository for the work of the Tools workstream for CCDH

Build a pipeline for converting CRDC-H from Google Sheets to LinkML #32

Closed gaurav closed 3 years ago

gaurav commented 3 years ago

The CRDC-H is currently being developed as a series of Google Sheets. @jiaola has a script for converting these sheets into a LinkML representation in https://github.com/cancerDHC/ccdhmodel, but is looking for some help from the Tools team in terms of taking this over and then working closely with the LinkML community to adapt the generated LinkML as the specification is developed. I think it will be useful to come up with a workflow for doing this that will allow us to update the LinkML pretty quickly as the Google Sheet representation changes.

fragosog commented 3 years ago

Hi, just a feature request for the tool once you've got it working with Google Sheets. Since there is a restriction on NCI's use of Google Docs, I'd like to suggest a future enhancement for the tool to handle Excel sheets as well. Maybe I haven't thought this through fully, since the idea is to have the master model, if you will, in a shareable location, but just in case.

mellybelly commented 3 years ago

Thanks @fragosog. Google Sheets does have functionality not available in Excel and is our approved contracted workplace. That said, we can of course create digests in Excel, but I would be concerned about using it as a source of truth.

What is the restriction? Can we work on overcoming that simultaneously?

fragosog commented 3 years ago

Oh yes, @mellybelly, I don't want to interfere with the initial version of the tool. This is just a feature request for when v1 (vN?) is done.

Yes, Google Suite has not been approved for NCI work and we are highly discouraged from using it.

gaurav commented 3 years ago

I met with Dazhi today, and he gave me an overview of his tool, which is part of a larger toolset for computing over the mappings from the Google Sheets as a Neo4j database, hosted at https://github.com/HOT-Ecosystem/crdc-node-models. I'll try running that code on my computer next week and think about whether it makes sense to integrate it into the example data/model validation pipeline we're currently building. I guess the answer depends on whether development of the CRDC-H model after the v1.0 release will happen in Google Sheets or on the LinkML representation of the model via GitHub issues.

gaurav commented 3 years ago

We now have an initial script for generating the CRDC-H model in LinkML from the Google Sheets description (PR https://github.com/cancerDHC/ccdhmodel/pull/13). I've started thinking about what an automated pipeline for regenerating the CRDC-H model in LinkML would look like (https://github.com/cancerDHC/ccdhmodel/issues/16). Once we have a plan for what the pipeline should look like in the short term (probably operated manually by me), medium term (probably using GitHub Actions to automate regenerating the LinkML model), and long term (have development on the model be carried out within the ccdhmodel repository itself), I think we can go ahead and close this issue.
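For illustration only, here is a minimal sketch of the general shape such a conversion could take. It is not the script from the PR above. It assumes a sheet exported as CSV with one row per attribute, and the column names (`entity`, `attribute`, `type`, `cardinality`, `description`), the sample rows, the `rows_to_linkml` helper, and the placeholder schema URI are all hypothetical rather than taken from the actual CRDC-H sheets. The output uses the LinkML schema keys `classes`, `attributes`, `range`, `required`, and `multivalued`.

```python
import csv
import io
import json

# Hypothetical sheet export: one row per attribute. These column names and
# rows are assumptions for illustration, not the actual CRDC-H sheet layout.
SAMPLE_CSV = """entity,attribute,type,cardinality,description
Specimen,id,string,1..1,Identifier for the specimen
Specimen,derived_from,Specimen,0..m,Parent specimens
Subject,species,string,0..1,Species of the subject
"""


def rows_to_linkml(csv_text, schema_name="example_model"):
    """Group attribute rows by entity into a LinkML-style schema dict."""
    schema = {
        "id": f"https://example.org/{schema_name}",  # placeholder URI
        "name": schema_name,
        "classes": {},
    }
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Create the class entry the first time an entity is seen.
        cls = schema["classes"].setdefault(row["entity"], {"attributes": {}})
        # Split a cardinality such as "0..m" into lower and upper bounds.
        low, _, high = row["cardinality"].partition("..")
        cls["attributes"][row["attribute"]] = {
            "description": row["description"],
            "range": row["type"],
            "required": low != "0",
            "multivalued": high not in ("0", "1"),
        }
    return schema


schema = rows_to_linkml(SAMPLE_CSV)
# JSON is used here to avoid a YAML dependency; JSON output is also valid YAML.
print(json.dumps(schema, indent=2))
```

Keeping the conversion as a pure function from sheet text to a schema dictionary would let the same code be run by hand in the short term and from an automated job later, with only the step that fetches the sheet changing.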

balhoff commented 3 years ago

I think this is complete in terms of building a pipeline for the conversion. Issues relating to increasing automation, such as with GitHub Actions, can be managed at https://github.com/cancerDHC/ccdhmodel/issues/16.