cancerDHC / example-data

This repository is intended to act as a store of example data files from across the NCI Cancer Research Data Commons (CRDC) nodes in a number of formats.
MIT License

Add CCDH Pilot examples #22

Closed gaurav closed 2 years ago

gaurav commented 2 years ago

Adds the CCDH Pilots to the Example Data Repository. Some of these don't work, probably because of bugs in LinkML Runtime (https://github.com/cancerDHC/example-data/issues/39). Others either can't be set up with the Python Data Classes or cannot be validated when they are (https://github.com/cancerDHC/example-data/issues/40).

Changes requested:

Should be merged after PR #41

turbomam commented 2 years ago

I ran poetry update and poetry run pytest, and all tests passed.

I get the following warning, as I do in some other LinkML-based projects.

head-and-mouth/test_load.py::test_transform_gdc_data
  /Users/MAM/Library/Caches/pypoetry/virtualenvs/crdch-example-workflows-ma-Vv354-py3.9/lib/python3.9/site-packages/rdflib_jsonld/__init__.py:9: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.0. Please remove rdflib-jsonld from your project's dependencies.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
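
For anyone who wants to quiet this warning until rdflib-jsonld is dropped from the dependencies (which is what the warning itself recommends), here is a minimal sketch using only the standard library warnings module. The emit_rdflib_jsonld_warning and count_warnings helpers are hypothetical stand-ins written for illustration; they are not part of this repository, rdflib, or rdflib-jsonld, and the message prefix is simply copied from the output above.

```python
import warnings


def emit_rdflib_jsonld_warning() -> None:
    """Hypothetical stand-in for the warning raised when rdflib_jsonld is imported."""
    warnings.warn(
        "The rdflib-jsonld package has been integrated into rdflib as of "
        "rdflib==6.0.0. Please remove rdflib-jsonld from your project's "
        "dependencies.",
        DeprecationWarning,
    )


def count_warnings(ignore: bool) -> int:
    """Return how many warnings are recorded, with or without the ignore filter."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning by default so the effect of the filter is visible.
        warnings.simplefilter("always")
        if ignore:
            # The message argument is a regex matched against the start of the
            # warning text; this filter is inserted ahead of the "always" rule.
            warnings.filterwarnings(
                "ignore",
                message=r"The rdflib-jsonld package has been integrated",
                category=DeprecationWarning,
            )
        emit_rdflib_jsonld_warning()
    return len(caught)


if __name__ == "__main__":
    print("without filter:", count_warnings(ignore=False))
    print("with filter:", count_warnings(ignore=True))
```

Filtering only hides the message. Removing rdflib-jsonld from the project's dependencies is still the longer-term fix.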

turbomam commented 2 years ago

Is there any documentation about where the YAML data files in ccdh-pilot came from? I think they were a Frankensteination of real values from various records in the GDC and PDC backends, and that it was performed by the Data Harmonization team. I think we should include an explanation or a link in a README in this folder.

turbomam commented 2 years ago

@gaurav, when you say that some of the Pilot Demonstrators "don't work", do you mean that they don't validate on ingestion, or that they couldn't be created programmatically right now because of the decimal type incompatibility?

I'll be trying to answer this question for myself and possibly writing a test now.

gaurav commented 2 years ago

> @gaurav, when you say that some of the Pilot Demonstrators "don't work", do you mean that they don't validate on ingestion, or that they couldn't be created programmatically right now because of the decimal type incompatibility?

Sorry, I should have been clearer! The YAML files included in this PR all "work" in that they pass validation, but I had to comment out some lines that either don't pass validation or that couldn't be loaded using the Python Data Classes. I think this is almost entirely because of https://github.com/cancerDHC/ccdhmodel/issues/154, so I don't know that new tests are necessary. I have a list of all non-validating fields: #39