bmir-radx / radx-project

This repo serves as a primary location for tracking issues that don't quite fit into our other dedicated repositories

NIH CDEs ingestion #3

Open marcosmro opened 11 months ago

marcosmro commented 11 months ago

@matthewhorridge: Reviewer @jkyu: Primary @egyedia : Secondary

marcosmro commented 11 months ago

September 2023

Accomplishments

marcosmro commented 11 months ago

Atti, Martin, and I agreed yesterday that this would be a good task for Jimmy to start getting familiar with both CEDAR and RADx. I've adjusted the roles, setting Jimmy as 'Primary' and Atti as 'Secondary'.

marcosmro commented 11 months ago

Note from John: "Note that re Jimmy Yu's work on XA6.4 (CDE migration into CEDAR), we need to attend to the agreement we had with the NIH CDE Repository to prioritize their 'approved' CDEs to users over 'just any CDEs' from their repo."

jkyu commented 11 months ago

I started a repo for this here: https://github.com/bmir-radx/cedar-nih-tools (the name mimics the cedar-cadsr-tools that ingested the caDSR CDEs).

This currently has a very basic function: it reads in a .json exported from the NIH CDE repository, picks out a few relevant metadata fields, and populates them in a CEDAR field using builders from the CEDAR artifact library.

There are some complexities that we'll need to address:

There are also some more general software concerns that I'll keep working on, such as:

jkyu commented 10 months ago

As a test, I tried to post the attached json schema (test_cde.json) generated by the current ingestion code. I get a validation error due to missing required properties (message below). I tried to provide the minimum info required to the cedar-artifact-library and I thought that default values would be set for things like createdBy and createdOn, but it looks like I need to map these values explicitly. I'll work on filling in these holes.

test_cde.json

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>
{"errorKey":"invalidData","objects":{"validationReport":{"validates":"false","warnings":[],"errors":[{"message":"object has missing required properties (['requiredValue'])","location":"/_valueConstraints","additionalInfo":{"schemaFile":"#","schemaPointer":"/definitions/literalFieldValueConstraintsContent"}},{"message":"object has missing required properties (['oslc:modifiedBy','pav:createdBy','pav:createdOn','pav:lastUpdatedOn','schema:description','schema:name','skos:altLabel','skos:prefLabel'])","location":"/@context","additionalInfo":{"schemaFile":"#","schemaPointer":"/definitions/templateFieldJSONLDContextFieldContent"}}]}},"errorMessage":"object has missing required properties (['requiredValue'])\nobject has missing required properties (['oslc:modifiedBy','pav:createdBy','pav:createdOn','pav:lastUpdatedOn','schema:description','schema:name','skos:altLabel','skos:prefLabel'])\n","parameters":{},"errorReasonKey":"validationError","status":"BAD_REQUEST","statusCode":400}%
martinjoconnor commented 10 months ago

There is a wrinkle here (and I may have to address it in the library). In order to post to CEDAR, several properties have to be set to null. See:

https://metadatacenter.readthedocs.io/en/latest/developer-guide/template-element-and-fields/

Try manually for the moment. I will look at adding these to artifacts in the library.

martinjoconnor commented 10 months ago

There is also a validation endpoint.

e.g.,

curl -X POST --data-binary @MyFile.json --header "Authorization: apiKey XXX" --header 'Accept: application/json' 'https://resource.metadatacenter.org/command/validate?resource_type=field'
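
For anyone calling the validation endpoint from Java rather than curl, here is a minimal sketch using the JDK's built-in `java.net.http` API (Java 11+). The URL and headers are copied from the curl command above; the class and method names are hypothetical and not part of cedar-nih-tools. No Content-Type header is set here, so add one if the server turns out to require it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ValidateRequestSketch {
    // Endpoint taken from the curl example above.
    static final String VALIDATE_URL =
        "https://resource.metadatacenter.org/command/validate?resource_type=field";

    /** Builds (but does not send) a POST request equivalent to the curl command. */
    static HttpRequest buildValidateRequest(String apiKey, String fieldJson) {
        return HttpRequest.newBuilder()
            .uri(URI.create(VALIDATE_URL))
            .header("Authorization", "apiKey " + apiKey)
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(fieldJson))
            .build();
    }

    public static void main(String[] args) {
        // "XXX" and "{}" are placeholders for a real API key and field JSON.
        HttpRequest request = buildValidateRequest("XXX", "{}");
        System.out.println(request.method() + " " + request.uri());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```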
jkyu commented 10 months ago

I tried manually adding the required fields listed in the error messages. The POST works now. I tried a few permutations of the fields that needed to be added. It looks like I was missing this block under @context:

    "pav:createdOn": {
      "@type": "xsd:dateTime"
    },
    "pav:createdBy": {
      "@type": "@id"
    },
    "oslc:modifiedBy": {
      "@type": "@id"
    },
    "pav:lastUpdatedOn": {
      "@type": "xsd:dateTime"
    },
    "schema:description": {
      "@type": "xsd:string"
    },
    "skos:prefLabel": {
      "@type": "xsd:string"
    },
    "skos:altLabel": {
      "@type": "xsd:string"
    },
    "schema:name": {
      "@type": "xsd:string"
    }

I was setting schema:name, schema:description, skos:prefLabel, and skos:altLabel in the code, but I also needed to include pav:createdBy, pav:createdOn, pav:lastUpdatedOn, and oslc:modifiedBy. I started by assigning the date or URI manually, but I found that CEDAR overwrites them anyway when it creates the field. I repeated the POST including the missing fields and setting them to null and that worked -- the null fields were populated by CEDAR. These are in fact optional fields, but I think they need to show up in the JSON schema with placeholders or nulls to be accepted by CEDAR. I think I can deal with this by using the appropriate builder function and passing in an empty Optional.
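
As a rough illustration of the "explicit null placeholder" idea, here is a sketch that treats the field as a plain `Map` and adds the four provenance keys as nulls when they are absent. This is not how the cedar-artifact-library does it; the class and method names are hypothetical, and it assumes whatever JSON serializer is used afterwards is configured to write null values rather than drop them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProvenancePlaceholders {
    // The four provenance properties that CEDAR populated itself in the test above.
    static final List<String> PROVENANCE_KEYS = List.of(
        "pav:createdBy", "pav:createdOn", "pav:lastUpdatedOn", "oslc:modifiedBy");

    /** Returns a copy of the field with any absent provenance key added as an explicit null. */
    static Map<String, Object> withNullPlaceholders(Map<String, Object> field) {
        // LinkedHashMap keeps insertion order and permits null values.
        Map<String, Object> result = new LinkedHashMap<>(field);
        for (String key : PROVENANCE_KEYS) {
            if (!result.containsKey(key)) {
                result.put(key, null);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("schema:name", "Example field"); // illustrative value only
        System.out.println(withNullPlaceholders(field).keySet());
    }
}
```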

I ran into another error in this set of JSON fields: "/properties/@type/oneOf/<index>/format". The format is set to termUri by the cedar-artifact-library and it seems like it needs to be just uri to pass validation. I'm not sure if this one is user error or a bug.

matthewhorridge commented 10 months ago

There is a wrinkle here (and I may have to address it in the library). In order to post to CEDAR, several properties have to be set to null. See:

https://metadatacenter.readthedocs.io/en/latest/developer-guide/template-element-and-fields/

Try manually for the moment. I will look at adding these to artifacts in the library.

Any chance of making this easier by allowing the properties to be omitted? I don't really see the difference between forcing

{ "FirstName" : "John", "MiddleName" : null, "LastName" : "Smith" }

and

{ "FirstName" : "John", "LastName" : "Smith" }

(except that the last one is better, IMO).

jkyu commented 10 months ago

I think I can deal with this by using the appropriate builder function and passing in an empty Optional.

Turns out I can't do this with the builder -- the constructor for FieldSchemaArtifact takes Optional but the builder does not.

In order to post to CEDAR, several properties have to be set to null.

I found the code where the renderer does this here. We would have to make the code write regardless (either the value or a null). I'm not sure if something breaks if we do this for all optional fields, but I think I know what the minimal set of fields should be.

{ "FirstName" : "John", "LastName" : "Smith" }

I agree that this would be better, but I think it'll require an update to the service.

jkyu commented 10 months ago

Talked to @martinjoconnor about the changes that need to be made in the cedar-artifact-library to support this. I'm going to take a stab at the code change and have him review it later.

jkyu commented 10 months ago

From the CEDAR dev sync:

jkyu commented 10 months ago

The tool mostly works now. It converts CDEs exported from the NIH CDE repository in json format into a format that CEDAR accepts and then also POSTs the outputs to CEDAR. I added handling for input constraints (which I think are correct), and I made the necessary fixes/changes to the cedar artifact library to handle this.

There is an issue where a POST request can time out. I think this is normal, since this program accesses an external resource. I tried setting the connection and socket timeouts to 10 seconds and still saw this issue. I think we would need to implement a retry strategy with some backoff or some checkpointing mechanism to prevent duplicate CDEs (for manual retries) if we want to fix this.
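
A minimal sketch of the retry-with-backoff idea, assuming the failures surface as `IOException` subclasses (`java.net.SocketException` and `java.net.ConnectException` both are; `java.util.concurrent.TimeoutException` is not and would need its own catch). The class name, method names, and parameters are hypothetical and are not the retry code in cedar-nih-tools.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Delay before the retry that follows 0-based attempt: base * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // clamp the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    /** Runs the call, retrying on IOException, for at most maxAttempts attempts in total. */
    static <T> T withRetry(Callable<T> call, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated POST that fails twice before succeeding.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.net.ConnectException("Operation timed out");
            }
            return "posted";
        }, 5, 10, 1000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying alone does not prevent duplicates: if a request reached the server before the client saw an error, a retry could create the same CDE twice, which is why some checkpointing (for example, recording which CDE identifiers were already accepted) would still be needed.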

Still need:

matthewhorridge commented 10 months ago

Where does the timeout come from? I've never had these problems with CEDAR. I have had to rate limit calls to BioPortal previously though.

marcosmro commented 10 months ago

I'm not aware of any CEDAR timeout caused by a POST request. @jkyu perhaps discuss that issue with @egyedia, he'll know the cause of it.

jkyu commented 10 months ago

I've gotten SocketException and TimeoutException. I added a retry mechanism that handles this, but I'll undo it and get the exact stack traces to figure out if something is wrong.

I'm sending each CDE to CEDAR in a separate POST request and I've been getting one timeout every ~450 CDEs.

martinjoconnor commented 10 months ago

Definitely get @egyedia to take a look at this.

jkyu commented 10 months ago

This is the stack trace after a few thousand validation requests:

Exception in thread "main" java.net.ConnectException: Operation timed out
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at java.base/sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:174)
    at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:183)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:533)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:638)
    at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)
    at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:380)
    at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128)
    at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1430)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1401)
    at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:220)
    at org.metadatacenter.nih.ingestor.poster.Poster.validate(Poster.java:41)
    at org.metadatacenter.nih.ingestor.NIHCDEConverter.main(NIHCDEConverter.java:34)

This happened around 3:30 pm on Monday Oct 30 and I tried to post this JSON:

{"pav:createdBy":null,"oslc:modifiedBy":null,"pav:createdOn":null,"pav:lastUpdatedOn":null,"@type":"https://schema.metadatacenter.org/core/TemplateField","@id":null,"$schema":"http://json-schema.org/draft-04/schema#","type":"object","title":"Quality of Life - Upper extremity reach and get down 5 pound object from above head scale field schema","description":"Quality of Life - Upper extremity reach and get down 5 pound object from above head scale field schema generated by the CEDAR Artifact Library","schema:name":"Quality of Life - Upper extremity reach and get down 5 pound object from above head scale","schema:description":"The scale which represents the extent the participant was able to reach and get down a five pound object from above their head, as a part of Patient-Reported Outcome Measurement Information System (PROMIS) Upper Extremity.","schema:schemaVersion":"1.6.0","schema:identifier":"Q1WPRa56TU","pav:version":"0.0.1","bibo:status":"bibo:draft","@context":{"schema":"http://schema.org/","pav":"http://purl.org/pav/","xsd":"http://www.w3.org/2001/XMLSchema#","skos":"http://www.w3.org/2004/02/skos/core#","bibo":"http://purl.org/ontology/bibo/","oslc":"http://open-services.net/ns/core#","schema:name":{"@type":"xsd:string"},"schema:description":{"@type":"xsd:string"},"pav:createdOn":{"@type":"xsd:dateTime"},"pav:createdBy":{"@type":"@id"},"pav:lastUpdatedOn":{"@type":"xsd:dateTime"},"oslc:modifiedBy":{"@type":"@id"},"skos:prefLabel":{"@type":"xsd:string"},"skos:altLabel":{"@type":"xsd:string"}},"properties":{"@type":{"oneOf":[{"type":"string","format":"uri"},{"type":"array","minItems":1,"items":{"type":"string","format":"uri"},"uniqueItems":true}]},"rdfs:label":{"type":["string","null"]},"@value":{"type":["string","null"]}},"additionalProperties":false,"_valueConstraints":{"literals":[{"label":"5 - Without any Difficulty"},{"label":"4 - With a Little Difficulty"},{"label":"3 - With Some Difficulty"},{"label":"2 - With Much Difficulty"},{"label":"1 - Unable to Do"}],"requiredValue":false,"multipleChoice":true},"skos:prefLabel":"Quality of Life - Upper extremity reach and get down 5 pound object from above head scale","skos:altLabel":["Quality of Life - Upper extremity reach and get down 5 pound object from above head scale"],"_ui":{"inputType":"list"}}

This might not matter because it looks like the timeout happened on connection. I'll let @egyedia know.

egyedia commented 10 months ago

Could you please try to get the resource server log/error that shows up when this error occurs on your end? I suppose there should be one.

egyedia commented 10 months ago

Are you posting the CDEs to the prod server, or your local dev server?

jkyu commented 10 months ago

I'm posting to the prod server. I am getting a response code 200 for the connection (which is good). The connection times out instead of failing or completing, which causes the timeout exceptions. Here's the response:

{"validates":"false","warnings":[],"errors":[{"message":"object has missing required properties (['$schema','@context','@id','@type','_ui','_valueConstraints','additionalProperties','description','oslc:modifiedBy','pav:createdBy','pav:createdOn','pav:lastUpdatedOn','properties','schema:description','schema:name','schema:schemaVersion','title','type'])","location":"/","additionalInfo":{"schemaFile":"#","schemaPointer":""}}]}

The response looks like I'm sending a json object that is missing all of the required content, but the POST succeeds if it's retried. It doesn't look like this is an error caused by any specific CDE. This time around, it failed on the below CDE. Logging it here in case I see it cause another error.

{"pav:createdBy":null,"oslc:modifiedBy":null,"pav:createdOn":null,"pav:lastUpdatedOn":null,"@type":"https://schema.metadatacenter.org/core/TemplateField","@id":null,"$schema":"http://json-schema.org/draft-04/schema#","type":"object","title":"Headache Impact Test 6 follow-up score field schema","description":"Headache Impact Test 6 follow-up score field schema generated by the CEDAR Artifact Library","schema:name":"Headache Impact Test 6 follow-up score","schema:description":"Score obtained from a follow-up administration of the Headache Impact Test-6 (HIT-6)","schema:schemaVersion":"1.6.0","schema:identifier":"m1Qq_i668","pav:version":"0.0.1","bibo:status":"bibo:draft","@context":{"schema":"http://schema.org/","pav":"http://purl.org/pav/","xsd":"http://www.w3.org/2001/XMLSchema#","skos":"http://www.w3.org/2004/02/skos/core#","bibo":"http://purl.org/ontology/bibo/","oslc":"http://open-services.net/ns/core#","schema:name":{"@type":"xsd:string"},"schema:description":{"@type":"xsd:string"},"pav:createdOn":{"@type":"xsd:dateTime"},"pav:createdBy":{"@type":"@id"},"pav:lastUpdatedOn":{"@type":"xsd:dateTime"},"oslc:modifiedBy":{"@type":"@id"},"skos:prefLabel":{"@type":"xsd:string"},"skos:altLabel":{"@type":"xsd:string"}},"properties":{"@type":{"oneOf":[{"type":"string","format":"uri"},{"type":"array","minItems":1,"items":{"type":"string","format":"uri"},"uniqueItems":true}]},"rdfs:label":{"type":["string","null"]},"@value":{"type":["string","null"]}},"additionalProperties":false,"_valueConstraints":{"numberType":"xsd:int","minValue":78,"requiredValue":false},"skos:prefLabel":"Headache Impact Test 6 follow-up score","skos:altLabel":["Headache Impact Test 6 follow-up score"],"_ui":{"inputType":"numeric"}}
jkyu commented 10 months ago

Running it again yields response code 200 again. The CDE causing it is different (this one passed last time).

{"pav:createdBy":null,"oslc:modifiedBy":null,"pav:createdOn":null,"pav:lastUpdatedOn":null,"@type":"https://schema.metadatacenter.org/core/TemplateField","@id":null,"$schema":"http://json-schema.org/draft-04/schema#","type":"object","title":"Childbearing potential indicator field schema","description":"Childbearing potential indicator field schema generated by the CEDAR Artifact Library","schema:name":"Childbearing potential indicator","schema:description":"Indicator of whether the participant/subject is of childbearing potential.","schema:schemaVersion":"1.6.0","schema:identifier":"KHmtAVGc1Vk","pav:version":"0.0.1","bibo:status":"bibo:draft","@context":{"schema":"http://schema.org/","pav":"http://purl.org/pav/","xsd":"http://www.w3.org/2001/XMLSchema#","skos":"http://www.w3.org/2004/02/skos/core#","bibo":"http://purl.org/ontology/bibo/","oslc":"http://open-services.net/ns/core#","schema:name":{"@type":"xsd:string"},"schema:description":{"@type":"xsd:string"},"pav:createdOn":{"@type":"xsd:dateTime"},"pav:createdBy":{"@type":"@id"},"pav:lastUpdatedOn":{"@type":"xsd:dateTime"},"oslc:modifiedBy":{"@type":"@id"},"skos:prefLabel":{"@type":"xsd:string"},"skos:altLabel":{"@type":"xsd:string"}},"properties":{"@type":{"oneOf":[{"type":"string","format":"uri"},{"type":"array","minItems":1,"items":{"type":"string","format":"uri"},"uniqueItems":true}]},"rdfs:label":{"type":["string","null"]},"@value":{"type":["string","null"]}},"additionalProperties":false,"_valueConstraints":{"literals":[{"label":"Yes - Yes"},{"label":"No - No"},{"label":"Unknown - Unknown"}],"requiredValue":false,"multipleChoice":true},"skos:prefLabel":"Childbearing potential indicator","skos:altLabel":["Childbearing potential indicator"],"_ui":{"inputType":"list"}}

Here's the response body from the POST request (same as last time):

{"validates":"false","warnings":[],"errors":[{"message":"object has missing required properties (['$schema','@context','@id','@type','_ui','_valueConstraints','additionalProperties','description','oslc:modifiedBy','pav:createdBy','pav:createdOn','pav:lastUpdatedOn','properties','schema:description','schema:name','schema:schemaVersion','title','type'])","location":"/","additionalInfo":{"schemaFile":"#","schemaPointer":""}}]}

And the exception is the same as last time:

Exception in thread "main" java.net.ConnectException: Operation timed out
jkyu commented 10 months ago

Progress update / task summary:

We developed a Java library with a command line interface that ingests CDEs exported from the NIH CDE repository. All of the information required to specify a CDE, including datatypes, labels, and constraints, is covered. The ingested CDEs can be accessed here: https://cedar.metadatacenter.org/dashboard?folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2F374c6bb4-69ce-4366-bc36-6c07bea55548. For now, any future additions to the NIH CDE repository will need to be manually ingested by running this tool.

TODO:

martinjoconnor commented 10 months ago

I'm guessing the CDEs had no version information? NCI CDEs did.

I'm wondering if we want to indicate that these are trusted artifacts in the same way as we do for NCI CDEs?

https://cedar.metadatacenter.org/dashboard?folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2F1ee5ef41-0605-4c18-9054-b01eb4290339

jkyu commented 10 months ago

There might be a hidden version tag. I see a __v field that I did not do anything with (although some of these have very large numbers, like 4353). I can add these in.

I was running the ingestion, but I canceled it. I think it might be a good idea to review what I have so far before creating all of the fields in CEDAR, since I might be missing some stuff (like versions).

jkyu commented 10 months ago

From 2:37 to 2:55, 2600 CDEs were ingested. There are around 28000 total CDEs in the full set of NIH CDEs, so extrapolating from this, I would expect ingestion to take around 3 hours.
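
The extrapolation works out as follows, assuming the 2:37 to 2:55 window is 18 minutes and the ingestion rate stays constant over the full set:

```latex
\[
\text{rate} \approx \frac{2600\ \text{CDEs}}{18\ \text{min}} \approx 144\ \text{CDEs/min},
\qquad
t \approx \frac{28000\ \text{CDEs}}{144\ \text{CDEs/min}} \approx 194\ \text{min} \approx 3.2\ \text{h}
\]
```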

martinjoconnor commented 10 months ago

CEDAR will only allow semantic versioning. We could arbitrarily make everything 1.0.0, though I am not sure about this. I do think we need to make all of these non-draft, though.

As you suggested, let's review (perhaps at RADx technical meeting next week).
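
To make the versioning mismatch concrete, here is a small sketch that checks whether a string has the MAJOR.MINOR.PATCH shape. It only illustrates why a bare `__v` integer such as 4353 is not a semantic version; CEDAR's actual validation rule is not shown in this thread, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class SemVerCheck {
    // MAJOR.MINOR.PATCH with no leading zeros.
    // Pre-release and build-metadata suffixes are deliberately not handled in this sketch.
    private static final Pattern CORE_SEMVER =
        Pattern.compile("(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)");

    /** Returns true if the whole string is three dot-separated non-negative integers. */
    static boolean isCoreSemVer(String version) {
        return version != null && CORE_SEMVER.matcher(version).matches();
    }

    public static void main(String[] args) {
        System.out.println("1.0.0 -> " + isCoreSemVer("1.0.0"));
        System.out.println("4353  -> " + isCoreSemVer("4353"));
    }
}
```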

jkyu commented 10 months ago

Sounds good. I'll hold off on pushing these into CEDAR until we review next week.

I see an API command for publishing, so I'll figure that out on some of these in my personal folder.

marcosmro commented 10 months ago

I agree with @martinjoconnor that it would be good to make those CDEs 'trusted artifacts'. That can be done by updating the CEDAR_TRUSTED_FOLDERS environment variable in production with the id of the NIH CDEs folder. Additionally, it would be desirable to add that folder to the root of 'Community Folders', as we did with the NCI CDEs.

@egyedia will know how to do both things.

jkyu commented 10 months ago

I worked with @egyedia last week to put the NIH CDEs folder at the "Community Folders" root and add "Trusted by NIH CDE Repository" badges to the CEDAR fields in that folder.

@marcosmro brought up an issue with the Value List CDEs allowing multi-select. I fixed this and republished the CDEs over the weekend. They're located here: https://cedar.metadatacenter.org/dashboard?folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2F374c6bb4-69ce-4366-bc36-6c07bea55548