IHCC-cohorts / data-harmonization

International HundredK+ Cohorts Consortium (IHCC) Data Harmonization
Apache License 2.0

Updates to data for May demo #44

Closed rosibaj closed 4 years ago

rosibaj commented 4 years ago

Based on the April 17th meeting, we will want to do one update of the data before the next demo.

These updates will include:

From the JSON provided, we had to fix a couple things:

  1. JSON structure must be consistent between all objects. A category cannot be both an array and an object. Examples: https://github.com/jamesaoverton/IHCC/blob/54947b2edec2e4fd93dc899afbcba3a6008877b0/data/cohort-data.json#L31 https://github.com/jamesaoverton/IHCC/blob/54947b2edec2e4fd93dc899afbcba3a6008877b0/data/cohort-data.json#L76
  2. No special characters or spaces in field names; use only underscores as separators. Example: https://github.com/jamesaoverton/IHCC/blob/54947b2edec2e4fd93dc899afbcba3a6008877b0/data/cohort-data.json#L118
  3. Change the available_data_types array to individual boolean fields at the top level; treat an empty cell in the CSV as false. Example:
    {
      "genomic_data": true,
      "clinical_data": true,
      "phenotype_data": false,
      ...
    }
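
A minimal sketch of item 3, assuming the CSV has one column per data type and that any non-empty cell means the data type is available. The column names in `DATA_TYPE_FIELDS` and the sample row are hypothetical; the real headers in member_cohorts.csv may differ.

```python
import csv
import io

# Hypothetical column names; the real CSV headers may differ.
DATA_TYPE_FIELDS = ["genomic_data", "clinical_data", "phenotype_data"]


def data_type_flags(row: dict) -> dict:
    """Return one boolean per data type.

    Assumption: an empty or missing cell is False and any other
    content (for example "x") is True.
    """
    return {
        field: bool((row.get(field) or "").strip())
        for field in DATA_TYPE_FIELDS
    }


# Made-up sample data, for illustration only.
sample_csv = (
    "cohort_name,genomic_data,clinical_data,phenotype_data\n"
    "Example Cohort,x,x,\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
flags = data_type_flags(rows[0])
```

Here `flags` holds three top-level booleans that can be merged into the cohort document in place of the available_data_types array.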
jamesaoverton commented 4 years ago

Thanks! @beckyjackson will make those changes, then add all the cohorts from https://github.com/jamesaoverton/IHCC/blob/master/data/member_cohorts.csv

I guess there's 67 rows, for 66 cohorts.

rosibaj commented 4 years ago

@beckyjackson one additional request to change the available_data_types format (edited as No. 3 above)

beckyjackson commented 4 years ago

@rosibaj - for the consistency, would something like this work for you, or do you want the lowest-level always to be a list?

"questionnaire/survey data": {
  "lifestyle and behaviours": {
    "alcohol": {},
    "nutrition": {},
    "sleep": {},
    "tobacco": {}
  }, ...
rosibaj commented 4 years ago

@beckyjackson No, what you have proposed above would not work, unless each of the "alcohol": {} entries has additional elements from the ontologies assigned afterwards. So I think the best assumption here is:

Yes,

beckyjackson commented 4 years ago

For many children of 'questionnaire/survey data', there are next-level children, like this example:

"questionnaire/survey data": {
  "lifestyle and behaviours": [
    "alcohol",
    "tobacco"
  ]
}

But the problem comes if a cohort only looks at signs and symptoms from questionnaire/survey data, because this one doesn't have any children:

"questionnaire/survey data": [
    "signs and symptoms"
]

How would you like the data to show up in the above case, where only 'signs and symptoms' shows up under 'questionnaire/survey data'?

Thanks for your help!

EDIT: one idea would be to put null on any that don't have children, but that might cause the same problem as the empty dictionary.

rosibaj commented 4 years ago

@beckyjackson It's imperative that the JSON structure between cohorts be exactly the same. This means that we definitely cannot have "questionnaire/survey data" as both an array and an object.

This is a harmonization issue, and I can see two possible solutions for now:

  1. ignore the values on cohorts with no children
  2. harmonize a single structure that works for all cohorts

2) seems the better option to me. We can do a structure like this:

Cohort 1 (has children)

"questionnaire/survey data": {
      "lifestyle_and_behaviours": [
          "alcohol",
          "tobacco"
      ],
      "physiological_measurements": [
          "height",
          "weight"
      ],
      "general_variables": null
}

Cohort 2 (has no children), so populate the general_variables array with those values:

"questionnaire/survey data": {
      "lifestyle_and_behaviours": null,
      "physiological_measurements": null,
      "general_variables": ["signs and symptoms"] <-- group all cases where there are no children
}

Cohort 3 (also has no children)

"questionnaire/survey data": {
      "lifestyle_and_behaviours": null,
      "physiological_measurements": null,
      "general_variables": ["any_potential_value"] <-- group all cases where there are no children
}

Note that the structure should remain the same in all cases. It's not necessary to populate the null fields (that can be taken care of automatically), but it is important that the basic structure in all cohort documents is identical.

If there are other cases of cohorts having non-child entities in other objects, this same methodology can be used.

Does this make sense?
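
A rough sketch of the single-structure option, assuming each cohort's terms arrive as a flat list and that the known categories and their children are available up front. `SCHEMA` and `harmonize` are hypothetical names, and the schema holds only the two categories used in the examples above.

```python
# Hypothetical schema: each category maps to its known child terms.
# Terms that belong to no category fall into "general_variables".
SCHEMA = {
    "lifestyle_and_behaviours": ["alcohol", "tobacco"],
    "physiological_measurements": ["height", "weight"],
}


def harmonize(collected: list) -> dict:
    """Build a document that has the same keys for every cohort.

    Children of a schema category are listed under that category;
    anything else goes into the catch-all "general_variables" list.
    Categories with nothing to report stay null (None).
    """
    doc = {category: None for category in SCHEMA}
    doc["general_variables"] = None
    child_to_category = {
        child: category
        for category, children in SCHEMA.items()
        for child in children
    }
    for term in collected:
        key = child_to_category.get(term, "general_variables")
        if doc[key] is None:
            doc[key] = []
        doc[key].append(term)
    return doc


cohort_1 = harmonize(["alcohol", "tobacco", "height", "weight"])
cohort_2 = harmonize(["signs and symptoms"])
```

Both results carry the same three keys, so only the values differ between cohorts. The sketch does not handle a cohort that reports only a broad category name.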

beckyjackson commented 4 years ago

I want to make sure I'm interpreting this correctly:

If a cohort looks at only signs and symptoms, they get:

"questionnaire/survey data": {
    "general_variables": ["signs and symptoms"]
}

If a cohort looks at medication as a broad category (without any of its sub-categories) and at signs and symptoms, medication gets null, since it has children but none apply to this cohort, and they get:

"questionnaire/survey data": {
    "medication": null,
    "general_variables": ["signs and symptoms"]
}

Finally, if a cohort were to look at posology from medication, they would get:

"questionnaire/survey data": {
    "medication": ["posology"]
}
rosibaj commented 4 years ago

@beckyjackson i think that looks correct based on your description of cohorts!

beckyjackson commented 4 years ago

Great - following this pattern, I regenerated the cohort data for what we currently have. If you have a chance, can you take a look at it and make sure it's what you expect? https://github.com/jamesaoverton/IHCC/blob/data-update/data/cohort-data.json

Thanks so much!
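
One way to check a regenerated file against the requirement that a category is never an array in one cohort and an object in another is a small structural check. This is a sketch only: `conflicting_paths` and its helpers are hypothetical, and it assumes the cohort documents have already been loaded as a list of Python dicts.

```python
from typing import Optional


def kind(value) -> str:
    """Classify a JSON value by its container type."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "scalar"  # strings, numbers, booleans and null


def collect_kinds(doc: dict, path: tuple = (), seen: Optional[dict] = None) -> dict:
    """Record, for every key path in `doc`, which kinds of value appear."""
    if seen is None:
        seen = {}
    for key, value in doc.items():
        key_path = path + (key,)
        seen.setdefault(key_path, set()).add(kind(value))
        if isinstance(value, dict):
            collect_kinds(value, key_path, seen)
    return seen


def conflicting_paths(cohorts: list) -> list:
    """Return key paths that are an object in one cohort and an array in another."""
    seen: dict = {}
    for cohort in cohorts:
        collect_kinds(cohort, (), seen)
    return sorted(
        path for path, kinds in seen.items() if {"object", "array"} <= kinds
    )


# The two shapes discussed earlier in the thread.
cohorts = [
    {"questionnaire/survey data": {"lifestyle and behaviours": ["alcohol", "tobacco"]}},
    {"questionnaire/survey data": ["signs and symptoms"]},
]
problems = conflicting_paths(cohorts)
```

Paths are kept as tuples because the original field names contain "/" themselves. A null next to an array is deliberately not flagged, since the agreed structure allows it.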

rosibaj commented 4 years ago

@beckyjackson I have reviewed - This structure looks great but there is still one small issue in the naming of fields:

Can we please remove spaces/special characters from field names? For example:

rosibaj commented 4 years ago

@beckyjackson Sorry for the multiple comments! They were a result of the GitHub incident!

beckyjackson commented 4 years ago

No worries! I couldn't even respond haha. I updated the file in the PR to remove the spaces and special characters.
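
A sketch of one way to do this renaming, not necessarily the code used in the PR. It assumes lowercase names with underscores as the only separator; `clean_field_name` and `clean_keys` are hypothetical helpers.

```python
import re


def clean_field_name(name: str) -> str:
    """Lowercase the name and replace each run of non-alphanumerics with one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def clean_keys(value):
    """Rename every object key; string values inside lists are left unchanged."""
    if isinstance(value, dict):
        return {clean_field_name(k): clean_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_keys(item) for item in value]
    return value


before = {
    "questionnaire/survey data": {
        "lifestyle and behaviours": ["alcohol", "tobacco"]
    }
}
after = clean_keys(before)
```

Two different original names could collapse to the same cleaned name, so a real conversion would need to check for collisions.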

jamesaoverton commented 4 years ago

@rosibaj We're hoping that https://github.com/jamesaoverton/IHCC/blob/master/data/full-cohort-data.json is exactly what you want. If you have any trouble with it, please reopen this issue.

rosibaj commented 4 years ago

@jamesaoverton It seems I don't have permission to reopen this issue. There are 2 small things: Is it possible to have these fixed?

beckyjackson commented 4 years ago

Hi @rosibaj - thanks for catching that typo, that's an easy fix to make!

As for your second point, I'm not sure I understand what you're asking.

In the current structure, if a cohort looks at a broad category that has children, but the cohort doesn't care about the children, we give that category a null to keep the structure the same. For example, if a cohort looks at weight:

"questionnaire_survey_data": {
    ...
    "anthropometry": ["weight"],
    ...
}

But if they just look at the broad category of "anthropometry", they get:

"questionnaire_survey_data": {
    ...
    "anthropometry": null,
    ...
}

If we were to include all fields, I worry you'd end up with two separate cases that get null values. I think you want all the top-level categories, correct? But if a cohort looks at just the broad category, it gets null. Then, it would also get null if it doesn't look at it at all.

Perhaps I'm not understanding your request correctly, so could you provide an example? Thank you so much!

rosibaj commented 4 years ago

@beckyjackson

I understand the distinction that you are making. In the long run, I think a better way to do this is to assign an explicit value (e.g. "Not gathered"). However, that's a discussion for a later time!

For the sake of the demo, can we just have the data formatted like this for now:

"questionnaire_survey_data": {
    ...
    "anthropometry": ["weight"],
    ...
}

But if they just look at the broad category of "anthropometry", rather than putting null, make it an empty array:

"questionnaire_survey_data": {
    ...
    "anthropometry": [],
    ...
}
beckyjackson commented 4 years ago

Sure, that's not a problem. Then, do you want the upper-level broad categories to all be displayed, and have null if they are not collected?

rosibaj commented 4 years ago

@beckyjackson Yes, but populated as empty data. Using biosample as an example, I would expect to see this in the "Genomics England / 100,000 Genomes Project" cohort document:

https://github.com/jamesaoverton/IHCC/blob/96848a04b640e29d5e323c27f2502744d93324a5/data/cohort-data.json#L86

"biosample": {
         "sample_type": [ ] 
},
beckyjackson commented 4 years ago

I'm sorry, I'm still a bit confused. Going back to the anthropometry data: say they don't collect any questionnaire/survey data. Are you saying you would still want to display all the fields that appear in the CINECA structure, even though nothing is mapped to them, like this:

"questionnaire_survey_data": {
    ...
    "anthropometry": [],
    ...
}

How does this differ from if they do collect anthropometry data, but don't have something mapped to the children (currently displayed as "anthropometry": null, but will be changed to "anthropometry": [])?

rosibaj commented 4 years ago

@beckyjackson In your examples, you are making the assumption that "anthropometry": [] and "anthropometry": null mean different things. In fact, in the way that you are translating this information, they do mean different things.

In the faceted search display (based on Elasticsearch index) there is no functional difference between a null and empty field. Both would be an empty facet in the display (meaning the same thing: this data does not exist for this filter). Inconsistent documents are however harder to work with, which is why we prefer to at least have the empty data state.
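
The point about null and empty fields can be illustrated with a plain-Python stand-in for facet counting. It does not call Elasticsearch; `facet_counts` and the sample documents are hypothetical, and the sketch only mimics the idea that a field with no values produces no facet entries.

```python
from collections import Counter


def facet_counts(documents: list, field: str) -> Counter:
    """Count how many documents carry each value of `field`.

    A missing key, a null and an empty list all contribute nothing,
    so they are indistinguishable in the resulting counts.
    """
    counts = Counter()
    for doc in documents:
        for value in set(doc.get(field) or []):
            counts[value] += 1
    return counts


docs = [
    {"anthropometry": ["weight", "height"]},
    {"anthropometry": []},
    {"anthropometry": None},
]
counts = facet_counts(docs, "anthropometry")
```

Only the first document contributes to `counts`; the second and third look identical from the facet's point of view.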

I think that a possible solution to this is to include the category as a value of itself. For example:

  1. Cohort does not collect anything from questionnaire and survey data:
    "questionnaire_survey_data": {
        ...
        "anthropometry": ["No Data"],
        ...
    }
  2. Cohort collects anthropometry but no children:
    "questionnaire_survey_data": {
        ...
        "anthropometry": ["anthropometry"],
        ...
    }
  3. Cohort collects anthropometry with children:
    "questionnaire_survey_data": {
        ...
        "anthropometry": ["weight", "height"],
        ...
    }

However, this constitutes a larger data modelling question that is not in the scope of this demo and warrants more thought.

For the demo purposes, can we continue with the empty array display?
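
For reference, the three cases listed above could be encoded along these lines. This is only a sketch of the proposal that was set aside for the demo; `encode_category` is a hypothetical helper.

```python
NO_DATA = "No Data"  # sentinel string used in the first case above


def encode_category(name: str, collected: bool, children: list) -> list:
    """Return the array for one category under the three-case proposal."""
    if not collected:
        return [NO_DATA]      # the cohort does not collect this category
    if not children:
        return [name]         # collected, but no child terms are mapped
    return list(children)     # collected, with specific child terms


not_collected = encode_category("anthropometry", False, [])
broad_only = encode_category("anthropometry", True, [])
with_children = encode_category("anthropometry", True, ["weight", "height"])
```

Every case yields a non-empty array of strings, so "not collected" and "collected without children" stay distinguishable.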

beckyjackson commented 4 years ago

I changed the null values to empty arrays and fixed the typo - please see the data file here.
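
A minimal sketch of that kind of conversion, not necessarily the code that was actually run. It assumes every null in the document stands for an array-valued category, so a null scalar field would also become an empty array; `nulls_to_empty_lists` is a hypothetical helper.

```python
import json


def nulls_to_empty_lists(value):
    """Recursively replace null (None) object values with empty lists."""
    if value is None:
        return []
    if isinstance(value, dict):
        return {key: nulls_to_empty_lists(item) for key, item in value.items()}
    return value  # existing lists and scalars are kept as they are


doc = json.loads(
    '{"questionnaire_survey_data": {"anthropometry": null, "medication": ["posology"]}}'
)
fixed = nulls_to_empty_lists(doc)
fixed_json = json.dumps(fixed, indent=2)
```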

jamesaoverton commented 4 years ago

@rosibaj We've updated https://github.com/jamesaoverton/IHCC/blob/master/data/full-cohort-data.json with those changes.