Generate list of CINECA terms for each cohort

jamesaoverton commented 4 years ago

As discussed on our 2020-03-27 call, we need to supply a list of the cohorts and the data they collect using the CINECA model. I think the following JSON will work:

object
- key: cohort name string
- value: array of CINECA label strings

{
  "KoGES": ["biological sex", "blood", "urine", ...],
  "Genomics England": ["biological sex", "blood", "saliva", ...],
  ...
}

We will generate this data from more detailed ontology mappings.
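As a rough sketch of that generation step, assuming the mappings can be reduced to (cohort, CINECA label) pairs: the function name `labels_by_cohort` and the sample rows are illustrative only and are not taken from the actual mapping files.

```python
from collections import defaultdict


def labels_by_cohort(mappings):
    """Group (cohort, label) pairs into {cohort: [labels]}, dropping repeats."""
    grouped = defaultdict(list)
    for cohort, label in mappings:
        # Keep first-seen order and skip labels already recorded for the cohort.
        if label not in grouped[cohort]:
            grouped[cohort].append(label)
    return dict(grouped)


# Illustrative rows only; the real mappings carry more detail per variable.
rows = [
    ("KoGES", "biological sex"),
    ("KoGES", "blood"),
    ("KoGES", "blood"),
    ("Genomics England", "saliva"),
]
cohort_labels = labels_by_cohort(rows)
```

Passing `cohort_labels` to `json.dumps` would then give an object of the shape described above.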

@mcourtot Does this sound right?

jamesaoverton commented 4 years ago

@beckyjackson generated this full version for KoGES from the latest mappings. Note that when a specific category such as "urine" is included, we also include its ancestors "sample type" and "biosample".

{
  "KoGES": [
    "questionnaire/survey data",
    "lifestyle and behaviours",
    "microbial data",
    "microbiology",
    "laboratory measures",
    "sample type",
    "biosample",
    "urine",
    "blood",
    "physiological measurements",
    "anthropometry",
    "diseases",
    "socio-demographic and economic characteristics",
    "circulation and respiration",
    "family and household structure",
    "weight",
    "height",
    "fasting or non-fasting",
    "heart rate",
    "blood pressure"
  ]
}
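The ancestor rule can be sketched roughly as follows. This assumes each term has at most one parent, held in a hypothetical `PARENT_OF` lookup; the real script works from the ontology mappings and may differ.

```python
def with_ancestors(labels, parent_of):
    """Return the labels plus every ancestor reachable through parent_of."""
    expanded = []
    seen = set()
    for label in labels:
        current = label
        # Walk upward until we run out of parents or reach a term already added.
        while current is not None and current not in seen:
            seen.add(current)
            expanded.append(current)
            current = parent_of.get(current)
    return expanded


# Hypothetical child -> parent lookup covering only the terms mentioned above.
PARENT_OF = {
    "urine": "sample type",
    "blood": "sample type",
    "sample type": "biosample",
}

print(with_ancestors(["urine", "blood"], PARENT_OF))
# prints ['urine', 'sample type', 'biosample', 'blood']
```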

rosibaj commented 4 years ago

@jamesaoverton I have reviewed the proposed JSON above.

It would be best if we could get this in a slightly different format. I have an example here of how the JSON for each cohort could be delivered: https://github.com/hlminh2000/ihcc_demo/blob/master/api/assets/documents.json

I have divided this JSON into two areas. The first is cohort administration, which I know you may not have access to right now; if you don't have this information, I can fill it in manually for the demo. The second covers the areas that you can help produce, which I mapped from the CINECA model:

cohort_attributes
biosample
laboratory

Please let me know if you have any questions!

beckyjackson commented 4 years ago

I created a new script to generate something like that example for KoGES. I'm working on mapping GCS (Golestan Cohort Study) now, so I can have a second example soon. Let me know if this looks good, or if you see anything that could use fixing!

Edit: just added GCS. Please note that this is not fully mapped yet, so this doesn't represent all the variables they collect.

[
  {
    "cohort_name": "Korean Genome and Epidemiology Study (KoGES)",
    "countries": [
      "South Korea",
      "Vietnam",
      "Cambodia",
      "Japan",
      "China"
    ],
    "pi_lead": "Sung Soo Kim",
    "website": "http://www.nih.go.kr/NIH/eng/contents/NihEngContentView.jsp?cid=65199&menuIds=HOME004-MNU2261-MNU2262-MNU2263-MNU2264",
    "biosample": {
      "sample type": [
        "blood",
        "urine"
      ]
    },
    "laboratory measures": {
      "microbiology": [
        "microbial data"
      ]
    },
    "questionnaire/survey data": {
      "lifestyle and behaviours": [
        "alcohol",
        "sleep"
      ],
      "physiological measurements": {
        "anthropometry": [
          "height",
          "weight"
        ],
        "circulation and respiration": [
          "blood pressure",
          "heart rate"
        ]
      },
      "socio-demographic and economic characteristics": [
        "education",
        "family and household structure",
        "occupation"
      ]
    }
  },
  {
    "cohort_name": "Golestan Cohort Study",
    "countries": [
      "Iran"
    ],
    "pi_lead": "Reza Malekzadeh, Christian Abnet, Paolo Boffetta, Paul Brennan, Farin Kamangar, Arash Etemadi",
    "website": "https://dceg2.cancer.gov/gemshare/",
    "basic cohort attributes": [
      "demographic data"
    ],
    "biosample": {
      "sample type": [
        "blood"
      ]
    },
    "questionnaire/survey data": [
      "signs and symptoms"
    ]
  }
]
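For comparing this nested format with the earlier flat list, a short sketch like the following could walk one cohort object and collect every term. It assumes the four administrative keys shown above are the only non-term keys; `METADATA_KEYS`, `collect_terms`, and `cohort_terms` are names made up for this illustration.

```python
# Administrative keys from the example above; assumed to be the complete set.
METADATA_KEYS = {"cohort_name", "countries", "pi_lead", "website"}


def collect_terms(node):
    """Yield every term in a nested dict/list structure, parents first."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield key
            yield from collect_terms(child)
    elif isinstance(node, list):
        for item in node:
            yield from collect_terms(item)
    else:
        # A plain string is a leaf term.
        yield node


def cohort_terms(cohort):
    """Flatten one cohort object into a list of terms, skipping metadata."""
    data = {k: v for k, v in cohort.items() if k not in METADATA_KEYS}
    return list(collect_terms(data))


gcs = {
    "cohort_name": "Golestan Cohort Study",
    "countries": ["Iran"],
    "basic cohort attributes": ["demographic data"],
    "biosample": {"sample type": ["blood"]},
    "questionnaire/survey data": ["signs and symptoms"],
}
print(cohort_terms(gcs))
```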

jamesaoverton commented 4 years ago

@rosibaj Is this format acceptable?

@mcourtot We're generating this JSON from this table https://ihcc.g2mc.org/membercohorts/ (despite some weirdness) and our mappings. The mappings are a work in progress, but we could generate fake mappings. Then we could supply JSON for a large number of cohorts from that table. Would that be helpful?

mcourtot commented 4 years ago

Do we want to have one gigantic list, or do we want to split? Most projects make a distinction between cohort attributes and data, see for example https://portal.dementiasplatform.uk/CohortMatrix

Maelstrom also clearly divides activities around cohort cataloguing and metadata harmonisation.

@rosibaj what is the desired format for you?

rosibaj commented 4 years ago

I have reviewed the above.

@mcourtot I'm not sure what you mean by one giant list or split, but one file with all the cohorts as objects (as presented above) will be fine for the demo.

@beckyjackson, this format looks fine. For the metadata from the member cohort list (https://ihcc.g2mc.org/membercohorts/), is it also possible to include the "available data types" as an array at the top level? I.e., that data formatted as:

"available_data_types": ["Genomic Data", "Environmental Data"...],

Once you have a final list of all available fields then I can update the comprehensive mapping on our end.

jamesaoverton commented 4 years ago

I think it makes sense to send one JSON file to @rosibaj. We'll add the available_data_types.
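A minimal sketch of that addition, assuming the data types from the member cohort table have already been read into a dict keyed by cohort name. The function name, the lookup, and the "Example Cohort" record are placeholders rather than real data.

```python
def add_available_data_types(cohorts, data_types_by_name):
    """Return copies of the cohort records with a top-level available_data_types array."""
    updated = []
    for cohort in cohorts:
        record = dict(cohort)  # shallow copy so the input list is left untouched
        # Cohorts missing from the lookup get an empty array rather than no key.
        record["available_data_types"] = data_types_by_name.get(
            cohort["cohort_name"], []
        )
        updated.append(record)
    return updated


# Placeholder input; real values would come from the member cohort table.
cohorts = [{"cohort_name": "Example Cohort", "countries": ["Example Country"]}]
lookup = {"Example Cohort": ["Genomic Data", "Environmental Data"]}
result = add_available_data_types(cohorts, lookup)
```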

jamesaoverton commented 4 years ago

@rosibaj I hope this file works for you:

https://github.com/jamesaoverton/IHCC/blob/master/data/cohort-data.json

If not, feel free to reopen this issue or open a new issue.

Note that only the KoGES mapping is complete, GCS is partial, and the other mappings are faked for testing purposes. We will update that file as we progress with the mappings.
