airr-knowledge / issues

Issues and project management for the AKC

data integration from ADC #57

Open bcorrie opened 3 months ago

bcorrie commented 3 months ago

Data integration work from the ADC is being performed under the https://github.com/airr-knowledge/ak-schema/tree/airr-export branch.

Particular code for converting data from the ADC into the AKC is here: https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/scripts/akc-convert

bcorrie commented 3 months ago

@schristley in order for the ADC to AKC conversion to work, the docker container needs the AIRR python module. When this code is merged in, the Dockerfile will be updated; I'm not sure if you want to add that sooner.

Commit to add this to the Dockerfile is here: https://github.com/airr-knowledge/ak-schema/pull/6/commits/6dc4559ca4cf56dee268213a8c3d108aecf998d1

If you want to try the code on this branch, you need to build a new docker container with the AIRR python module installed.

bcorrie commented 3 months ago

VERY basic docs on how to use the code are here: https://github.com/airr-knowledge/ak-schema/blob/airr-export/src/scripts/akc-convert/README.md

Example of study repertoire data extracted from the ADC using the following curl command:

curl -H 'content-type: application/json' -d '{"filters":{"op":"=", "content": {"field":"study.study_id", "value":"PRJNA300878"}}}' https://vdjserver.org/airr/v1/repertoire > vdjserver-PRJNA300878-ADC.json

ADC and converted AKC JSON can be found here:

https://github.com/airr-knowledge/ak-schema/tree/airr-export/examples/adc

Converted JSON was generated with the following:

python dataloader.py -v --repertoire -f vdjserver-PRJNA300878-ADC.json -o vdjserver-PRJNA300878-AKC.json --mapfile AIRR-iReceptorMapping-v1.4-2024-07-30.txt

bcorrie commented 4 weeks ago

@schristley I have been updating the consolidated mapping sheet all the way down to Assay to ensure that everything in the ak_schema in ak_specimens.yaml and ak_study_design.yaml is captured. I added a column to track where each slot is defined.

It would be great if we could remind people that if they update the schema they should update the consolidated sheet as well. This is the only mechanism that we have (at least currently) that makes it possible to maintain mappings across the platforms. I haven't checked the other files.

bcorrie commented 4 weeks ago

This means that anything not marked in blue in the consolidated sheet isn't in the AKC schema.

bcorrie commented 4 weeks ago

In the consolidated spreadsheet I have started to use the protocol of using Class in the datatype column to indicate that the slot should contain an ID to an external class object. If the multivalued column is True then the slot holds an array of IDs that point to external objects. For example in the AKC schema we have:

  participants:
    description: The participants involved with the investigation
    slot_uri: RO:0000057 # has participant
    range: Participant
    multivalued: true

And in the conversion of an AIRR study, we get something like this currently:

{
        "akc_id": "ddcc08af-e82f-49ac-b9bc-854c38c7cdd0",
        "name": "Yost et al., Clonal replacement of tumor-specific T cells following PD-1 blockade",
        "description": "Paired single-cell RNA and T cell receptor sequencing on 79,046 cells from site-matched tumors from patients with basal or squamous cell carcinoma before and after anti-PD-1 therapy.",
        "study_type": {
          "label": "Study",
          "id": "NCIT:C63536"
        },
        "archival_id": "PRJNA509910",
        "inclusion_criteria": "include: advanced or metastatic BCC or SCC not suitable for surgical resection; exclude: previous exposure to checkpoint blockade agents and systemic immunosuppressant use, treatment with radiotherapy or other anti-cancer agents within 4 weeks of first biopsy",
        "release_date": "2022-10-28T18:45:03.627212+00:00",
        "update_date": "2022-10-28T18:45:03.627212+00:00",
        "participants": [
          "3879379c-7af5-4c55-8fb6-9bd9719641ad",
          "74043684-beba-44f7-99c8-08f44fec8312",
          "2fd29ddf-1d5a-4283-9ac4-6d552442caa5",
          "b12c815c-928f-4a77-9b70-3794ea12dfb3"
        ],
        "documents": [
          "f6627e49-02a2-42f4-8c8b-ca2c010b5a78"
        ]
}

Here both participants and documents point to akc_id values for the corresponding converted AKC Participant and Reference objects.
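A minimal sketch of how a consumer might follow these ID references, assuming every converted object carries an akc_id. The two akc_id values are taken from the example above, but the Participant records, the "name" field, and the helper functions are hypothetical and are not part of the conversion code:

```python
# Hypothetical, trimmed-down objects; only akc_id and the linking slot matter here.
investigation = {
    "akc_id": "ddcc08af-e82f-49ac-b9bc-854c38c7cdd0",
    "participants": [
        "3879379c-7af5-4c55-8fb6-9bd9719641ad",
        "74043684-beba-44f7-99c8-08f44fec8312",
        "not-a-known-id",  # deliberately dangling reference
    ],
}
participants = [
    {"akc_id": "3879379c-7af5-4c55-8fb6-9bd9719641ad", "name": "participant A"},
    {"akc_id": "74043684-beba-44f7-99c8-08f44fec8312", "name": "participant B"},
]


def index_by_id(objects):
    """Build a lookup table from akc_id to the object itself."""
    return {obj["akc_id"]: obj for obj in objects}


def resolve(ids, index):
    """Split a multivalued ID slot into resolved objects and dangling IDs."""
    found = [index[i] for i in ids if i in index]
    missing = [i for i in ids if i not in index]
    return found, missing


found, missing = resolve(investigation["participants"], index_by_id(participants))
print([p["name"] for p in found])
print(missing)
```

Reporting the dangling IDs separately makes it easy to check that every reference in a multivalued slot points at an object that was actually generated.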

@schristley let me know if you have any thoughts/concerns. That was the easy part, I am now trying to unravel the mess that is the conversion of the AIRR Sample and Diagnosis models to the AKC LifeEvent, Specimen model... 8-)

bcorrie commented 4 weeks ago

Updated the JSON document in the PRJNA509910 analysis directory on the Google Drive so that it is up to date with the current code base.

Linking is working for the following:

This essentially means that the ADC Study -> Subject -> Sample linking is done, with the Sample linking on the AKC side being done via the LifeEvent object rather than directly between Participant->Specimen.
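To illustrate the indirection, here is a rough sketch of walking from a Participant to its Specimens through LifeEvents. The slot names "participant" and "life_event" and the short IDs are assumptions made for this example; the actual AKC schema may name and direct these links differently:

```python
from collections import defaultdict

# Hypothetical objects: each LifeEvent points at one Participant,
# and each Specimen points at the LifeEvent that produced it.
life_events = [
    {"akc_id": "le-1", "participant": "p-1"},
    {"akc_id": "le-2", "participant": "p-1"},
    {"akc_id": "le-3", "participant": "p-2"},
]
specimens = [
    {"akc_id": "s-1", "life_event": "le-1"},
    {"akc_id": "s-2", "life_event": "le-2"},
    {"akc_id": "s-3", "life_event": "le-3"},
]


def specimens_by_participant(life_events, specimens):
    """Group specimen IDs by participant, going through the LifeEvent link."""
    event_to_participant = {e["akc_id"]: e["participant"] for e in life_events}
    grouped = defaultdict(list)
    for specimen in specimens:
        participant = event_to_participant.get(specimen["life_event"])
        if participant is not None:  # skip specimens with an unknown LifeEvent
            grouped[participant].append(specimen["akc_id"])
    return dict(grouped)


links = specimens_by_participant(life_events, specimens)
print(links)
```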

I am pretty sure the linking will work for any other relationships where there is a 1-n relationship without extra linking objects, we just don't have any data in the ADC for things like Assays, Conclusions, Simulations at the moment.

bcorrie commented 4 weeks ago

The next challenge is to figure out how to handle ADC Diagnosis and ADC Age, both of which in the AKC model are associated with LifeEvents.

bcorrie commented 3 weeks ago

You should be able to apply this to any study in the ADC in principle, but I will need to test more. You can extract the repertoire metadata for a study with the following curl command:

curl -H 'content-type: application/json' -d '{"filters":{"op":"=", "content": {"field":"study.study_id", "value":"PRJNA300878"}}}' https://vdjserver.org/airr/v1/repertoire > vdjserver-PRJNA300878-ADC.json

Replace the study ID and the repository to get your favorite study.
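For scripting, the same request can be assembled in Python. This is a sketch only: the helper names are made up for illustration, it assumes the repository exposes the same /airr/v1/repertoire path as in the curl command, and the fetch function is defined but not called here because it needs network access:

```python
import json
import urllib.request


def build_adc_request(repository, study_id):
    """Return (url, body) for a repertoire query filtered on study.study_id."""
    url = f"https://{repository}/airr/v1/repertoire"
    query = {
        "filters": {
            "op": "=",
            "content": {"field": "study.study_id", "value": study_id},
        }
    }
    return url, json.dumps(query)


def fetch_repertoires(repository, study_id):
    """POST the query and return the decoded JSON response (requires network)."""
    url, body = build_adc_request(repository, study_id)
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


url, body = build_adc_request("vdjserver.org", "PRJNA300878")
print(url)
print(body)
```

Calling fetch_repertoires("vdjserver.org", "PRJNA300878") should be equivalent to the curl command above, with the response written to a file by the caller.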

bcorrie commented 3 weeks ago

Added a shell script to simplify processing a study:

bash akc-convert-study.sh ipa6.ireceptor.org PRJNA509910 test

Converts study with ADC study_id == PRJNA509910 (in repository ipa6.ireceptor.org) into AKC LinkML JSON in directory test

This needs to be run in the Docker container built from the Dockerfile on the airr-export branch, as it requires jq to run, which isn't in the standard Docker container.

schristley commented 3 weeks ago

@schristley let me know if you have any thoughts/concerns. That was the easy part, I am now trying to unravel the mess that is the conversion of the AIRR Sample and Diagnosis models to the AKC LifeEvent, Specimen model... 8-)

@bcorrie We aren't properly using AKC StudyArm, and from looking at the spreadsheet and schema, they also look incomplete. The AKC StudyArm should define groups of participants in a study, for example case/control would have a "case" StudyArm and a "control" StudyArm, with each StudyArm linking to its participants. If you search in the consolidated spreadsheet for "arm_participant", you will find a couple of fields which define the linkage, and they have not been properly incorporated into the schema yet. Also search for "arm_study_event" which has some linkage between StudyArm and StudyEvent. There was some discussion about whether we need both StudyEvent and LifeEvent, but I don't exactly remember the conclusion (I want to say we settled on just LifeEvent for both). Anyways, give StudyArm a thought. Can we create the StudyArms from AIRR's study_group_description? and/or maybe with disease_state_sample?

bcorrie commented 3 weeks ago

Can we create the StudyArms from AIRR's study_group_description? and/or maybe with disease_state_sample?

I don't think so - at least not in general... The AIRR standard isn't rigorous enough for this. We could potentially do this in an "ad-hoc" way in the sense that for the data that we have curated we have tried to use a controlled vocabulary to describe this. This has long been an issue for the AIRR Standard (https://github.com/airr-community/airr-standards/issues/516#issuecomment-810310989) that we have "solved" internally - but this won't work for any data in VDJServer, Scireptor, or Muenster (or work in general).

For us study_group_description will usually contain StudyArm-like descriptions, with either "Case" or "Control" as the main keyword and more detail for each study arm in parentheses, for example "Case (T1D)" and "Control (no T1D)". So we could parse this (for the studies that have this) and generate study arms based on how this field is parsed.
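A rough sketch of that parsing, assuming the "Keyword (detail)" convention described above. The function name and the choice to return None for values that do not follow the convention are illustrative, not what the conversion code does:

```python
import re

# "Case" or "Control" (any capitalisation), optionally followed by "(detail)".
GROUP_PATTERN = re.compile(r"^\s*(case|control)\b\s*(?:\((.*)\))?\s*$", re.IGNORECASE)


def parse_study_group(description):
    """Return (keyword, detail) or None if the text does not fit the convention."""
    match = GROUP_PATTERN.match(description or "")
    if match is None:
        return None
    keyword = match.group(1).capitalize()  # normalise CASE/case to Case
    detail = match.group(2).strip() if match.group(2) else None
    return keyword, detail


for text in ["Case (T1D)", "Control (no T1D)", "CASE (T1D)", "Control", "healthy volunteers"]:
    print(repr(text), "->", parse_study_group(text))
```

Free-text values that do not start with one of the two keywords are returned as None, so they would need a separate, repository-specific rule.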

Do you have a set of fields in VDJServer that could be used for this? As per usual, if we can figure out how to manage this for the iReceptor and VDJServer repositories, we have things 95% complete...

bcorrie commented 3 weeks ago

We aren't properly using AKC StudyArm, and from looking at the spreadsheet and schema, they also look incomplete.

Yes, I am in no way claiming that any of these are complete or correct. 8-)

The only thing that I can claim is that:

At some level, that was why I color coded things in the spreadsheet, as it seems to me there are some pretty big gaps in what is in the spreadsheet, what is in the schema, and what we really need/want. Bottom line is there isn't very much blue yet. The Study Arm fields are definitely lacking in the actual schema as far as I can tell, and as you suggest, I don't think they are modeling what they should be at the moment...

bcorrie commented 3 weeks ago

Also search for "arm_study_event" which has some linkage between StudyArm and StudyEvent

The AIRR standard has no such concepts as these, so there will be no content generated for these classes. We might be able to classify Participants into StudyArms (by parsing a free text field or two as described above) but that will be the most that we can do.

I am curious to see if IEDB actually uses this (at least to any great degree). How many IEDB studies have a level of detail that contains StudyEvents (it is unclear to me what these are supposed to be actually) or have complex StudyArms (more than a couple)? That seems needlessly complex to me. Certainly this would be unlikely to be populated from the studies in the ADC.

schristley commented 3 weeks ago

Can we create the StudyArms from AIRR's study_group_description? and/or maybe with disease_state_sample?

Do you have a set of fields in VDJServer that could be used for this? As per usual, if we can figure out how to manage this for the iReceptor and VDJServer repositories, we have things 95% complete...

Yes, I've been using study_group_description to indicate these groups. Many studies are observational and only have one group, but some have multiple. Try a facets query to see the values.

curl -H 'content-type:application/json' --data '{"facets":"subject.diagnosis.study_group_description"}' https://vdjserver.org/airr/v1/repertoire | jq

Your comment regarding "healthy" subjects is still valid though. The VDJServer studies have some variability in how that is indicated, but that can be fixed.

schristley commented 3 weeks ago

Also search for "arm_study_event" which has some linkage between StudyArm and StudyEvent

I am curious to see if IEDB actually uses this (at least to any great degree). How many IEDB studies have a level of detail that contains StudyEvents (it is unclear to me what these are supposed to be actually) or have complex StudyArms (more than a couple)? That seems needlessly complex to me. Certainly this would be unlikely to be populated from the studies in the ADC.

James' code is creating a StudyArm, but not using it for much. Likewise for StudyEvent. My understanding is they posed it this way as it is similar to the data model being used by ImmuneSpace and HIPC. The AIRR-seq data from HIPC is supposed to be put in the ADC.

jamesaoverton commented 3 weeks ago

The IEDB schema and curation model were developed 20 years ago, and we've been improving upon them ever since as we take on new projects. The i-AKC schema builds on current work with ImmPort, HIPC, and ImmuneSpace, as @schristley said. IEDB doesn't track individual participants or their study arms, but we can fit IEDB data into the newer, more detailed schema by making general statements about it, like so:

https://github.com/airr-knowledge/ak-schema/blob/epitope-1/examples/iedb/convert.py#L142

Different projects capture some parts of a study in more detail and some parts in less detail, depending on their needs. In order to integrate data across all those projects, we need to support detailed data across the whole study. If a particular project, like IEDB, doesn't have a lot of detail about study arms (for example), we can still fit those general statements into the schema and the ontology.

bcorrie commented 3 weeks ago

The ADC conversion is currently creating a separate StudyArm for each distinct value in the ADC study_group_description field across the ADC Repertoires of an ADC Study. So if you have Case (T1D) and Control (no T1D) assigned to the appropriate subjects, then you will get two StudyArms with the correct Participants assigned to those StudyArms, and the StudyArms will be correctly associated with the Investigation.

That is about the best we can do at the moment I think.
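The grouping described above can be sketched roughly as follows. The input follows the AIRR Repertoire layout (subject.subject_id and subject.diagnosis[].study_group_description), while the sample data, the function name, and the "name"/"participants" fields of the output are assumptions for illustration and not the actual conversion code or AKC slot names:

```python
# Hypothetical, trimmed-down ADC repertoires for one study.
repertoires = [
    {"subject": {"subject_id": "S1", "diagnosis": [{"study_group_description": "Case (T1D)"}]}},
    {"subject": {"subject_id": "S1", "diagnosis": [{"study_group_description": "Case (T1D)"}]}},
    {"subject": {"subject_id": "S2", "diagnosis": [{"study_group_description": "Case (T1D)"}]}},
    {"subject": {"subject_id": "S3", "diagnosis": [{"study_group_description": "Control (no T1D)"}]}},
]


def build_study_arms(repertoires):
    """Create one arm per distinct study_group_description, listing each subject once."""
    arms = {}  # description -> list of subject IDs, in first-seen order
    for repertoire in repertoires:
        subject = repertoire.get("subject", {})
        subject_id = subject.get("subject_id")
        for diagnosis in subject.get("diagnosis") or []:
            group = diagnosis.get("study_group_description")
            if not group or subject_id is None:
                continue  # nothing to group on
            members = arms.setdefault(group, [])
            if subject_id not in members:
                members.append(subject_id)
    return [{"name": name, "participants": members} for name, members in arms.items()]


study_arms = build_study_arms(repertoires)
for arm in study_arms:
    print(arm)
```

In this sketch the values are compared exactly, so "CASE (T1D)" and "Case (T1D)" would end up as two separate arms unless the field is normalised first.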

bcorrie commented 1 week ago

I believe linking of all generated types is working. Example outputs of converted studies can be found in the Immunology KG's Google Drive.

Things left to do: