airr-knowledge / issues

Issues and project management for the AKC

Assessment of validation processes and needs (1.4A) #8

Closed schristley closed 3 months ago

schristley commented 8 months ago

Aim 1.4 critical metrics: the number of repositories that implement the validation processes; the proportion of DEs covered by the validation processes; the proportion of data within each repository that is covered by the validation processes; and the number of processes that have been automated.

We will develop processes to ensure the contents of each repository are complete, accurate, and compliant with the standards, ontologies, and CDM from Aims 1.1-1.2. Each repository has existing validation processes that will be enhanced and further automated to incorporate the standards. These will be applied during and after curation. Any data/metadata that fails validation will be flagged for human review and correction, which will inform refinement of the curation procedures.

williamdlees commented 8 months ago

I have added a document for VDJbase to the AIRR Knowledge folder covering 1.3A/1.4A. @schristley, please take a look and let me know whether it's what you were expecting.

bcorrie commented 4 months ago

My curation doc also contains validation processes.

bcorrie commented 3 months ago

@schristley, here is the validation need we talked about at our last meeting: checking the ADC for consistency of objects for potential "normalization/conversion" to the AKC CDEs. A proof of concept follows.

For example, all Subject fields should be the same if the study_id and subject_id are the same. This is pretty easy to check...

For a subject from a specific study, use the following ADC query:

{
  "filters": {
    "op": "and",
    "content": [
      { "op": "=", "content": { "field": "subject.subject_id", "value": "nPOD6342" } },
      { "op": "=", "content": { "field": "study.study_id", "value": "DOI:10.1073/pnas.2107208118" } }
    ]
  }
}
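
When looping over many subjects, the same filter could be built programmatically instead of hand-editing the JSON file. This is a minimal sketch: the helper name `subject_query` is hypothetical, and only the field names, operators, and example values come from the query above.

```python
import json


def subject_query(study_id, subject_id):
    """Build an ADC filter that matches one subject within one study."""
    return {
        "filters": {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "subject.subject_id", "value": subject_id}},
                {"op": "=", "content": {"field": "study.study_id", "value": study_id}},
            ],
        }
    }


if __name__ == "__main__":
    # Reproduces the query shown above for subject nPOD6342.
    query = subject_query("DOI:10.1073/pnas.2107208118", "nPOD6342")
    print(json.dumps(query, indent=2))
```

The printed JSON can be saved to a file and passed to curl with `-d @file`, as in the commands in this comment.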

Posting that query (saved as repertoire-project-subject.json) to the repository returns the repertoire_id values:

curl -s -d @repertoire-project-subject.json https://t1d-1.ireceptor.org/airr/v1/repertoire | jq '.Repertoire[].repertoire_id'
"656b934f190f680f22dc2120"
"656b934f190f680f22dc2121"
"656b934f190f680f22dc2122"
"656b9350190f680f22dc2123"

For each repertoire, fetch the record by repertoire_id:

curl -s -d '{"filters": {"op":"=","content":{"field":"repertoire_id", "value":"656b9350190f680f22dc2123"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire | jq '.Repertoire[0].subject' > 656b9350190f680f22dc2123.out
curl -s -d '{"filters": {"op":"=","content":{"field":"repertoire_id", "value":"656b934f190f680f22dc2122"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire | jq '.Repertoire[0].subject' > 656b934f190f680f22dc2122.out
curl -s -d '{"filters": {"op":"=","content":{"field":"repertoire_id", "value":"656b934f190f680f22dc2121"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire | jq '.Repertoire[0].subject' > 656b934f190f680f22dc2121.out
curl -s -d '{"filters": {"op":"=","content":{"field":"repertoire_id", "value":"656b934f190f680f22dc2120"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire | jq '.Repertoire[0].subject' > 656b934f190f680f22dc2120.out

This extracts the subject JSON only. We can then do a diff:

diff 656b9350190f680f22dc2123.out 656b934f190f680f22dc2120.out
diff 656b9350190f680f22dc2123.out 656b934f190f680f22dc2121.out
diff 656b9350190f680f22dc2123.out 656b934f190f680f22dc2122.out

Voila, subject fields for the four repertoires from this subject are identical!

I suspect writing a Python tool to do this for a given study would be pretty straightforward. We would need to decide which fields we expect to be the same (it might make sense in some cases for there to be differences).
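
As a rough starting point, such a tool might look like the sketch below. It groups repertoires by (study_id, subject_id) and reports which subject fields differ within each group. The function names and the `ignore_fields` parameter are hypothetical, the JSON content type header is an assumption about what the repository accepts, and the sample records in the usage section are made up for illustration.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_repertoires(base_url, query):
    """POST an ADC query to <base_url>/repertoire and return the Repertoire list."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/repertoire",
        data=json.dumps(query).encode("utf-8"),
        # Assumption: the repository accepts a JSON content type.
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("Repertoire", [])


def find_subject_inconsistencies(repertoires, ignore_fields=()):
    """Return {(study_id, subject_id): [differing subject field names]}.

    ignore_fields lists subject fields that are allowed to differ.
    """
    groups = defaultdict(list)
    for rep in repertoires:
        subject = rep.get("subject") or {}
        study = rep.get("study") or {}
        groups[(study.get("study_id"), subject.get("subject_id"))].append(subject)

    problems = {}
    for key, subjects in groups.items():
        reference = subjects[0]
        differing = set()
        # Comparing every record to the first is enough: if any two records
        # disagree on a field, at least one of them disagrees with the first.
        for subject in subjects[1:]:
            for field in set(reference) | set(subject):
                if field in ignore_fields:
                    continue
                if reference.get(field) != subject.get(field):
                    differing.add(field)
        if differing:
            problems[key] = sorted(differing)
    return problems


if __name__ == "__main__":
    # Made-up records, not data from any repository.
    sample = [
        {"repertoire_id": "r1", "study": {"study_id": "S1"},
         "subject": {"subject_id": "A", "sex": "male"}},
        {"repertoire_id": "r2", "study": {"study_id": "S1"},
         "subject": {"subject_id": "A", "sex": "female"}},
    ]
    print(find_subject_inconsistencies(sample))
```

For a whole study, the list returned by `fetch_repertoires` for a query that filters only on study.study_id could be passed straight to `find_subject_inconsistencies`. Once we agree on which fields may legitimately differ, they would go in `ignore_fields`.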

schristley commented 3 months ago

Documents for all of the repositories have been created in Google Drive except for IRAD, which is still in development.