anvilproject / cmg-data-ingest

Ingest scripts for the CMG dataset to FHIR

Add support for CMG Sequencing table data #1

Open torstees opened 3 years ago

torstees commented 3 years ago

For the first round of integration testing with Kids First, we need the priority 1 fields from the sequencing table (and possibly the priority 2 fields). These fields include:

Priority 1

- seq_filename
- analyte_type
- sequencing_assay
- library_prep_kit_method
- reference_genome_build
- alignment_method
- data_processing_pipeline
- functional_equivalence_standard
- date_data_generation

Priority 2

- exome_capture_platform
- capture_region_bed_file

Our solution will likely borrow heavily from the Kids First team's discussion of kidsfirst-sequence-experiment.
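
As a rough sketch of how the two priority tiers could be checked before ingest: the field names below come from the lists above, while the constant names and the `missing_fields` helper are hypothetical and not part of the existing scripts.

```python
# Field names are copied from the priority lists in this issue.
# PRIORITY_1, PRIORITY_2 and missing_fields are illustrative names only.
PRIORITY_1 = (
    "seq_filename",
    "analyte_type",
    "sequencing_assay",
    "library_prep_kit_method",
    "reference_genome_build",
    "alignment_method",
    "data_processing_pipeline",
    "functional_equivalence_standard",
    "date_data_generation",
)
PRIORITY_2 = ("exome_capture_platform", "capture_region_bed_file")


def missing_fields(row, required=PRIORITY_1):
    """Return the required field names that are absent or blank in a row."""
    missing = []
    for name in required:
        value = row.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing
```

Calling `missing_fields(row, PRIORITY_1 + PRIORITY_2)` would check both tiers at once.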

katiebanaz commented 3 years ago

Test comment

torstees commented 3 years ago

In order to establish uniqueness, based on a quick look at the data, it seems safe to base a sequencing object on the filename.
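
A minimal sketch of that idea, assuming each row is a dict with a `seq_filename` key. The identifier shape (`urn:ncpi:unique-string` with a `Task|<filename>` value) mirrors the example payload posted in this thread; the function names are hypothetical.

```python
from collections import Counter


def task_identifier(seq_filename):
    """Build a filename-based identifier, following the payload in this thread."""
    return {
        "system": "urn:ncpi:unique-string",
        "value": f"Task|{seq_filename}",
    }


def duplicate_filenames(rows):
    """Return filenames that occur more than once, which would break uniqueness."""
    counts = Counter(row["seq_filename"] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)
```

Running `duplicate_filenames` over the full table first is a cheap way to confirm that the "quick look" holds before relying on the filename as a key.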

torstees commented 3 years ago

First pass is complete with a handful of general assignments, primarily borrowed from the KF docs of similar data.

- resourceType: Task
- owner => Sequencing Center (Organization)
- authoredOn => date_data_generation

Most of the details live in the input and output arrays. For now, most values are plain strings, but they can be switched to codes once we have a clear terminology to use. Which variables go into input versus output is largely arbitrary. I believe the KF team was thinking in terms of what goes into the actual genotyping device, rather than treating the whole process as a black box where samples go in and final products come out. As a result, inputs to the actual pipelines currently sit in the output array.

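Inputs:

- Sample (reference to the Specimen)
- Analyte Type
- Library Prep Kit
- Exome Capture Platform
- Capture Region Bed File

Outputs:

- Reference Genome Build
- Alignment Method
- Data Processing Pipeline
- Functional Equivalence Standard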

```json
{
  "host": "http://localhost:8000",
  "type": "sequencing_data",
  "body": {
    "resourceType": "Task",
    "id": "38924.merged.matefixed.sorted.markeddups.recal.bam",
    "status": "completed",
    "description": "Generate sequence data for use by researchers",
    "owner": {
      "reference": "Organization/FD",
      "display": "FD"
    },
    "meta": {
      "profile": [
        "http://hl7.org/fhir/StructureDefinition/Task"
      ]
    },
    "identifier": [
      {
        "system": "urn:ncpi:unique-string",
        "value": "Task|38924.merged.matefixed.sorted.markeddups.recal.bam"
      }
    ],
    "output": [
      {
        "type": {
          "text": "Reference Genome Build"
        },
        "valueString": "GRCh38DH"
      },
      {
        "type": {
          "text": "Alignment Method"
        },
        "valueString": "bwa-0.7.15"
      },
      {
        "type": {
          "text": "Data Processing Pipeline"
        },
        "valueString": "3.0_DNA_Pipeline"
      },
      {
        "type": {
          "text": "Functional Equivalence Standard"
        },
        "valueBoolean": false
      }
    ],
    "input": [
      {
        "type": {
          "text": "Sample"
        },
        "valueReference": {
          "reference": "Specimen/4774"
        }
      },
      {
        "type": {
          "text": "Analyte Type"
        },
        "valueString": "DNA"
      },
      {
        "type": {
          "text": "Library Prep Kit"
        },
        "valueString": "DNA_3.0_library_prep"
      },
      {
        "type": {
          "text": "Exome Capture Platform"
        },
        "valueString": "nimblegen_solution_bigexome_2011"
      },
      {
        "type": {
          "text": "Capture Region Bed File"
        },
        "valueString": "nimblegen_solution_bigexome_2011.hg19.list.bed"
      }
    ],
    "authoredOn": "2016-05-26"
  }
}
```

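
For illustration, here is a rough sketch of how a payload body like the one above could be assembled from one row of the sequencing table. It is not the actual ingest code: `build_sequencing_task` and its helpers are hypothetical names, `specimen_id` and `center_id` are assumed to be resolved elsewhere, and the functional equivalence flag is assumed to arrive as the text "true"/"false". The labels and input/output split follow the example payload.

```python
def _string_param(row, label, field):
    """Return a Task input/output entry, or None when the field is empty."""
    value = row.get(field)
    if value is None or value == "":
        return None
    return {"type": {"text": label}, "valueString": str(value)}


def _to_bool(value):
    """Assumption: the source table stores this flag as text such as 'true'."""
    return str(value).strip().lower() == "true"


def build_sequencing_task(row, specimen_id, center_id):
    """Assemble a Task dict shaped like the example payload body."""
    filename = row["seq_filename"]
    inputs = [
        {"type": {"text": "Sample"},
         "valueReference": {"reference": f"Specimen/{specimen_id}"}},
        _string_param(row, "Analyte Type", "analyte_type"),
        _string_param(row, "Library Prep Kit", "library_prep_kit_method"),
        _string_param(row, "Exome Capture Platform", "exome_capture_platform"),
        _string_param(row, "Capture Region Bed File", "capture_region_bed_file"),
    ]
    outputs = [
        _string_param(row, "Reference Genome Build", "reference_genome_build"),
        _string_param(row, "Alignment Method", "alignment_method"),
        _string_param(row, "Data Processing Pipeline", "data_processing_pipeline"),
        {"type": {"text": "Functional Equivalence Standard"},
         "valueBoolean": _to_bool(row.get("functional_equivalence_standard"))},
    ]
    return {
        "resourceType": "Task",
        "id": filename,
        "status": "completed",
        "owner": {"reference": f"Organization/{center_id}", "display": center_id},
        "identifier": [
            {"system": "urn:ncpi:unique-string", "value": f"Task|{filename}"}
        ],
        # Drop optional entries (e.g. priority 2 fields) that had no value.
        "input": [entry for entry in inputs if entry is not None],
        "output": [entry for entry in outputs if entry is not None],
        "authoredOn": row["date_data_generation"],
    }
```

Emitting the flag as a real boolean rather than the string "false" matters because FHIR JSON expects `valueBoolean` to be a JSON boolean.
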
torstees commented 3 years ago

Closed by accident

torstees commented 3 years ago

Things have changed since this was originally described, largely as a result of further discussions with the KF team. For our current use, given the small number of attributes, all of the inputs still reasonably apply to the Sequencing Task itself. The output, however, has been stripped down to the DocumentReference, which represents the actual byproduct of the sequencing process. We then attach an Observation to that DocumentReference; its components describe the contents of the document, such as the Reference Sequence, Alignment Method, etc.

[Image: seq-data-graphic]
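
The revised layout could be sketched roughly as follows. This is an assumption-heavy illustration, not the project's code: the function name, the display labels, the use of `Observation.focus` to point at the DocumentReference, and the use of the filename as the attachment title are all guesses made for the example.

```python
def build_file_resources(row):
    """Return (task_output, document_reference, observation) dicts for one file."""
    filename = row["seq_filename"]
    doc_reference = f"DocumentReference/{filename}"

    # The file itself; using the filename as the attachment title is an assumption.
    document = {
        "resourceType": "DocumentReference",
        "id": filename,
        "status": "current",
        "content": [{"attachment": {"title": filename}}],
    }

    # Descriptive attributes move off the Task and into Observation components.
    component_fields = [
        ("Reference Genome Build", "reference_genome_build"),
        ("Alignment Method", "alignment_method"),
        ("Data Processing Pipeline", "data_processing_pipeline"),
    ]
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Sequencing file description"},  # hypothetical label
        "focus": [{"reference": doc_reference}],  # assumed linkage to the file
        "component": [
            {"code": {"text": label}, "valueString": str(row[field])}
            for label, field in component_fields
            if row.get(field)
        ],
    }

    # The Task's only remaining output points at the DocumentReference.
    task_output = {
        "type": {"text": "Sequencing Data File"},  # hypothetical label
        "valueReference": {"reference": doc_reference},
    }
    return task_output, document, observation
```

The returned `task_output` would replace the whole `output` array of the Task, while the Task's `input` array stays as before.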