NIH-NCPI / ncpi-model-forge

🔥 The Project Forge FHIR model
Apache License 2.0

Representing Sequencing Information and Genomic Data Files #23

Open allisonheath opened 3 years ago

allisonheath commented 3 years ago

Requester information

Please provide the following information:

entity property
genomic_file acl
genomic_file availability
genomic_file controlled_access
genomic_file data_type
genomic_file external_id
genomic_file file_format
genomic_file file_name
genomic_file hashes
genomic_file is_harmonized
genomic_file kf_id
genomic_file paired_end
genomic_file reference_genome
genomic_file size
genomic_file urls
genomic_file visible
sequencing_center external_id
sequencing_center name
sequencing_center kf_id
sequencing_center visible
sequencing_experiment experiment_date
sequencing_experiment experiment_strategy
sequencing_experiment external_id
sequencing_experiment instrument_model
sequencing_experiment is_paired_end
sequencing_experiment kf_id
sequencing_experiment library_name
sequencing_experiment library_prep
sequencing_experiment library_selection
sequencing_experiment library_strand
sequencing_experiment max_insert_size
sequencing_experiment mean_depth
sequencing_experiment mean_insert_size
sequencing_experiment mean_read_length
sequencing_experiment platform
sequencing_experiment sequencing_center_id
sequencing_experiment total_reads
sequencing_experiment visible

We have done a few iterations on this. @liberaliscomputing, could you provide a bit more detail on the latest version we're going to try?

cc @nicholasvk @youngnm @baileyckelly

liberaliscomputing commented 3 years ago

The KF FHIR team has curated modeling discussions on KFDRC FHIR Model Mappings.

Through a series of extensive discussions, we decided to model the above entities and properties into two components, kfdrc-genomic-file using DocumentReference and kfdrc-sequencing-experiment using Task. sequencing_center can directly be mapped to Organization without needing the creation of a new profile, so we didn't include it in modeling.
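To make the sequencing_center mapping concrete, here is a minimal Python sketch that turns a sequencing_center record into a plain Organization resource. The identifier system URI, the id normalization, and the sample values are assumptions made for illustration; the thread does not specify them, and `visible` is left unmapped.

```python
def sequencing_center_to_organization(center: dict) -> dict:
    """Map a KF sequencing_center record onto a base FHIR Organization."""
    identifiers = [
        # Hypothetical identifier system; not defined anywhere in this thread.
        {"system": "urn:example:kf:sequencing-center", "value": center["kf_id"]}
    ]
    if center.get("external_id"):
        identifiers.append({"value": center["external_id"]})
    return {
        "resourceType": "Organization",
        # FHIR ids do not allow underscores, so normalize the KF id.
        "id": center["kf_id"].lower().replace("_", "-"),
        "identifier": identifiers,
        "name": center["name"],
    }


# Sample values are made up for the example.
org = sequencing_center_to_organization(
    {"kf_id": "SC_EXAMPLE1", "name": "Example Sequencing Center", "external_id": "ext-1"}
)
print(org["id"], org["name"])
```

No custom profile is needed here because kf_id, external_id, and name all fit elements that Organization already defines.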

1. kfdrc-genomic-file

We decided to use DocumentReference as a base profile based on other initiatives' effort in this area:

Then, from the above properties regarding genomic_file, we excluded is_harmonized, paired_end, and reference_genome because these properties are conceptually not file metadata, but output dimensions of genomic sequencing.

On top of this, we decided to add an extension called accession-identifier because, in KFDRC, we control file accession based on various levels of user authorization.

Our modeling effort as part of the software development cycle has been curated here:

While working on modeling kfdrc-genomic-file, we found the following issues (described in the issue in detail):

The following is an example resource:

{
  "resourceType": "DocumentReference",
  "id": "gf-001",
  "meta": {
    "profile": [
      "http://fhir.kids-first.io/StructureDefinition/kfdrc-genomic-file"
    ],
    "versionId": "0.1.0"
  },
  "identifier": [
    {
      "system": "https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?study_id=SD_PREASA7S",
      "value": "kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam"
    }
  ],
  "extension": [
    {
      "extension": [
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "phs001138.c1"
          }
        },
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "SD_PREASA7S"
          }
        }
      ],
      "url": "http://fhir.kids-first.io/StructureDefinition/accession-identifier"
    }
  ],
  "status": "current",
  "type": {
    "coding": [
      {
        "system": "http://fhir.kids-first.io/CodeSystem/data-type",
        "code": "C164052",
        "display": "Aligned Sequence Read"
      }
    ],
    "text": "Aligned Reads"
  },
  "subject": {
    "reference": "Patient/pt-001"
  },
  "content": [
    {
      "attachment": {
        "extension": [
          {
            "url": "http://fhir.kids-first.io/StructureDefinition/large-size",
            "valueDecimal": 72605537636
          }
        ],
        "url": "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
        "title": "HMNVCCCXX-7.hgv.bam"
      },
      "format": {
        "display": "bam"
      }
    }
  ]
}
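
As a rough illustration of how a client might consume a resource shaped like the one above, the following Python sketch pulls out the accession values, the S3 URL, and the size. The `summarize_genomic_file` helper is hypothetical, and the dict is a trimmed copy of the example. The size is presumably carried in the large-size extension because FHIR R4 restricts `Attachment.size` to a 32-bit range, which this file exceeds.

```python
ACCESSION_URL = "http://fhir.kids-first.io/StructureDefinition/accession-identifier"
LARGE_SIZE_URL = "http://fhir.kids-first.io/StructureDefinition/large-size"


def summarize_genomic_file(doc_ref: dict) -> dict:
    """Collect the fields a downloader would need from a kfdrc-genomic-file."""
    accessions = [
        sub["valueIdentifier"]["value"]
        for ext in doc_ref.get("extension", [])
        if ext.get("url") == ACCESSION_URL
        for sub in ext.get("extension", [])
        if sub.get("url") == "accession"
    ]
    attachment = doc_ref["content"][0]["attachment"]
    size = next(
        (e["valueDecimal"] for e in attachment.get("extension", [])
         if e.get("url") == LARGE_SIZE_URL),
        None,
    )
    return {
        "url": attachment["url"],
        "file_name": attachment.get("title"),
        "size": size,
        "accessions": accessions,
    }


# Trimmed copy of the example resource above.
example = {
    "resourceType": "DocumentReference",
    "extension": [{
        "url": ACCESSION_URL,
        "extension": [
            {"url": "accession", "valueIdentifier": {"value": "phs001138.c1"}},
            {"url": "accession", "valueIdentifier": {"value": "SD_PREASA7S"}},
        ],
    }],
    "content": [{
        "attachment": {
            "extension": [{"url": LARGE_SIZE_URL, "valueDecimal": 72605537636}],
            "url": "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
            "title": "HMNVCCCXX-7.hgv.bam",
        }
    }],
}
summary = summarize_genomic_file(example)
print(summary)
```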
liberaliscomputing commented 3 years ago

2. kfdrc-sequencing-experiment

HL7 has made a concerted effort to bring in genomics, largely using DiagnosticReport, MolecularSequence, and Observation. The main differences between HL7's genomics implementation and KF's sequencing_experiment include:

Against this backdrop, our initial effort in modeling kfdrc-sequencing-experiment focuses on the above-explained "processes."

Given the above properties of sequencing_experiment, we characterized them into three dimensions:

"Information about a sequencing event" is a set of metadata such as experiment date (Task.authoredOn) and performer (Task.owner). We discuss sequencing inputs and outputs per process (i.e., source / harmonized) in detail below.
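
A minimal Python sketch of the event-level metadata, assuming experiment_date feeds Task.authoredOn and sequencing_center_id points at an Organization for Task.owner. The status and intent values are placeholders picked only because Task requires both; the sample ids are hypothetical.

```python
def sequencing_event_metadata(experiment: dict) -> dict:
    """Build the event-level part of a Task from a sequencing_experiment record."""
    return {
        "resourceType": "Task",
        "status": "completed",  # assumption: the experiment has already run
        "intent": "order",      # assumption: the thread does not pick an intent
        "authoredOn": experiment["experiment_date"],
        "owner": {"reference": f"Organization/{experiment['sequencing_center_id']}"},
    }


# Hypothetical sample record.
task_meta = sequencing_event_metadata(
    {"experiment_date": "2020-09-16", "sequencing_center_id": "sc-example1"}
)
print(task_meta["authoredOn"], task_meta["owner"]["reference"])
```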

Our modeling effort as part of the software development cycle has been curated here:

2.1 Task as partOf Task

In brief, the KF DRC currently follows this process:

  1. Register biospecimens;
  2. Register source sequencing_experiments given sequencing manifests from sequencing_centers;
  3. Register source genomic_files uploaded to our S3 by sequencing_centers;
  4. Link biospecimens and source genomic_files; and
  5. Link source sequencing_experiments and source genomic_files

Once BIXU has delivered harmonized genomic_files:

  1. Register harmonized genomic_files uploaded to our S3 by BIXU;
  2. Link biospecimens and harmonized genomic_files; and
  3. Link source sequencing_experiments and harmonized genomic_files

Technically, the harmonized genomic_files are produced by different sequencing_experiments. We link them to the source sequencing_experiments, as illustrated above, because the current KF model has no means to bundle source and harmonized sequencing_experiments (had we created separate harmonized sequencing_experiments).

FHIR's Task addresses this issue well because a Task can be part of another Task (Task.partOf). We therefore envision three Tasks: one parent Task and two children (one for the source sequencing_experiment and one for the harmonized sequencing_experiment). The "Genomics (workflow)" tab of the KFDRC FHIR ERD renders the proposed concept graphically.

2.2 Source kfdrc-sequencing-experiment

The following shows how our KF entities / properties map to FHIR attributes:

  1. Inputs (Task.input)
  2. Outputs (Task.output)
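
Since the detailed mapping lists are not preserved here, the following Python sketch only illustrates the general shape of Task.input and Task.output. It assumes a biospecimen goes in and a source genomic file comes out; the `io_entry` helper, the type labels, and the Specimen id are hypothetical, while `gf-001` is the DocumentReference from the earlier example.

```python
def io_entry(label: str, reference: str) -> dict:
    """Build one Task.input / Task.output entry: a type plus a value[x]."""
    return {
        "type": {"text": label},
        "valueReference": {"reference": reference},
    }


source_task = {
    "resourceType": "Task",
    "id": "se-001-source",
    "status": "completed",  # placeholder
    "intent": "order",      # placeholder
    # Assumption: the sequenced biospecimen is the input.
    "input": [io_entry("Biospecimen", "Specimen/bs-001")],
    # Assumption: the source genomic file is the output.
    "output": [io_entry("Source genomic file", "DocumentReference/gf-001")],
}

print(source_task["output"][0]["valueReference"]["reference"])
```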

2.3 Harmonized kfdrc-sequencing-experiment

The following shows how our KF entities / properties map to FHIR attributes:

  1. Inputs (Task.input)
  2. Outputs (Task.output)
liberaliscomputing commented 3 years ago

Regarding data type / file format for kfdrc-genomic-file, during the standup on 09-16-2020, we provisionally decided:

bwalsh commented 3 years ago

DRS is the GA4GH preferred mechanism to represent file objects: The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID. -- data-repository-service-schemas

FHIR representations of files associated with Study, Subject, and Specimen will be more useful to downstream use cases if they contain DRS attributes.
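
To illustrate the "logical ID to physical retrieval" idea, here is a small Python sketch that converts a hostname-based drs:// URI into the HTTPS endpoint a client would query for the object's metadata, following the `/ga4gh/drs/v1/objects/{object_id}` path from the DRS v1 API. The function name and the sample host are hypothetical, and compact-identifier URIs (prefix:accession) are deliberately not handled.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_https(drs_uri: str) -> str:
    """Translate a hostname-based drs:// URI into its DRS v1 object URL."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        # Covers compact identifiers such as drs://prefix:accession,
        # which need a resolver lookup and are out of scope here.
        raise ValueError(f"no object id in DRS URI: {drs_uri!r}")
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


# Hypothetical host and object id.
print(drs_uri_to_https("drs://drs.example.org/314159"))
```

The returned URL only locates the DrsObject metadata; fetching the bytes still goes through one of the object's access methods.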

An FSH rendering of the original OpenAPI definition:

Profile:        DRSAttachment
Parent:         Attachment
Id:             drs-attachment
Title:          "DRS Attachment"
Description:    "A FHIR Attachment extended with DRS Object attributes."
// https://github.com/ga4gh/data-repository-service-schemas/blob/master/openapi/data_repository_service.swagger.yaml#L190-L304

// adds DRSObject to Attachment
* extension contains DRSObject named drs 0..1

// inline definition of sub-extensions
Extension:  DRSObject
Id: drs-object
Title: "DRS Object"
Description: "The drs object"
* extension contains
    id 1..1 MS and
    name 0..1 and
    self_uri 1..1 MS and
    size 1..1 MS and
    created_time 1..1 MS and
    updated_time 0..1 and
    version 0..1 and
    mime_type 0..1 
    // and DRSChecksum named checksums 1..* MS
    // and DRSAccessMethod named access_methods 1..* MS

* extension[id] ^short = "An identifier unique to this `DrsObject`."
* extension[id].value[x] only string
* extension[name] ^short = "A string that can be used to name a `DrsObject`."
* extension[name].value[x] only string
* extension[self_uri] ^short = "A drs:// URI, as defined in the DRS documentation, that tells clients how to access this object."
* extension[self_uri].value[x] only string
* extension[size] ^short = "For blobs, the blob size in bytes.  For bundles, the cumulative size, in bytes, of items in the `contents` field."
* extension[size].value[x] only integer
* extension[created_time] ^short = "Timestamp of content creation in RFC3339."
* extension[created_time].value[x] only dateTime
* extension[updated_time] ^short = "Timestamp of content update in RFC3339, identical to `created_time` in systems that do not support updates."
* extension[updated_time].value[x] only dateTime
* extension[version] ^short = "A string representing a version. (Some systems may use checksum, a RFC3339 timestamp, or an incrementing version number.)"
* extension[version].value[x] only string
* extension[mime_type] ^short = "A string providing the mime-type of the `DrsObject`."
* extension[mime_type].value[x] only string

Extension:  DRSChecksum
Id: drs-checksum
Title: "DRS Checksum"
Description: "The checksum of the `DrsObject`. At least one checksum must be provided."    
* extension contains
    checksum 1..1 MS and
    type 1..1 MS
* extension[checksum] ^short = "The hex-string encoded checksum for the data."
* extension[checksum].value[x] only string
* extension[type] ^short = "The digest method used to create the checksum."
* extension[type].value[x] only string

Extension:  DRSAccessMethod
Id: drs-access-method
Title: "DRS AccessMethod"
Description: "The list of access methods that can be used to fetch the `DrsObject`."    
* extension contains
    type 1..1 MS and
    access_url 0..1 and
    access_id 0..1 and
    region 0..1
* extension[type] ^short = "Type of the access method."
* extension[type].value[x] only string
* extension[access_url] ^short = "An `AccessURL` that can be used to fetch the actual object bytes."
* extension[access_url].value[x] only string
* extension[access_id] ^short = "An arbitrary string to be passed to the `/access` method to get an `AccessURL`."
* extension[access_id].value[x] only string
* extension[region] ^short = "Name of the region in the cloud service provider that the object belongs to."
* extension[region].value[x] only string

Instance: DRSAttachmentExample
InstanceOf: DRSAttachment
Description: "An example representation of a DRSAttachment"
Usage: #inline
* id = "any-attachment-id"
* contentType = #application/json
* extension[drs].extension[id].valueString = "any-id"
* extension[drs].extension[name].valueString = "any-file-name"
* extension[drs].extension[self_uri].valueString = "drs://url-here"
* extension[drs].extension[created_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[updated_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[size].valueInteger = 12345
* extension[drs].extension[version].valueString = "0.0.0"
* extension[drs].extension[mime_type].valueString = "application/json"
* extension[drs].extension[checksums].extension[checksum].valueString = "abcdef0123456789"
* extension[drs].extension[checksums].extension[type].valueString = "etag"
* extension[drs].extension[access_methods].extension[type].valueString = "s3"
* extension[drs].extension[access_methods].extension[access_url].valueString = "s3://some-url-here"
* extension[drs].extension[access_methods].extension[region].valueString = "us-west"
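
For readers less familiar with FSH, this Python sketch builds roughly the nested extension JSON that the scalar sub-extensions above would produce on an Attachment. The canonical URL is a hypothetical stand-in (the real one depends on the IG's canonical), the `drs_extension` helper is illustrative, and checksums and access_methods are omitted because they are commented out in the DRSObject definition.

```python
# Hypothetical canonical; the actual URL depends on the IG configuration.
DRS_OBJECT_URL = "http://example.org/fhir/StructureDefinition/drs-object"

# Sub-extension name -> FHIR value[x] key, mirroring the FSH type constraints.
FIELDS = [
    ("id", "valueString"),
    ("name", "valueString"),
    ("self_uri", "valueString"),
    ("size", "valueInteger"),
    ("created_time", "valueDateTime"),
    ("updated_time", "valueDateTime"),
    ("version", "valueString"),
    ("mime_type", "valueString"),
]
REQUIRED = {"id", "self_uri", "size", "created_time"}  # the 1..1 MS elements


def drs_extension(obj: dict) -> dict:
    """Wrap DRS object attributes as a complex FHIR extension."""
    missing = REQUIRED - obj.keys()
    if missing:
        raise ValueError(f"missing required DRS attributes: {sorted(missing)}")
    return {
        "url": DRS_OBJECT_URL,
        "extension": [
            {"url": name, key: obj[name]} for name, key in FIELDS if name in obj
        ],
    }


attachment = {
    "contentType": "application/json",
    "extension": [drs_extension({
        "id": "any-id",
        "name": "any-file-name",
        "self_uri": "drs://url-here",
        "size": 12345,
        "created_time": "1985-04-12T23:20:50.52Z",
    })],
}
print(attachment["extension"][0]["extension"][3])
```

One caveat worth checking: FHIR R4 `integer` is 32-bit, so a size such as the 72605537636-byte BAM in the earlier example would not fit `valueInteger`; that example uses `valueDecimal` instead.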