Open allisonheath opened 3 years ago
The KF FHIR team has curated modeling discussions on KFDRC FHIR Model Mappings.
Through a series of extensive discussions, we decided to model the above entities and properties as two components: kfdrc-genomic-file using DocumentReference, and kfdrc-sequencing-experiment using Task. sequencing_center can be mapped directly to Organization without creating a new profile, so we didn't include it in the modeling.
kfdrc-genomic-file
We decided to use DocumentReference as a base profile based on other initiatives' efforts in this area:
- HtsFile: https://aehrc.github.io/fhir-phenopackets-ig/StructureDefinition-HtsFile.html
- anvil-document-reference: http://anvil-fhir.s3-website-us-west-2.amazonaws.com/StructureDefinition-anvil-document-reference.html
Then, from the above properties regarding genomic_file, we excluded is_harmonized, paired_end, and reference_genome because these properties are conceptually not file metadata, but output dimensions of genomic sequencing.
On top of this, we decided to add an extension called accession-identifier because, in KFDRC, we control file access based on various levels of user authorization.
Our modeling effort as part of the software development cycle has been curated here:
While working on modeling kfdrc-genomic-file, we found the following issues (described in the issue in detail):
- DocumentReference.content.attachment.size is an unsignedInt, which ranges between 0 and 2,147,483,647. KF genomic_files usually overflow this range limit.
- The large-size extension, where the type of data is decimal, which doesn't have a range limit. This extension will be bound to Attachment.
- CodeSystem-data-type and CodeSystem-file-format, bound to ValueSet-data-type and ValueSet-file-format respectively. Finally, these ValueSets are bound to DocumentReference.type and DocumentReference.content.format respectively.
The following is an example resource:
{
  "resourceType": "DocumentReference",
  "id": "gf-001",
  "meta": {
    "profile": [
      "http://fhir.kids-first.io/StructureDefinition/kfdrc-genomic-file"
    ],
    "versionId": "0.1.0"
  },
  "identifier": [
    {
      "system": "https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?study_id=SD_PREASA7S",
      "value": "kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam"
    }
  ],
  "extension": [
    {
      "extension": [
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "phs001138.c1"
          }
        },
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "SD_PREASA7S"
          }
        }
      ],
      "url": "http://fhir.kids-first.io/StructureDefinition/accession-identifier"
    }
  ],
  "status": "current",
  "type": {
    "coding": [
      {
        "system": "http://fhir.kids-first.io/CodeSystem/data-type",
        "code": "C164052",
        "display": "Aligned Sequence Read"
      }
    ],
    "text": "Aligned Reads"
  },
  "subject": {
    "reference": "Patient/pt-001"
  },
  "content": [
    {
      "attachment": {
        "extension": [
          {
            "url": "http://fhir.kids-first.io/StructureDefinition/large-size",
            "valueDecimal": 72605537636
          }
        ],
        "url": "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
        "title": "HMNVCCCXX-7.hgv.bam"
      },
      "format": {
        "display": "bam"
      }
    }
  ]
}
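The size issue described above can be sketched in Python. This is only an illustration under stated assumptions, not the KF implementation: the helper name `build_attachment` is hypothetical, and falling back to the native `size` element when the value fits in unsignedInt is an assumption (the profile may instead always carry the size in large-size).

```python
# Upper bound of FHIR unsignedInt, as stated in the discussion above.
UNSIGNED_INT_MAX = 2_147_483_647
LARGE_SIZE_URL = "http://fhir.kids-first.io/StructureDefinition/large-size"


def build_attachment(url, title, size_bytes):
    """Build an Attachment dict (hypothetical helper).

    Assumption: use Attachment.size when the value fits in unsignedInt,
    otherwise carry the size in the large-size extension as valueDecimal.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    attachment = {"url": url, "title": title}
    if size_bytes <= UNSIGNED_INT_MAX:
        attachment["size"] = size_bytes
    else:
        attachment["extension"] = [
            {"url": LARGE_SIZE_URL, "valueDecimal": size_bytes}
        ]
    return attachment


# The file from the example resource is far larger than unsignedInt allows.
bam = build_attachment(
    "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
    "HMNVCCCXX-7.hgv.bam",
    72605537636,
)
```

Here `bam` carries its size only in the extension, matching the shape of the attachment in the example resource.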
kfdrc-sequencing-experiment
HL7 has made a concerted effort to bring in genomics, largely using DiagnosticReport, MolecularSequence, and Observation. The main differences between HL7's genomics implementation and KF's sequencing_experiment include:
- sequencing_experiment comprises the processes of 1) sequencing specimens to yield "source" genomic_files (by sequencing_centers) and 2) aligning these source genomic_files against specific reference_genomes to yield "harmonized" genomic_files (by BIXU).
Against this backdrop, our initial effort in modeling kfdrc-sequencing-experiment focuses on the "processes" explained above.
Given the above properties of sequencing_experiment, we grouped them into three dimensions:
"Information about a sequencing event" is a set of metadata such as the experiment date (Task.authoredOn) and performer (Task.owner). We discuss sequencing inputs and outputs per process (i.e., source / harmonized) in detail below.
Our modeling effort as part of the software development cycle has been curated here:
Task as partOf Task
Currently, the KF DRC follows, in brief, this process:
- biospecimens;
- sequencing_experiments given sequencing manifests from sequencing_centers;
- genomic_files uploaded to our S3 by sequencing_centers;
- biospecimens and source genomic_files; and
- sequencing_experiments and source genomic_files
Once BIXU has delivered harmonized genomic_files:
- genomic_files uploaded to our S3 by BIXU;
- biospecimens and harmonized genomic_files; and
- sequencing_experiments and harmonized genomic_files
Technically, the harmonized genomic_files are yielded via different sequencing_experiments. We have done it as illustrated above because the current KF model has no means to bundle source and harmonized sequencing_experiments (had we created separate harmonized sequencing_experiments).
FHIR's Task addresses the above issue well because a Task can be part of another Task (Task.partOf). We therefore envision three Tasks: one parent Task and two children (one for the source sequencing_experiment and one for the harmonized sequencing_experiment). Please see the "Genomics (workflow)" tab of the KFDRC FHIR ERD, which renders the proposed concept graphically.
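The parent/child arrangement can be sketched as plain FHIR JSON built in Python. This is a rough illustration, not the final profile: the helper `make_task`, the resource ids, and the `status` / `intent` values are placeholders chosen for the example (FHIR R4 requires both elements on Task, but the thread does not say which codes KF will use).

```python
def make_task(task_id, part_of=None):
    """Build a minimal Task dict (hypothetical helper).

    status and intent are required by FHIR R4; the codes used here are
    placeholders, not values agreed on in this discussion.
    """
    task = {
        "resourceType": "Task",
        "id": task_id,
        "status": "completed",
        "intent": "order",
    }
    if part_of is not None:
        # Task.partOf is a list of references to other Tasks.
        task["partOf"] = [{"reference": f"Task/{part_of}"}]
    return task


# One parent Task bundles the source and harmonized child Tasks.
parent = make_task("se-001")
source = make_task("se-001-source", part_of=parent["id"])
harmonized = make_task("se-001-harmonized", part_of=parent["id"])
```

Both children point at the same parent, which is what lets a client retrieve the source and harmonized processes together.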
kfdrc-sequencing-experiment
The following shows the mappings of our KF entities / properties >> FHIR attributes:
- Task.input
  - biospecimen >> valueReference
  - experiment_strategy >> valueCodeableConcept
  - instrument_model >> valueCodeableConcept
  - is_paired_end >> valueBoolean
  - library_name >> valueString
  - library_prep >> valueCodeableConcept
  - library_selection >> valueCodeableConcept
  - library_strand >> valueCodeableConcept
  - platform >> valueCodeableConcept
- Task.output
  - genomic_file >> valueReference
  - is_harmonized >> valueBoolean
  - paired_end >> valueInteger
  - reference_genome >> valueCodeableConcept
  - max_insert_size >> valueQuantity
  - mean_depth >> valueQuantity
  - mean_insert_size >> valueQuantity
  - mean_read_length >> valueQuantity
  - total_reads >> valueQuantity
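As a rough sketch of how the Task.output side of this mapping could be applied, the Python below turns KF property values into Task.output entries. Several things here are assumptions rather than decisions from this thread: the helper `to_task_output` is hypothetical, the KF property name is placed in `type.text`, CodeableConcepts are populated with `text` only, Quantities carry no unit, and the sample values are made up. Task.input could be handled the same way with its own lookup table.

```python
# KF property -> FHIR value[x] key, copied from the Task.output mapping above.
OUTPUT_VALUE_KEYS = {
    "genomic_file": "valueReference",
    "is_harmonized": "valueBoolean",
    "paired_end": "valueInteger",
    "reference_genome": "valueCodeableConcept",
    "max_insert_size": "valueQuantity",
    "mean_depth": "valueQuantity",
    "mean_insert_size": "valueQuantity",
    "mean_read_length": "valueQuantity",
    "total_reads": "valueQuantity",
}


def to_task_output(prop, raw):
    """Wrap one KF property value as a Task.output entry (hypothetical helper)."""
    key = OUTPUT_VALUE_KEYS[prop]
    if key == "valueReference":
        value = {"reference": raw}
    elif key == "valueCodeableConcept":
        value = {"text": raw}  # assumption: text only, no coding
    elif key == "valueQuantity":
        value = {"value": raw}  # assumption: unit left unspecified
    else:
        value = raw  # booleans and integers are used as-is
    return {"type": {"text": prop}, key: value}


# Made-up sample values for illustration only.
sample = {
    "genomic_file": "DocumentReference/gf-001",
    "is_harmonized": False,
    "paired_end": 1,
    "reference_genome": "GRCh38",
    "mean_depth": 30.5,
}
outputs = [to_task_output(prop, raw) for prop, raw in sample.items()]
```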
kfdrc-sequencing-experiment
The following shows the mappings of our KF entities / properties >> FHIR attributes:
- Task.input
  - genomic_file >> valueReference
- Task.output
  - genomic_file >> valueReference
  - reference_genome >> valueCodeableConcept
Re: data type / file format for kfdrc-genomic-file, during the standup on 09-16-2020, we temporarily decided:
- DocumentReference.content.format.display.
DRS is the GA4GH preferred mechanism to represent file objects: "The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID." -- data-repository-service-schemas
FHIR representations of files associated with Study, Subject, and Specimen will be more useful to downstream use cases if they contain DRS attributes.
A FSH extension of the original openapi definition:
Profile: DRSAttachment
Parent: Attachment
Id: drs-attachment
Title: "DRS Attachment"
Description: "A FHIR Attachment extended with DRS Object attributes."
// https://github.com/ga4gh/data-repository-service-schemas/blob/master/openapi/data_repository_service.swagger.yaml#L190-L304
// adds DRSObject to Attachment
* extension contains DRSObject named drs 0..1
// inline definition of sub-extensions
Extension: DRSObject
Id: drs-object
Title: "DRS Object"
Description: "The drs object"
* extension contains
    id 1..1 MS and
    name 0..1 and
    self_uri 1..1 MS and
    size 1..1 MS and
    created_time 1..1 MS and
    updated_time 0..1 and
    version 0..1 and
    mime_type 0..1
    // and DRSChecksum named checksums 1..* MS
    // and DRSAccessMethod named access_methods 1..* MS
* extension[id] ^short = "An identifier unique to this `DrsObject`."
* extension[id].value[x] only string
* extension[name] ^short = "A string that can be used to name a `DrsObject`."
* extension[name].value[x] only string
* extension[self_uri] ^short = "A drs:// URI, as defined in the DRS documentation, that tells clients how to access this object."
* extension[self_uri].value[x] only string
* extension[size] ^short = "For blobs, the blob size in bytes. For bundles, the cumulative size, in bytes, of items in the `contents` field."
* extension[size].value[x] only integer
* extension[created_time] ^short = "Timestamp of content creation in RFC3339."
* extension[created_time].value[x] only dateTime
* extension[updated_time] ^short = "Timestamp of content update in RFC3339, identical to `created_time` in systems that do not support updates."
* extension[updated_time].value[x] only dateTime
* extension[version] ^short = "A string representing a version. (Some systems may use checksum, a RFC3339 timestamp, or an incrementing version number.)"
* extension[version].value[x] only string
* extension[mime_type] ^short = "A string providing the mime-type of the `DrsObject`."
* extension[mime_type].value[x] only string
Extension: DRSChecksum
Id: drs-checksum
Title: "DRS Checksum"
Description: "The checksum of the `DrsObject`. At least one checksum must be provided."
* extension contains
    checksum 1..1 MS and
    type 1..1 MS
* extension[checksum] ^short = "The hex-string encoded checksum for the data."
* extension[checksum].value[x] only string
* extension[type] ^short = "The digest method used to create the checksum."
* extension[type].value[x] only string
Extension: DRSAccessMethod
Id: drs-access-method
Title: "DRS AccessMethod"
Description: "The list of access methods that can be used to fetch the `DrsObject`."
* extension contains
    type 1..1 MS and
    access_url 0..1 and
    access_id 0..1 and
    region 0..1
* extension[type] ^short = "Type of the access method."
* extension[type].value[x] only string
* extension[access_url] ^short = "An `AccessURL` that can be used to fetch the actual object bytes."
* extension[access_url].value[x] only string
* extension[access_id] ^short = "An arbitrary string to be passed to the `/access` method to get an `AccessURL`."
* extension[access_id].value[x] only string
* extension[region] ^short = "Name of the region in the cloud service provider that the object belongs to."
* extension[region].value[x] only string
Instance: DRSAttachmentExample
InstanceOf: DRSAttachment
Description: "An example representation of a DRSAttachment"
Usage: #inline
* id = "any-attachment-id"
* contentType = #application/json
* extension[drs].extension[id].valueString = "any-id"
* extension[drs].extension[name].valueString = "any-file-name"
* extension[drs].extension[self_uri].valueString = "drs://url-here"
* extension[drs].extension[created_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[updated_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[size].valueInteger = 12345
* extension[drs].extension[version].valueString = "0.0.0"
* extension[drs].extension[mime_type].valueString = "application/json"
* extension[drs].extension[checksums].extension[checksum].valueString = "abcdef0123456789"
* extension[drs].extension[checksums].extension[type].valueString = "etag"
* extension[drs].extension[access_methods].extension[type].valueString = "s3"
* extension[drs].extension[access_methods].extension[access_url].valueString = "s3://some-url-here"
* extension[drs].extension[access_methods].extension[region].valueString = "us-west"
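To show roughly what the DRSObject extension defined above would look like once rendered as FHIR JSON, here is a small Python sketch. It covers only the sub-extensions that are declared (not the commented-out checksums and access_methods). The canonical URL is hypothetical, since the FSH gives only the Id `drs-object` and no canonical base, and the helper `drs_extension` is an illustrative name, not part of any library.

```python
# Hypothetical canonical URL; the FSH above only defines the Id "drs-object".
DRS_OBJECT_URL = "http://example.org/fhir/StructureDefinition/drs-object"

# Sub-extension name -> FHIR value[x] key, following the FSH definition above.
DRS_VALUE_KEYS = {
    "id": "valueString",
    "name": "valueString",
    "self_uri": "valueString",
    "size": "valueInteger",
    "created_time": "valueDateTime",
    "updated_time": "valueDateTime",
    "version": "valueString",
    "mime_type": "valueString",
}
# Sub-extensions declared 1..1 in the FSH.
REQUIRED = ("id", "self_uri", "size", "created_time")


def drs_extension(drs_object):
    """Render DRS object attributes as a complex FHIR extension (sketch)."""
    missing = [name for name in REQUIRED if name not in drs_object]
    if missing:
        raise ValueError(f"missing required DRS attributes: {missing}")
    return {
        "url": DRS_OBJECT_URL,
        "extension": [
            {"url": name, key: drs_object[name]}
            for name, key in DRS_VALUE_KEYS.items()
            if name in drs_object
        ],
    }


# Values taken from the DRSAttachmentExample instance above.
attachment = {
    "contentType": "application/json",
    "extension": [
        drs_extension(
            {
                "id": "any-id",
                "name": "any-file-name",
                "self_uri": "drs://url-here",
                "size": 12345,
                "created_time": "1985-04-12T23:20:50.52Z",
            }
        )
    ],
}
```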
We have done a few iterations on this. @liberaliscomputing, could you provide a few more details on the latest version we're going to try for this?
cc @nicholasvk @youngnm @baileyckelly