IHEC / ihec-ecosystems

This repo is for code and documentation associated with the ihec-ecosystems working group
Apache License 2.0

Unified submission #50

Open dbujold opened 5 years ago

dbujold commented 5 years ago

Currently, the data hub validator relies on an exact match between an EpiRR record's experiment name and the IHEC Data Hub name. This is problematic because:

1. An EpiRR record can have more than one experiment of the same type.
2. The name used to describe the experiment can differ between the two sources.
3. The Experiment Type property will soon be optional and can be replaced by an ontology URI.
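To make the first two failure modes concrete, here is a minimal sketch. The record layout, the identifiers, and the helpers `match_by_name` and `match_by_id` are hypothetical and are not taken from the actual validator; they only show why a name comparison can be ambiguous or miss entirely where an identifier lookup does not.

```python
# Hypothetical, simplified records; the real EpiRR and Data Hub JSON differ.
epirr_experiments = [
    {"id": "EXP-1", "name": "H3K27ac"},
    {"id": "EXP-2", "name": "H3K27ac"},  # same experiment type registered twice
    {"id": "EXP-3", "name": "RNA-Seq"},
]


def match_by_name(hub_name, experiments):
    """Return every experiment whose name equals hub_name exactly."""
    return [e for e in experiments if e["name"] == hub_name]


def match_by_id(hub_id, experiments):
    """Return the experiment with the given identifier, or None."""
    by_id = {e["id"]: e for e in experiments}
    return by_id.get(hub_id)


ambiguous = match_by_name("H3K27ac", epirr_experiments)  # two candidates
missed = match_by_name("mRNA-Seq", epirr_experiments)    # naming differs, no candidate
exact = match_by_id("EXP-2", epirr_experiments)          # exactly one record
```

With an identifier, the optional Experiment Type property no longer matters for matching, since the lookup never reads it.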

dzerbino commented 4 years ago

Result of the Banff discussion: this issue would be resolved by unifying submissions to EpiRR and the IHEC Portal. Renaming the issue accordingly.

sitag commented 4 years ago

To document: the proposal is to cross-reference all repeated metadata fields in the Data Hub schema from the EpiRR registry.

dzerbino commented 4 years ago

Current pipeline: 201710_IHEC_Ecosystem_MK

Spec for EpiRR JSON: https://github.com/Ensembl/EpiRR/blob/master/README.md

Spec for Portal JSON: https://github.com/IHEC/ihec-ecosystems/tree/master/IHEC_Data_Hub

dzerbino commented 4 years ago

Desired result: a single point of contact. A JSON is submitted to EpiRR, which sends a template Portal JSON that is then filled in by the team.
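A rough sketch of what generating such a template could look like, assuming the key names from the Portal JSON outline later in this thread (`hub_description`, `datasets`, `sample_id`, `experiment_id`, `analysis_attributes`, `browser`). The shape of the input record and the function `make_portal_template` are made up for illustration; nothing here reflects how EpiRR actually stores or serves records.

```python
import json


def make_portal_template(epirr_record):
    """Build a Portal JSON skeleton from a (hypothetical) EpiRR record."""
    datasets = {}
    for exp in epirr_record["experiments"]:
        datasets[exp["experiment_id"]] = {
            "sample_id": epirr_record["sample_id"],
            "experiment_id": exp["experiment_id"],
            "analysis_attributes": {},  # left empty for the team to fill in
            "browser": {},              # left empty for the team to fill in
        }
    return {"hub_description": {}, "datasets": datasets}


# Hypothetical input; identifiers are placeholders.
record = {
    "sample_id": "SAMPLE-1",
    "experiments": [{"experiment_id": "EXP-1"}, {"experiment_id": "EXP-2"}],
}
template = make_portal_template(record)
print(json.dumps(template, indent=2))
```

The identifiers are pre-filled, so the team only has to supply what EpiRR cannot know: the analysis attributes and the track locations.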

TODO:

dzerbino commented 4 years ago

What info can be dropped from the Portal JSON (assuming the portal can read it from EpiRR)?

{
    "datasets": {
        "experiment_1": {
            "experiment_attributes": { ... } // convert to ID
        },
        "experiment_2": {
            ...
        }
    },
    "samples": { ... }
}

What info needs to be retained:

{
    "hub_description": { ... },
    "datasets": {
        "experiment_1": {
            "sample_id": "...",
            "experiment_id":  "..." ,
            "analysis_attributes": { ... },
            "browser": { ... }
        },
        "experiment_2": {
            ...
        }
    }
}

sitag commented 4 years ago

@dzerbino They use the same schemas, so everything can be dropped as long as we keep the identifiers.
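As a sketch of that rule, assuming the retained keys from the outline above: keep the two identifiers plus the portal-only blocks, drop everything else, and refuse to drop anything if an identifier is missing. The helper `slim_dataset` and the sample values are hypothetical.

```python
# Keys that stay in the Portal JSON per the outline above; everything else is
# assumed to be recoverable from EpiRR through the identifiers.
RETAINED_KEYS = ("sample_id", "experiment_id", "analysis_attributes", "browser")
REQUIRED_IDS = ("sample_id", "experiment_id")


def slim_dataset(dataset):
    """Return a copy of dataset holding only the retained keys."""
    missing = [k for k in REQUIRED_IDS if k not in dataset]
    if missing:
        raise ValueError(f"cannot drop metadata without identifiers: {missing}")
    return {k: dataset[k] for k in RETAINED_KEYS if k in dataset}


# Hypothetical dataset entry; values are placeholders.
full = {
    "sample_id": "SAMPLE-1",
    "experiment_id": "EXP-1",
    "experiment_attributes": {"experiment_type": "H3K27ac"},
    "analysis_attributes": {"analysis_group": "example"},
    "browser": {},
}
slim = slim_dataset(full)
```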

dbujold commented 4 years ago

Basically, what's needed is a way to link sample and experiment metadata, which would be obtained from EpiRR, to processed data (bigWigs, bigBeds) and data processing metadata (analysis_attributes), which would be provided to the IHEC Data Portal.
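One way to picture that link is a join on the two identifiers. In the sketch below, plain dictionaries stand in for whatever EpiRR would return; the function `resolve_dataset`, the identifiers, and the track URL are placeholders, not part of any existing API.

```python
def resolve_dataset(dataset, samples, experiments):
    """Attach registry metadata to a slim Portal dataset entry.

    samples and experiments are stand-ins for EpiRR lookups, keyed by
    identifier. Raises LookupError if an identifier is unknown.
    """
    try:
        sample = samples[dataset["sample_id"]]
        experiment = experiments[dataset["experiment_id"]]
    except KeyError as err:
        raise LookupError(f"identifier not found in registry: {err}") from None
    return {
        "sample_attributes": sample,            # from the registry
        "experiment_attributes": experiment,    # from the registry
        "analysis_attributes": dataset.get("analysis_attributes", {}),  # from the portal submission
        "browser": dataset.get("browser", {}),  # processed data tracks
    }


# Hypothetical registry contents and portal entry.
samples = {"SAMPLE-1": {"donor": "placeholder"}}
experiments = {"EXP-1": {"experiment_type": "H3K27ac"}}
portal_entry = {
    "sample_id": "SAMPLE-1",
    "experiment_id": "EXP-1",
    "analysis_attributes": {"analysis_group": "example"},
    "browser": {"signal": ["https://example.org/track.bigWig"]},
}
resolved = resolve_dataset(portal_entry, samples, experiments)
```

Failing loudly on an unknown identifier gives the validator a clear replacement for the current name-matching check.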