HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

`ega_accessions` values are inconsistent in project c715cd2f-dc7c-44a6-9cd5-b6a6d9f075ae #49

Closed. amarjandu closed this issue 2 years ago

amarjandu commented 2 years ago

Affected Project: c715cd2f-dc7c-44a6-9cd5-b6a6d9f075ae

The project.ega_accessions field currently holds a single `;`-delimited string instead of one string per accession, which is inconsistent with the other _accessions fields.

From the query

select content
from `tdr-fp-546ade29.hca_prod_20201120_dcp2___20210910_dcp9.project` 
where project_id = 'c715cd2f-dc7c-44a6-9cd5-b6a6d9f075ae'

The trimmed output is

{
    "array_express_accessions": [
        "E-MTAB-8410"
    ],
    ...
    "geo_series_accessions": [
        "GSE132465",
        "GSE144735",
        "GSE132257"
    ],
    "insdc_project_accessions": [
        "ERP117727"
    ],
    "insdc_study_accessions": [
        "PRJNA548146",
        "PRJNA604751",
        "PRJNA546616"
    ],
    "ega_accessions": [
        "EGAS00001003779; EGAS00001003769"
    ],
    ...
}

I would expect the ega_accessions value to be two strings:

    "ega_accessions": [
        "EGAS00001003779",
        "EGAS00001003769"
    ],
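For illustration, a minimal Python sketch of the normalization I have in mind. `split_accessions` is a hypothetical helper, not part of any HCA tooling, and it assumes `;` (with optional surrounding whitespace) is the only delimiter in use.

```python
import re


def split_accessions(values):
    """Split entries that pack several accessions into one string.

    Hypothetical helper; assumes ';' is the only delimiter that needs handling.
    """
    result = []
    for value in values:
        # Split on ';' and drop any whitespace around each piece.
        for part in re.split(r"\s*;\s*", value):
            part = part.strip()
            if part:  # skip empty pieces left by a trailing ';'
                result.append(part)
    return result


print(split_accessions(["EGAS00001003779; EGAS00001003769"]))
# prints ['EGAS00001003779', 'EGAS00001003769']
```

Entries that are already single accessions pass through unchanged, so the helper should be safe to run over every _accessions list.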

There may be additional projects affected by this issue. Projects with dbgap_accessions might also be affected, as the metadata-schema shows a similar pattern/example: https://schema.humancellatlas.org/type/project/15.0.0/project
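A rough sketch of how other affected projects could be spotted once the `content` column has been exported from the query above. `find_delimited_accessions` is a hypothetical name, and the check assumes that a `;` inside any `*_accessions` entry signals the same problem.

```python
import json


def find_delimited_accessions(content, delimiter=";"):
    """Return {field: [offending entries]} for every *_accessions list.

    `content` may be the JSON text of a project document or an already
    parsed dict. Hypothetical helper, for illustration only.
    """
    doc = json.loads(content) if isinstance(content, str) else content
    suspicious = {}
    for field, values in doc.items():
        if field.endswith("_accessions") and isinstance(values, list):
            bad = [v for v in values if isinstance(v, str) and delimiter in v]
            if bad:
                suspicious[field] = bad
    return suspicious


sample = {
    "array_express_accessions": ["E-MTAB-8410"],
    "ega_accessions": ["EGAS00001003779; EGAS00001003769"],
}
print(find_delimited_accessions(sample))
# prints {'ega_accessions': ['EGAS00001003779; EGAS00001003769']}
```

An empty result means no `*_accessions` field in that document contains the delimiter.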

Raised https://github.com/HumanCellAtlas/metadata-schema/issues/1425 to update the pattern to include line anchors.
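To show why the anchors matter: JSON Schema `pattern` keywords are not implicitly anchored, so a pattern can match anywhere inside a string. The sketch below mimics that with Python's `re.search`. Both patterns are assumptions made up for illustration and are not copied from the metadata schema.

```python
import re

# Assumed patterns, for illustration only (not taken from the schema).
UNANCHORED = r"EGA[DS][0-9]{11}"
ANCHORED = r"^EGA[DS][0-9]{11}$"


def matches(pattern, value):
    # re.search succeeds if the pattern matches anywhere in the value,
    # which mirrors how an unanchored JSON Schema pattern behaves.
    return re.search(pattern, value) is not None


packed = "EGAS00001003779; EGAS00001003769"

print(matches(UNANCHORED, packed))            # prints True: the packed string slips through
print(matches(ANCHORED, packed))              # prints False: anchors reject it
print(matches(ANCHORED, "EGAS00001003779"))   # prints True: a single accession still passes
```

With anchors in place, a `;`-delimited value would fail validation rather than reach the snapshot.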

theathorn commented 2 years ago

@amarjandu to notify EBI (Enrique & Gabby) on Slack, and add link to Slack response here.

amarjandu commented 2 years ago

Slack thread: https://humancellatlas.slack.com/archives/C01360XN04S/p1635188437015800

ESapenaVentura commented 2 years ago

https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/315 for context on the actions taken on the dataset

tl;dr: should have the proper metadata next release.

A schema update is necessary to avoid this in the future, as stated in your ticket @amarjandu https://github.com/HumanCellAtlas/metadata-schema/issues/1425

theathorn commented 2 years ago

Both the issues Enrique references above are now closed.

melainalegaspi commented 2 years ago

Spike to verify resolution.

amarjandu commented 2 years ago

Fixed! See:

[Screenshot: Screen Shot 2022-01-21 at 8 22 55 AM]

🚀