ebi-ait / dcp-ingest-central

Central point of access for the Ingestion Service of the HCA DCP
Apache License 2.0

Migration for populating new project field in the biomaterial, process, protocol, file #90

Closed aaclan-ebi closed 3 years ago

prabh-t commented 3 years ago

The migration script is ready, but because of inconsistencies in the data in the db it hasn't been executed yet. Final checks and data clean-up are required.

aaclan-ebi commented 3 years ago

@prabh-t We created this operations ticket ebi-ait/hca-ebi-wrangler-central#175 to remove the analysis submissions, which should be the only source of those inconsistencies. I can run some checks on how many analysis metadata documents there are to remove, and we can compare that with the inconsistencies you found.

Noting here the inconsistencies in prod data you sent on slack:

#sub->proj: 71
#biomaterial: [95421, 0, 95416, 5, 0]
#file: [844830, 0, 290351, 554479, 0]
#process: [249764, 0, 235240, 14524, 0]
#protocol: [15037, 0, 502, 14535, 0]

// For each entity (biomaterial, file, process and protocol), the array represents:
// [total, already has proj ref, map, couldn't map, missing subEnv]
// "map" means I was able to map this entity to a project through the subEnv associated with the entity.
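The five buckets described in the comment above could be computed along these lines. This is only a minimal sketch, not the actual migration script: the `classify` function, the plain-object entity shape and the sample ids are all hypothetical, and `subToProject` stands in for the "sub->proj" lookup.

```javascript
// Hypothetical sketch: sort entities into the five buckets reported above.
// `subToProject` maps a submission envelope id to a project id.
function classify(entities, subToProject) {
  const counts = { total: 0, alreadyHasProject: 0, mapped: 0, couldNotMap: 0, missingSubEnv: 0 };
  for (const entity of entities) {
    counts.total += 1;
    if (entity.project) {
      counts.alreadyHasProject += 1; // project reference is already set
    } else if (!entity.submissionEnvelope) {
      counts.missingSubEnv += 1; // nothing to look the project up through
    } else if (subToProject.has(entity.submissionEnvelope)) {
      counts.mapped += 1; // envelope is linked to a project
    } else {
      counts.couldNotMap += 1; // envelope exists but has no project
    }
  }
  return counts;
}

// Made-up sample data, one entity per bucket.
const subToProject = new Map([["sub-1", "proj-1"]]);
const sample = [
  { id: "a", submissionEnvelope: "sub-1" },
  { id: "b", submissionEnvelope: "sub-2" },
  { id: "c" },
  { id: "d", submissionEnvelope: "sub-1", project: "proj-1" },
];
const result = classify(sample, subToProject);
console.log(result);
```

With this bucketing, the last four numbers always add up to the total, which also holds for each of the four arrays quoted above.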

So on prod we have 14609 submissions, of which only 2688 have the isUpdate property: 13 are marked as updates and the rest (2675) are not.

prabh-t commented 3 years ago

Great, thanks @aaclan-ebi! let me know once you've cleaned up the analysis submissions and then we can check the script again, hopefully the inconsistencies won't remain.

prabh-t commented 3 years ago

I am also slightly confused, as the project view user story is dependent on this migration/ticket. How have we deployed the other component changes and made them work on dev and staging?

aaclan-ebi commented 3 years ago

The changes will only work for new data. The project property of all existing metadata is empty. If you view any project created before the changes were deployed, it will appear to have no metadata linked to it.

prabh-t commented 3 years ago

so promoting this to prod before the migration may not be a good idea.

aaclan-ebi commented 3 years ago

I did some checks/queries in our database to investigate those entities which cannot be mapped to a project:

For files:

// Find all analysis files
> db.file.find({
  'content.describedBy': {
     $regex: /analysis_file/
  }
}).count()

554479

Conclusion: We can assume that the 554479 files which can't be mapped to a project are the analysis files. It is expected that these won't be mapped to any projects.

For processes:

// Find all analysis processes
> db.process.find({
  'content.describedBy': {
     $regex: /analysis_process/
  }
}).count()
14524

Conclusion: We can assume that the 14524 processes which can't be mapped to a project are the analysis processes. It is expected that these won't be mapped to any projects.

For biomaterials:

// Find all biomaterials which are update documents
> db.biomaterial.find({
  'isUpdate': true
},
{
    'submissionEnvelope': 1
}).count()
68

// Find all update submission envelopes containing those 68 update documents
// manually check if these update submissions have an update document for the project
// by using https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/<submission-object-id>/projects

> db.biomaterial.distinct( 'submissionEnvelope', {'isUpdate': true})
[
    DBRef("submissionEnvelope", ObjectId("5d7bd747bddf5d000898e7e5")),
    DBRef("submissionEnvelope", ObjectId("5d8900f3bddf5d000810cee5")), // has no project
    DBRef("submissionEnvelope", ObjectId("5e5fbf8ff499cd1e39f3b89f")),
    DBRef("submissionEnvelope", ObjectId("5e60e8e4f02bb753344c65c1")),
    DBRef("submissionEnvelope", ObjectId("5e610f0ef02bb753344c65d7"))
]

// Find all biomaterials which are update documents in the submission which has no project
> db.biomaterial.find({
  'isUpdate': true,
  'submissionEnvelope.$id': ObjectId("5d8900f3bddf5d000810cee5")
}).count()
5

Conclusion: The 5 biomaterials which can't be mapped to any projects are those 5 update documents that don't have a project in their update submission. It is expected that these won't be mapped to any projects.

For protocols:

// Find all analysis protocols
> db.protocol.find({
  'content.describedBy': {
     $regex: /analysis_protocol/
  }
}).count()
14524
> db.protocol.find({
  'isUpdate': true
}).count()
18

> db.protocol.find({
  'isUpdate': true
},
{
  'submissionEnvelope': 1
})

// manually checked if these update submissions have an update document for the project
// by using https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/<submission-object-id>/projects

{ "_id" : ObjectId("5d7bd74bbddf5d000898e831"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5d7bd747bddf5d000898e7e5")) } // has a project
{ "_id" : ObjectId("5e32e48b0abaea1b785b7a3b"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4860abaea1b785b7a39")) }
{ "_id" : ObjectId("5e32e48b0abaea1b785b7a3d"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4860abaea1b785b7a39")) }
{ "_id" : ObjectId("5e32e48b0abaea1b785b7a3f"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4860abaea1b785b7a39")) }
{ "_id" : ObjectId("5e32e4ce0abaea1b785b7a43"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4c90abaea1b785b7a41")) }
{ "_id" : ObjectId("5e32e4ce0abaea1b785b7a45"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4c90abaea1b785b7a41")) }
{ "_id" : ObjectId("5e32e4ce0abaea1b785b7a47"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e4c90abaea1b785b7a41")) }
{ "_id" : ObjectId("5e32e57d0abaea1b785b7a4b"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e5780abaea1b785b7a49")) }
{ "_id" : ObjectId("5e32e57d0abaea1b785b7a4d"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e5780abaea1b785b7a49")) }
{ "_id" : ObjectId("5e32e57d0abaea1b785b7a4f"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e32e5780abaea1b785b7a49")) }
{ "_id" : ObjectId("5e5fbfa3f499cd1e39f3b8a7"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e5fbf8ff499cd1e39f3b89f")) } // has a project
{ "_id" : ObjectId("5e60e8eef02bb753344c65d3"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e60e8e4f02bb753344c65c1")) } // has a project
{ "_id" : ObjectId("5e610f14f02bb753344c65ff"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e610f0ef02bb753344c65d7")) } // has a project
{ "_id" : ObjectId("5e610f14f02bb753344c6601"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e610f0ef02bb753344c65d7")) } // has a project
{ "_id" : ObjectId("5e610f14f02bb753344c6603"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e610f0ef02bb753344c65d7")) } // has a project
{ "_id" : ObjectId("5e610f14f02bb753344c6605"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e610f0ef02bb753344c65d7")) } // has a project
{ "_id" : ObjectId("5e627f3cf02bb753344c76a5"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e627f35f02bb753344c76a1")) } // has a project
{ "_id" : ObjectId("5e627f3cf02bb753344c76a7"), "submissionEnvelope" : DBRef("submissionEnvelope", ObjectId("5e627f35f02bb753344c76a1")) } // has a project

Conclusion:

aaclan-ebi commented 3 years ago

TLDR: Only protocols seem to have issues. All the inconsistencies where the metadata can't be mapped to a project for biomaterials (5), files (554479) and processes (14524) are expected.

@prabh-t would it be possible to get a list of the protocol object ids of those 14535? I'll compare it against the 14532 I found, find those 3 protocols and investigate why they don't have a project.
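The reconciliation behind this TLDR can be written down as a small check. This is just an illustrative sketch using the counts quoted in this thread; the `unexplained` helper and the variable names are made up for the example.

```javascript
// "Couldn't map" counts reported by the migration dry run (from this thread).
const couldNotMap = { biomaterial: 5, file: 554479, process: 14524, protocol: 14535 };
// Counts accounted for by the database checks above (from this thread).
const accountedFor = { biomaterial: 5, file: 554479, process: 14524, protocol: 14532 };

// Hypothetical helper: how many unmapped documents per entity are still unexplained.
function unexplained(reported, explained) {
  const gaps = {};
  for (const [entity, count] of Object.entries(reported)) {
    gaps[entity] = count - (explained[entity] ?? 0);
  }
  return gaps;
}

const gaps = unexplained(couldNotMap, accountedFor);
console.log(gaps); // only the protocol entry is non-zero
```

Only the protocol gap is non-zero (3), which is why the list of unmapped protocol ids is needed.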

prabh-t commented 3 years ago

I can get you the protocol ids. Are we going to clear the analysis submissions in prod? At some point we may need to migrate

prabh-t commented 3 years ago

unmapped protocol ids https://app.zenhub.com/files/299258791/ec0faf85-4961-46e3-a044-0f29479b067d/download

aaclan-ebi commented 3 years ago

@prabh-t yes, we're going to delete analysis submissions. We may actually want to delete the update submissions as well. All updates should be in the original documents, so I think it's fine.

I'll prioritise deleting the analysis submissions for now. Once we remove them, the only remaining source of inconsistencies will be the update documents, which are also expected not to be mapped to a project if there is no project update document in the update submission envelope.

Yeah, let's make those migration changes you mentioned as part of the steps when we deprecate submission envelopes.

aaclan-ebi commented 3 years ago

Comparing the protocols that can't be matched from Prabhat against the protocols I found which should be expected to not be linked to any project:

update-protocols.json analysis-protocols.json

I found the 3 protocols:

< 5e32e57d0abaea1b785b7a4f from a valid update submission - yes this can't be mapped to a project
< 5e627f3cf02bb753344c76a5 from an archiving - this could be mapped to the project
< 5e627f3cf02bb753344c76a7 from an archiving - this could be mapped to the project
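A comparison like this boils down to a set difference between the two id lists. The sketch below is illustrative only: the `difference` helper is hypothetical, and the two short arrays are a made-up subset built from ids that appear in this thread, not the contents of the attached JSON files.

```javascript
// Hypothetical helper: ids that were reported as unmapped but are not in the expected list.
function difference(reported, expected) {
  const expectedSet = new Set(expected);
  return reported.filter((id) => !expectedSet.has(id));
}

// Illustrative subset of protocols expected to be unmapped.
const expectedUnmapped = [
  "5e32e48b0abaea1b785b7a3b",
  "5e32e48b0abaea1b785b7a3d",
];
// Illustrative subset of protocols the migration reported as unmapped.
const reportedUnmapped = [
  ...expectedUnmapped,
  "5e32e57d0abaea1b785b7a4f",
  "5e627f3cf02bb753344c76a5",
  "5e627f3cf02bb753344c76a7",
];

const unexpected = difference(reportedUnmapped, expectedUnmapped);
console.log(unexpected);
```

Using a `Set` for the expected ids keeps the lookup cheap even for lists with tens of thousands of entries.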

@prabh-t all the analysis submissions are now gone in prod. Should we proceed with the migration? I think we should promote the Core changes first before migration so that it'll create the index first.

I'll confirm when we can do the production release.

prabh-t commented 3 years ago

Great! let me know when we're ready to migrate.

aaclan-ebi commented 3 years ago

@prabh-t We're good to go for migration! 👍

@jacobwindsor has already promoted the changes to prod.

aaclan-ebi commented 3 years ago

@prabh-t Have you checked in the migration scripts somewhere?

It would be nice to run this before the DCP demo tomorrow.

prabh-t commented 3 years ago

@aaclan-ebi this has taken a long time to execute and verify, and I can confirm prod is now migrated. In total, 621,889 docs were affected and updated to have a project reference. There are a number of biomaterials and protocols that couldn't be mapped to a project; see the lists below. Do you have any idea why this may be the case? Are these orphans/documents that need to be cleaned up?

biomaterials (5)
_id 5d8900f6bddf5d000810cee7 (uuid ff42139d-d2b9-3b89-9090-490fe42fed88) sub_id 5d8900f3bddf5d000810cee5
_id 5d8900f6bddf5d000810cee9 (uuid 204dca30-c660-971a-7bd0-b74acc72af81) sub_id 5d8900f3bddf5d000810cee5
_id 5d8900f6bddf5d000810ceeb (uuid d0465a79-d5fc-da85-ce08-912e980f7ea1) sub_id 5d8900f3bddf5d000810cee5
_id 5d8900f6bddf5d000810ceed (uuid 254796df-8ab2-9bf6-0c45-c4636ee83399) sub_id 5d8900f3bddf5d000810cee5
_id 5d8900f6bddf5d000810ceef (uuid 3644a1e7-b0e6-0515-c898-81f7545479ba) sub_id 5d8900f3bddf5d000810cee5

protocols (11)
_id 5e32e48b0abaea1b785b7a3b (uuid 0b4956d6-e67c-da64-d92f-557ca3db2da1) sub_id 5e32e4860abaea1b785b7a39
_id 5e32e48b0abaea1b785b7a3d (uuid d946384c-0d53-065e-632b-5c1f991fa3b5) sub_id 5e32e4860abaea1b785b7a39
_id 5e32e48b0abaea1b785b7a3f (uuid de49f149-cd07-e0fd-d306-65cdb7b255b0) sub_id 5e32e4860abaea1b785b7a39
_id 5e32e4ce0abaea1b785b7a43 (uuid 0b4956d6-e67c-da64-d92f-557ca3db2da1) sub_id 5e32e4c90abaea1b785b7a41
_id 5e32e4ce0abaea1b785b7a45 (uuid d946384c-0d53-065e-632b-5c1f991fa3b5) sub_id 5e32e4c90abaea1b785b7a41
_id 5e32e4ce0abaea1b785b7a47 (uuid de49f149-cd07-e0fd-d306-65cdb7b255b0) sub_id 5e32e4c90abaea1b785b7a41
_id 5e32e57d0abaea1b785b7a4b (uuid 0b4956d6-e67c-da64-d92f-557ca3db2da1) sub_id 5e32e5780abaea1b785b7a49
_id 5e32e57d0abaea1b785b7a4d (uuid d946384c-0d53-065e-632b-5c1f991fa3b5) sub_id 5e32e5780abaea1b785b7a49
_id 5e32e57d0abaea1b785b7a4f (uuid de49f149-cd07-e0fd-d306-65cdb7b255b0) sub_id 5e32e5780abaea1b785b7a49
_id 5e627f3cf02bb753344c76a5 (uuid 694f7f79-419c-396f-688e-4768d4cb1987) sub_id 5e627f35f02bb753344c76a1
_id 5e627f3cf02bb753344c76a7 (uuid 884a9f48-7ec6-2372-ea0d-c5bc5371289f) sub_id 5e627f35f02bb753344c76a1

lauraclarke commented 3 years ago

Are there human assigned ids or other useful information about these objects which would help identify them?

prabh-t commented 3 years ago

@lauraclarke These are mongodb unique identifiers (_id) for the documents. I've included the uuids in the list above.

lauraclarke commented 3 years ago

I was thinking more about the stuff which actually contains context, like the biomaterial_ids or protocol names. We can't tell what the uuid refers to, whereas a biomaterial id or protocol name is something you might be able to search the Google Docs or GitHub tickets for.

prabh-t commented 3 years ago

You can open the document and find out more via the ingest API:

https://api.ingest.archive.data.humancellatlas.org/biomaterials/<_id>
https://api.ingest.archive.data.humancellatlas.org/protocols/<_id>

Does this help?

lauraclarke commented 3 years ago

Is there any way to search in ingest for a given biomaterial_id or similar?

https://api.ingest.archive.data.humancellatlas.org/biomaterials/5d8900f6bddf5d000810cee7 Gives us "biomaterial_name" : "Patient 1 Total Liver Homogenate",

which I am fairly certain comes from https://data.humancellatlas.org/explore/projects/4d6f6c96-2a83-43d8-8fe1-0f53bffd4674?catalog=dcp2 so is likely a duplicate record which wasn't cleaned up properly

prabh-t commented 3 years ago

"Is there any way to search in ingest for a given biomaterial_id or similar?"

Does the link/API endpoint above not do that, i.e. search for a given biomaterial_id? You can follow the links at the end of the document to the linked documents/entities of the given biomaterial, for example:

https://api.ingest.archive.data.humancellatlas.org/biomaterials/5d8900f6bddf5d000810cee7/project  -- NO PROJECT AS EXPECTED
https://api.ingest.archive.data.humancellatlas.org/biomaterials/5d8900f6bddf5d000810cee7/inputToProcesses  -- NO INPUT
https://api.ingest.archive.data.humancellatlas.org/biomaterials/5d8900f6bddf5d000810cee7/derivedByProcesses  -- NO OUTPUT
https://api.ingest.archive.data.humancellatlas.org/biomaterials/5d8900f6bddf5d000810cee7/submissionEnvelope

We can find all other biomaterials (and other entities) for the particular submission if that helps.

From the links, it looks like this biomaterial has no input/output and is associated with an update submission which doesn't refer to a project (which I think is expected as per @aaclan-ebi's comment above, based on how this has been modelled and implemented in the past). The unmapped protocols are what need to be accounted for.
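The four links listed above follow one pattern, so they can be generated for any biomaterial `_id`. A minimal sketch: the `biomaterialLinks` helper and constant names are hypothetical, while the base URL and relation names are the ones used in the links above.

```javascript
// Base URL and relation names as they appear in the links above.
const INGEST_API = "https://api.ingest.archive.data.humancellatlas.org";
const BIOMATERIAL_RELATIONS = ["project", "inputToProcesses", "derivedByProcesses", "submissionEnvelope"];

// Hypothetical helper: build the linked-resource URLs for one biomaterial _id.
function biomaterialLinks(id) {
  const base = `${INGEST_API}/biomaterials/${id}`;
  return Object.fromEntries(BIOMATERIAL_RELATIONS.map((rel) => [rel, `${base}/${rel}`]));
}

const links = biomaterialLinks("5d8900f6bddf5d000810cee7");
console.log(links.project);
console.log(links.submissionEnvelope);
```

This only builds the URLs; fetching them and interpreting an empty or missing result is left out of the sketch.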

lauraclarke commented 3 years ago

I guess I want to search by the schema field biomaterial_name "Patient 1 Total Liver Homogenate" and not by biomaterial_id "5d8900f6bddf5d000810cee7". I suspect we have more than one entity with the biomaterial_name "Patient 1 Total Liver Homogenate", which means this can be safely deleted.

For anything where there is only one copy, more investigation will be needed to figure out if that is a problem

Alternatively, if we have backups we could decide to delete them all and see if anything odd happens

aaclan-ebi commented 3 years ago

Thanks for the update on migration @prabh-t ! And thanks for checking those metadata @lauraclarke

These metadata are from update submissions which don't have any projects linked to them because they don't have any updates to the project. I noticed some have been submitted and applied to original documents but some are not.

  1. https://contribute.data.humancellatlas.org/submissions/detail?id=5d8900f3bddf5d000810cee5 - complete. I believe this has already been submitted to DCP & EMBL-EBI Archives and there are no further actions needed.

  2. https://contribute.data.humancellatlas.org/submissions/detail?id=5e32e4860abaea1b785b7a39 - valid

  3. https://contribute.data.humancellatlas.org/submissions/detail?id=5e32e4c90abaea1b785b7a41 - valid

  4. https://contribute.data.humancellatlas.org/submissions/detail?id=5e32e5780abaea1b785b7a49 - valid

Submissions 2, 3 and 4 are the same and have the following updates, but have not been submitted. They belong to project HumanColonicMesenchymeIBD: https://api.ingest.archive.data.humancellatlas.org/projects/5d3b11bf9be88c0008a9d746

They have 3 protocols, which have the following updates:

protocol uuid: 64da7ce6-d656-490b-a12d-dba37c552fd9

field: library_construction_method.text
old value: "Smart-Seq"
new value:  "Smart-seq"

field: library_construction_method.ontology_label
old value: "Smart-Seq"
new value:  "Smart-seq"

protocol uuid:  5e06530d-4c38-46d9-b5a3-1f991f5c2b6

field: library_construction_kit.catalog_number
old value: "10X Genomics"
new value:  "PN-120237"

field: library_construction_kit.manufacturer
old value: "PN-120237"
new value:  "10x Genomics"

protocol uuid:  fde007cd-49f1-49de-b055-b2b7cd6506d3

field: library_construction_kit.catalog_number
old value: "10X Genomics"
new value:  "PN-120237"

field: library_construction_kit.manufacturer
old value: "PN-120237"
new value:  "10x Genomics"

This is a DCP1 dataset. I realised DCP1 datasets are currently in the UCSC (Hannes' team) Terra staging bucket. We don't know yet how to propagate updates for DCP1 datasets. We could just keep a note of this and apply the updates when we are able to.
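One way to keep such a note in a form that can be applied later is to record each pending change as a field path plus its new value. The sketch below is only an illustration under assumptions: the `setByPath` helper is hypothetical, the `content` object is a guessed minimal shape for a protocol document, and only the field names and values come from the list above.

```javascript
// Hypothetical helper: set a value on a nested object using a dotted field path.
function setByPath(target, path, value) {
  const keys = path.split(".");
  let node = target;
  for (const key of keys.slice(0, -1)) {
    // Create intermediate objects when they are missing.
    if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
    node = node[key];
  }
  node[keys[keys.length - 1]] = value;
  return target;
}

// Pending updates noted above for one of the protocols.
const pendingUpdates = [
  { field: "library_construction_kit.catalog_number", newValue: "PN-120237" },
  { field: "library_construction_kit.manufacturer", newValue: "10x Genomics" },
];

// Assumed minimal shape of the protocol content before the update.
const content = {
  library_construction_kit: { catalog_number: "10X Genomics", manufacturer: "PN-120237" },
};

for (const update of pendingUpdates) {
  setByPath(content, update.field, update.newValue);
}
console.log(content);
```

The helper does not handle array-valued fields, so paths that pass through a list would need extra care.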

  5. https://contribute.data.humancellatlas.org/submissions/detail?id=5e627f35f02bb753344c76a1 - archiving

Update was sent to archives but not to DCP (DCP 1 dataset).

protocol uuid: 7223c67e-489f-4a88-9f28-7153bcc50dea

field: input_nucleic_acid_molecule.ontology_label
old value: unknown
new value: "polyA RNA extract"

protocol uuid: 6f399c41-797f-4f69-8719-cbd468478e68

field: input_nucleic_acid_molecule.ontology_label
old value: unknown
new value: "polyA RNA extract"

field: library_construction_method.text
old value: unknown
new value: "10X 3' v2 sequencing"

field: library_construction_method.ontology
old value: unknown
new value: "EFO:0009899"

field: library_construction_method.ontology_label
old value: unknown
new value: "10X 3' v2 sequencing"

Currently, we're not going in the direction of using update submissions to do post-submission updates. But things are not final yet and we're still working on the updates implementation. I believe there's no harm in deleting these update submissions (this needs a separate ticket for update submission deletion). We'll just lose the audit info of those past DCP1 updates, which I guess no one cares about since they're very minor/correction updates (and are probably tracked in another ticket). For those updates which weren't applied yet, we just keep a note of them and apply them once we can.

aaclan-ebi commented 3 years ago

Hi @prabh-t, I looked at some projects in the UI created before the project view was implemented and the lists of metadata and data files in those projects are still empty.

Sample project: https://contribute.data.humancellatlas.org/projects/detail?uuid=abe1a013-af7a-45ed-8c26-f3793c24a1f4

There's also an error viewing the project link from the metadata: https://api.ingest.archive.data.humancellatlas.org/biomaterials/5daf405471fe4a0008e325b1/project

Could this be an issue related to the migration?

aaclan-ebi commented 3 years ago

It looks like the metadata documents were updated with an incorrect value for the DBRefs.

Instead of:

"project": DBRef("project", "5cd588a9d96dad00085634a4")

it should be

"project": DBRef("project", ObjectId("5cd588a9d96dad00085634a4"))

@prabh-t is it possible to simply rerun the migration script?

Example biomaterial in the database:

{
    "_id": ObjectId("5cd588aad96dad00085634b8"),
    "_class": "org.humancellatlas.ingest.biomaterial.Biomaterial",
    "content": {
        "describedBy": "https://schema.humancellatlas.org/type/biomaterial/13.1.0/cell_suspension",
        "schema_type": "biomaterial",
        "biomaterial_core": {
            "biomaterial_id": "GSM2171883 1",
            "biomaterial_description": "Single cell from human pancreas",
            "ncbi_taxon_id": [9606],
            "insdc_sample_accession": "SRS1458606"
        },
        "genus_species": [{
            "text": "Homo sapiens",
            "ontology": "NCBITaxon:9606",
            "ontology_label": "Homo sapiens"
        }],
        "estimated_cell_count": 1
    },
    "validationState": "VALID",
    "validationErrors": [],
    "version": NumberLong(9),
    "submissionDate": ISODate("2019-05-10T14:20:26.039Z"),
    "updateDate": ISODate("2019-05-10T14:27:12.176Z"),
    "user": "anonymousUser",
    "lastModifiedUser": "anonymousUser",
    "type": "BIOMATERIAL",
    "uuid": {
        "uuid": BinData(3, "AkllxhZ07aGMCEfJoOrlog==")
    },
    "events": [],
    "projects": [DBRef("project", ObjectId("5cd588a9d96dad00085634a4"))],
    "inputToProcesses": [DBRef("process", ObjectId("5cd58975d96dad0008565c84")), DBRef("process", ObjectId("5cd58975d96dad0008565c84"))],
    "derivedByProcesses": [DBRef("process", ObjectId("5cd58939d96dad0008565294"))],
    "isUpdate": false,
    "submissionEnvelope": DBRef("submissionEnvelope", ObjectId("5cd588a4d96dad00085634a2")),
    "project": DBRef("project", "5cd588a9d96dad00085634a4")
}

aaclan-ebi commented 3 years ago

Will reopen; there's an issue in the data that we need to fix. This could be looked into by operations.
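The data issue described above (a plain string where an ObjectId is expected in the `project` DBRef) can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: `FakeObjectId` is a stand-in for the driver's ObjectId type, the `{ ref, id }` object is a simplified model of a DBRef, and `repairProjectRef` is not the actual fix that was applied.

```javascript
// Stand-in for the database driver's ObjectId type, for illustration only.
class FakeObjectId {
  constructor(hex) {
    if (!/^[0-9a-f]{24}$/i.test(hex)) {
      throw new Error(`not a 24-character hex string: ${hex}`);
    }
    this.hex = hex;
  }
}

// True when the document's project reference stores its id as a plain string.
function hasStringProjectRef(doc) {
  return doc.project != null && typeof doc.project.id === "string";
}

// Return a copy of the document with the string id wrapped in the ObjectId stand-in.
function repairProjectRef(doc) {
  if (!hasStringProjectRef(doc)) return doc;
  return { ...doc, project: { ...doc.project, id: new FakeObjectId(doc.project.id) } };
}

// Simplified model of the broken biomaterial shown above.
const broken = { project: { ref: "project", id: "5cd588a9d96dad00085634a4" } };
const fixed = repairProjectRef(broken);
console.log(hasStringProjectRef(broken), hasStringProjectRef(fixed));
```

The repair returns a new object rather than mutating the input, so the original document can still be inspected or logged before anything is written back.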