ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Update DOIs for some projects #1191

Open ESapenaVentura opened 9 months ago

ESapenaVentura commented 9 months ago

Description of the issue

As per the message left by Jason on Slack, there is a set of projects with DOIs that are either incorrect or need updating:

arschat commented 8 months ago

Projects that are not in Exported status are listed below. All are set to graph valid, except Reprogrammed_Dendritic_Cells, which needs an update (#1204).

| Project | Status |
| --- | --- |
| Human prefrontal cortex gene regulatory dynamics from gestation to adulthood at single-cell resolution | Complete |
| Pro-inflammatory T helper 17 directly harms oligodendrocytes in neuroinflammation | Complete |
| Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics | Exporting |
| Single cell profiling of human induced dendritic cells generated by direct reprogramming of embryonic fibroblasts | Graph invalid |
| Single-cell Transcriptome Atlas of the Human Corpus Cavernosum | Complete |
| Single-cell transcriptomic and proteomic analysis of Parkinson’s disease brains | Complete |

arschat commented 8 months ago

| project_short_name | ingest_status | project_uuid | dcp | ingest |
| --- | --- | --- | --- | --- |
| CD4TCellsInCrohnsDisease | Exported | c844538b-8854-4a95-bd01-aacbaf86d97f | Valid Publication Link | Fixed |
| earlyHumanEmbryogenesisAtlas | Exported | e255b1c6-1143-4fa6-83a8-528f15b41038 | Biorxiv | added publication |
| CryoPancreaticIsletCellPatchSeq | Exported | 8559a8ed-5d8c-4fb6-bde8-ab639cebf03c | No Link or DOI | Fixed |
| TCellsNeuroinflammation | Exported | 41fb1734-a121-4616-95c7-3b732c9433c7 | Unspecified | No Publication Data |
| HewittRetinalOrganoids | Exported | 77780d56-03c0-481f-aade-2038490cef9f | Unspecified | No Publication Data |
| NasaSpaceMicePbmc | Exported | a2a2f324-cf24-409e-a859-deaee871269c | Unspecified | No Publication Data |
| NasaSpaceMiceSpleens | Exported | aff9c3cd-6b84-4fc2-abf2-b9c0b3038277 | Unspecified | No Publication Data |
| HumanDecidualLeukocytes | Exported | c302fe54-d22d-451f-a130-e24df3d6afca | Unspecified | No Publication Data |
| deciduaPregnancyLoss | Exported | 3cfcdff5-dee1-4a7b-a591-c09c6e850b11 | Valid Publication Link | No DOI |
| GompertsAirwatCfCells | Exported | e526d91d-cf3a-44cb-80c5-fd7676b55a1d | Valid Publication Link | Valid Link & DOI |
| HumanPancreaticIslets | Exported | 78b2406d-bff2-46fc-8b61-20690e602227 | 2 Valid Publications Links | No DOI for Neither Publication |
| humanCorticalDevelopmentLandscape | Complete | 77dedd59-1376-4887-9bca-dc42b56d5b7a | Valid Publication Link | Typo in DOI |
| AtlasOfTheHumanCorpusCavernosum | Exported | 5b910a43-7fb5-4ea7-b9d6-43dbd1bf2776 | No Link or DOI | Valid Link & DOI |
| ZhangLabPdBrainNuclei | Exported | 9a23ac2d-93dd-4bac-9bb8-040e6426db9d | Unspecified | No Publication Data |
| humanOligodendrocytesCulture | Exporting | ede2e0b4-6652-464f-abbc-0b2d964a25a0 | Valid Publication Link | Typo in DOI |
| Reprogrammed_Dendritic_Cells | Graph invalid | 116965f3-f094-4769-9d28-ae675c1b569c | Unspecified | No Publication Data |
| Fetal/Maternal Interface | Graph validating | f83165c5-e2ea-4d15-a5cf-33f3550bffde | Valid Publication Link | No Publication Data |

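
Since a couple of the rows above are flagged with a typo in the DOI, a quick syntax check before export might catch some of these. The following is only a rough sketch: the function names are hypothetical and not part of any ingest tooling, and the pattern is the one commonly attributed to Crossref for modern DOIs. It checks the shape of the string only; it does not confirm that the DOI resolves or belongs to the right paper.

```python
import re

# Syntax-only pattern (commonly attributed to Crossref). Assumption: all DOIs
# of interest are "modern" ones starting with "10." and a 4-9 digit prefix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

# Prefixes people often paste along with the DOI itself.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Strip surrounding whitespace and a leading resolver URL or 'doi:' prefix."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def looks_like_doi(raw: str) -> bool:
    """Return True if the value has the shape of a DOI (hypothetical helper)."""
    return bool(DOI_PATTERN.match(normalise_doi(raw)))
```

For example, `looks_like_doi("10.1234/example.5678")` passes, while a value with the letter `O` in place of the zero, such as `"1O.1234/example.5678"`, does not.
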
idazucchi commented 8 months ago

Enrique and I are exporting GompertsAirwatCfCells for the lung atlas. I've updated the DOI as well and will check off the box as soon as the export has gone through.

arschat commented 8 months ago

Triggered validation for Fetal/Maternal Interface, but it takes some time to validate (note to self, @arschat: check on Monday whether it is stuck).

Update (Monday 20/11/23): still stuck in graph validating.

arschat commented 8 months ago

Update on 21 Nov: 3 remaining projects

| Short Name | uuid | State | Fixed doi in ingest |
| --- | --- | --- | --- |
| Fetal/Maternal Interface | f83165c5-e2ea-4d15-a5cf-33f3550bffde | Graph validating | True |
| humanOligodendrocytesCulture | ede2e0b4-6652-464f-abbc-0b2d964a25a0 | Exporting | True |
| Reprogrammed_Dendritic_Cells | 116965f3-f094-4769-9d28-ae675c1b569c | Metadata valid | True |

idazucchi commented 8 months ago

If we have capacity this week we can look into the stuck projects. Worst case, Fetal/Maternal Interface can be pushed to graph valid manually, and we can retry exporting for humanOligodendrocytesCulture.

ESapenaVentura commented 8 months ago

| Short Name | uuid | State | Comment |
| --- | --- | --- | --- |
| Fetal/Maternal Interface | f83165c5-e2ea-4d15-a5cf-33f3550bffde | Graph valid | Already been validated in the past - No need to re-run it now since you only modified project |
| humanOligodendrocytesCulture | ede2e0b4-6652-464f-abbc-0b2d964a25a0 | Exported | Got stuck in exporting - But project metadata has been exported, so we're cool |
| Reprogrammed_Dendritic_Cells | 116965f3-f094-4769-9d28-ae675c1b569c | Graph valid | Pushed to graph valid - Will need update in the future |

arschat commented 8 months ago
ESapenaVentura commented 7 months ago

The Fetal/Maternal Interface dataset has the issue described for DCP1 datasets: missing fields in the fastq metadata. There is a script to fill in the missing fields: https://github.com/ebi-ait/hca-ebi-dev-team/tree/master/scripts/fill_dcp1_file_metadata

arschat commented 7 months ago

While running the fill_dcp1_file_metadata script in dry-run mode for the Fetal/Maternal Interface, I got the following error after some of the 15290 files had been processed:

Error message

    $ python fill_dcp1_metadata.py -p f83165c5-e2ea-4d15-a5cf-33f3550bffde -d
    A log of the operation will be saved in f83165c5-e2ea-4d15-a5cf-33f3550bffde.log
      1%|▌ | 229/15290 [03:56<4:19:31, 1.03s/it]
    Traceback (most recent call last):
      File "/Users/arsenios/Documents/GitHub/hca-ebi-dev-team/scripts/fill_dcp1_file_metadata/fill_dcp1_metadata.py", line 182, in patch_file_metadata
        entity = ingest_api.get_entity_by_uuid('files', uuid)
      File "/Users/arsenios/miniconda3/envs/fill_dcp1/lib/python3.10/site-packages/hca_ingest/api/ingestapi.py", line 200, in get_entity_by_uuid
        return self.get(url, params=params).json()
      File "/Users/arsenios/miniconda3/envs/fill_dcp1/lib/python3.10/site-packages/hca_ingest/api/ingestapi.py", line 71, in get
        response.raise_for_status()
      File "/Users/arsenios/miniconda3/envs/fill_dcp1/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 500 Server Error: for url: https://api.ingest.archive.data.humancellatlas.org/files/search/findByUuid?uuid=5a47f2dc-55cb-4a10-b14c-8e1738e92a5c

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/arsenios/Documents/GitHub/hca-ebi-dev-team/scripts/fill_dcp1_file_metadata/fill_dcp1_metadata.py", line 255, in <module>
        main(args.project_uuid, args.dry_run)
      File "/Users/arsenios/Documents/GitHub/hca-ebi-dev-team/scripts/fill_dcp1_file_metadata/fill_dcp1_metadata.py", line 246, in main
        metadata = patch_file_metadata(uuid_metadata_map, ingest_api, dry_run)
      File "/Users/arsenios/Documents/GitHub/hca-ebi-dev-team/scripts/fill_dcp1_file_metadata/fill_dcp1_metadata.py", line 184, in patch_file_metadata

I ran the script multiple times and got the same error on a different file each time.
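
Because the 500 error lands on a different file each run, it looks intermittent, so one possible workaround is to retry each lookup a few times with a growing pause. This is only a sketch under that assumption and is not how the script currently works. `call_with_retries` is a hypothetical helper; the only name taken from the traceback is `ingest_api.get_entity_by_uuid`, which appears in the usage comment.

```python
import time


def call_with_retries(func, *args, attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))


# Possible use inside the patching loop (assumption, not tested against ingest):
# entity = call_with_retries(ingest_api.get_entity_by_uuid, 'files', uuid)
```

In a real run it would probably be safer to pass `retry_on=(requests.exceptions.HTTPError,)` so that genuine bugs are not silently retried.
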

arschat commented 7 months ago

Enrique has started downloading the files in order to re-submit via the update-uuids script.

idazucchi commented 7 months ago

Current solution for Fetal/Maternal Interface: reconstruct the submission with the script and DCP2ify the dataset.

idazucchi commented 7 months ago

Enrique to modify the script to create each type of entity separately while keeping track of the uuid changes - when capacity allows, probably in the new year

ESapenaVentura commented 6 months ago

The script has already been modified so that it can be re-launched, and it works! However, the second part of the script creates the linking and relies on accessing several thousand entities through the API, one by one, so it takes a long time to execute.

I am currently modifying the script so that this data can be retrieved up front or stored during execution, and then accessed from memory. I will push the changes once I confirm they work.
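
A minimal sketch of the in-memory idea described above, assuming each entity can be fetched by type and UUID. `EntityCache` and its `fetch` argument are hypothetical names, not the actual change to the script. Each entity is requested from the API at most once, and entities already known from an earlier step can be preloaded.

```python
class EntityCache:
    """Keep fetched entities in memory, keyed by (entity_type, uuid).

    `fetch` is any callable taking (entity_type, uuid) and returning the
    entity; in the script this could wrap the ingest API client (assumption).
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}
        self.api_calls = 0  # number of times the API was actually hit

    def preload(self, entity_type, entities_by_uuid):
        """Store entities already in hand, e.g. those created earlier in the run."""
        for uuid, entity in entities_by_uuid.items():
            self._store[(entity_type, uuid)] = entity

    def get(self, entity_type, uuid):
        """Return the cached entity, fetching it only on the first request."""
        key = (entity_type, uuid)
        if key not in self._store:
            self._store[key] = self._fetch(entity_type, uuid)
            self.api_calls += 1
        return self._store[key]
```

With this shape, the linking step would call `cache.get('files', uuid)` instead of going to the API directly, so repeated lookups of the same entity cost nothing after the first.
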

ESapenaVentura commented 6 months ago

Putting the Fetal/Maternal interface on hold.

Doing it completely programmatically is borderline impossible, given the number of times the script needs to be retried and the manual work needed to ensure retries are clean.

I had another idea to make the process cleaner and easier: I have created another 2 scripts, complementary to the first. Instead of submitting programmatically, we can submit to ingest, creating a submission with completely new UUIDs, and the new scripts will create a map by comparing IDs in the spreadsheet AND the submission.
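
To illustrate the mapping idea, here is a rough sketch that pairs old and new UUIDs by matching entity type and ID. It is not the actual pair of scripts. It assumes each side can be reduced to records with `type`, `id` and `uuid` keys, which is a hypothetical layout, and `build_uuid_map` is a hypothetical name.

```python
def build_uuid_map(old_entities, new_entities):
    """Map old UUID -> new UUID for entities sharing the same (type, id).

    Both arguments are iterables of dicts with 'type', 'id' and 'uuid' keys
    (assumed layout). Returns the mapping plus the (type, id) keys from the
    old side that had no counterpart in the new submission.
    """
    new_by_key = {}
    for entity in new_entities:
        key = (entity["type"], entity["id"])
        if key in new_by_key:
            # An ambiguous ID would make the map unreliable, so stop early.
            raise ValueError(f"duplicate ID in new submission: {key}")
        new_by_key[key] = entity["uuid"]

    mapping, unmatched = {}, []
    for entity in old_entities:
        key = (entity["type"], entity["id"])
        if key in new_by_key:
            mapping[entity["uuid"]] = new_by_key[key]
        else:
            unmatched.append(key)
    return mapping, unmatched
```

Returning the unmatched keys separately makes it easier to spot entities that were dropped or renamed between the spreadsheet and the new submission.
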

The problem is that creating the submission in ingest crashed it with an error, and the submission can't be deleted.

I need to prioritize other work at the moment, but @arschat or @idazucchi if you can help me create the submission correctly (I can send you the spreadsheet) I should be able to finish the job!

arschat commented 6 months ago

Ida deleted the submission, and I created a new submission with this spreadsheet: fetalMaternal_interface.xlsx

However, the number of entities crashed ingest (too many Kubernetes requests).

('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

It will be put on hold for now due to low dev capacity, but will have to tackle that again in a couple of weeks.

arschat commented 5 months ago

As discussed in the ops review for Reprogrammed_Dendritic_Cells (#1204), the protocol that caused the graph invalid state has been fixed, and only the metadata was exported so that the publication DOI is updated. The whole-project update, however, is yet to be completed.

Since the DOI is fixed, we can consider this project as complete here.

arschat commented 5 months ago

@gabsie to inform @amnonkhen about the Fetal/Maternal Interface problem, but it is on hold for now.