ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE149689 - PBMC_Healthy/Covid19/Flu #97

Closed ESapenaVentura closed 6 months ago

ESapenaVentura commented 4 years ago

Short name pbmcCov19Flu

Primary Wrangler: Enrique Secondary Wrangler: Ray

Associated files:

Google Drive: https://drive.google.com/drive/folders/1pPN9F9YGuw9bMHQBUi7wl18x53V0V_sO *Ingest

Key Events

Please track the below as well as the key events:

  1. Track the date the first spreadsheet was received and the final spreadsheet was sent by editing the ticket to include the date next to the event.
  2. Track spreadsheet iterations by placing asterisks next to the received-spreadsheet event.
  3. Track any metadata issues/tickets made for the dataset with a bulleted list of links under the received-spreadsheet event. Links should point to the ticket in the metadata repo.

ESapenaVentura commented 4 years ago

Metadata is valid: https://ui.ingest.archive.data.humancellatlas.org/submissions/detail?id=5f180039fe9c934c8b8343ef

Data files are being uploaded. The s3 bucket has been recorded in a google doc within the folder.

@ebi-ait/hca-wranglers Does anyone want to be the secondary wrangler and review this dataset?

ESapenaVentura commented 4 years ago

This dataset is blocked due to some (6/40) data files not being available.

Tried manually downloading those files with fasterq-dump, but it gives the following error for one of the files after an hour and a half:

2020-07-23T09:41:34 fasterq-dump.2.10.8 err: cmn_iter.c cmn_read_String( #694054913 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-07-23T09:41:35 fasterq-dump.2.10.8 err: cmn_iter.c cmn_read_uint8_array( #152199169 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-07-23T09:41:35 fasterq-dump.2.10.8 err: row #152199169 : READ.len(119) != QUALITY.len(0) (F)
2020-07-23T09:41:35 fasterq-dump.2.10.8 fatal: SIGNAL - Segmentation fault
fasterq-dump quit with error code 1

Which leads me to think the data files have not been replicated to the ENA warehouse or made accessible directly through SRA because they are corrupt, and this is a pretty new dataset.
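For reference, the usual workaround for these `VCursorCellDataDirect` timeout errors is to run `prefetch` first, so `fasterq-dump` works from a local `.sra` copy instead of streaming from the remote repository. A minimal sketch of that two-step workflow (the output directory is an assumption; the accession is one of the runs from this dataset):

```python
import subprocess

def build_dump_commands(accession: str, outdir: str = "fastq"):
    """Return the prefetch + fasterq-dump command lines for one SRA run.

    Running prefetch first downloads the .sra file locally, which avoids
    the network-timeout (rcTimeout/rcExhausted) errors seen when
    fasterq-dump streams directly from the remote repository.
    """
    prefetch_cmd = ["prefetch", accession]
    dump_cmd = ["fasterq-dump", "--split-files", "--outdir", outdir, accession]
    return prefetch_cmd, dump_cmd

prefetch_cmd, dump_cmd = build_dump_commands("SRR11680226")
# To actually run them (assumes sra-tools is on PATH):
# subprocess.run(prefetch_cmd, check=True)
# subprocess.run(dump_cmd, check=True)
```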

lauraclarke commented 4 years ago

@ESapenaVentura did you try and get the files from the NCBI or just the ENA?

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11680226

ESapenaVentura commented 4 years ago

@lauraclarke I always try to get the files from the ENA warehouse (easier/more consistent to parse the metadata).

From what I've seen so far, there are files that don't make it to ENA (like the one you're pointing out), usually showing the SRR as the only viable option for download (all the others require certified AWS/GCP accounts).
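Whether a run's fastq files have been mirrored to ENA can be checked through the ENA Portal API filereport endpoint: an empty `fastq_ftp` column means the run has not made it to the warehouse and SRA is the only source. A sketch that only builds the query URL (no network call is made here):

```python
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def filereport_url(run_accession: str) -> str:
    """Build the ENA Portal API query reporting the FTP fastq paths for
    a run; if fastq_ftp comes back empty, the run has not been mirrored
    to the ENA warehouse."""
    params = {
        "accession": run_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"

url = filereport_url("SRR11680226")
# Fetch with e.g. urllib.request.urlopen(url) and inspect the fastq_ftp column.
```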

I haven't tried many files with fastq-dump (and derivatives), but the times I've tried with this dataset fastq-dump has given errors, so I'm assuming the files are somehow invalid.

Not sure if we should contact SRA at this point.

lauraclarke commented 4 years ago

Thanks, unless it seems like an uber-valuable dataset it probably isn't worth it.

ESapenaVentura commented 4 years ago

It's only missing 2 out of 40 files now. I have deleted those from the spreadsheet and created a new one:

I think it's worth pushing it forward even if it's missing the files from one donor.

If @lauraclarke could confirm whether we should do it or not and a @ebi-ait/hca-wranglers could do the secondary wrangle, it would be awesome!

lauraclarke commented 4 years ago

Do we know what we have lost, is it more of the same or is it something unique?

ESapenaVentura commented 4 years ago

As stated in the operations meeting, we will proceed with this dataset and flag somehow that it's missing a donor.

Any @ebi-ait/hca-wranglers want to secondary review it?

ESapenaVentura commented 4 years ago

I have just checked and, over the weekend, they have added the last files for the last donor, so the dataset is actually complete. I will update the spreadsheet

ESapenaVentura commented 4 years ago

I have uploaded the spreadsheet with the full metadata under the name GSE149689_ontologies_full_20200824.xlsx

The last file is still uploading to the hca-util area

rays22 commented 4 years ago

Hi @ESapenaVentura, Would you like me to do the secondary wrangling for this dataset?

ESapenaVentura commented 4 years ago

Hi @rays22, @mshadbolt agreed to do the secondary review. Worth chatting with her to see if she has started reviewing, but if not and you both agree, I'm happy for you to take it.

rays22 commented 4 years ago
Aug 18, 2020, 7:20:30 AM    ee694c49-3099-4c46-a5ec-3c5472b307f3    Draft
ESapenaVentura commented 4 years ago

Hi @rays22 thanks for the review.

I am going to proceed without the donor as I am not able to retrieve the R2 file by any means.

The spreadsheet is the same, just without the biomaterials/files associated with the run SRR11680208.

I'll start uploading files now.
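Dropping the rows for a single unavailable run from a spreadsheet can be scripted; a hypothetical pandas sketch (the sheet layout and column name below are assumptions for illustration, not the actual HCA spreadsheet schema):

```python
import pandas as pd

def drop_run(df: pd.DataFrame, run_accession: str,
             column: str = "insdc_run_accessions") -> pd.DataFrame:
    """Return a copy of a sheet with every row tied to one SRA run removed."""
    mask = df[column].astype(str).str.contains(run_accession, na=False)
    return df.loc[~mask].reset_index(drop=True)

# Toy example: two files belong to the unavailable run, one does not
files = pd.DataFrame({
    "file_name": ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz"],
    "insdc_run_accessions": ["SRR11680208", "SRR11680208", "SRR11680226"],
})
kept = drop_run(files, "SRR11680208")
```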

ESapenaVentura commented 4 years ago

Leaving the ticket open as we have a missing donor.

Once we have the ability to update, we should look back into this dataset.

ESapenaVentura commented 3 years ago

This ticket is no longer blocked as in the last sprint @Wkt8 was able to download the last 3 files from the last donor.

Wkt8 commented 3 years ago

When making the update please see this ticket! https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/260 Also make the change to project.project_core.project_description.

ESapenaVentura commented 2 years ago

Done with the update

I also added analysis files

Will need further updates on the library preparation protocols (outdated labels).

ami-day commented 2 years ago

This dataset is suitable for SCEA. @ESapenaVentura I think you submitted this to HCA DCP in the last release in January? Is this the latest version: https://docs.google.com/spreadsheets/d/1zgoOn7fdNt2niSTiLt97vz4NdSMTCvdd/edit#gid=1827791073 or is it a different Google sheet? I will do the pre-conversion via the hca2scea tool and upload the output files to the dataset folder.

ESapenaVentura commented 2 years ago

@ami-day it's a mix of this https://docs.google.com/spreadsheets/d/1C6zoBjmkyGbXV8RUR5l1FML_qzZs_GCL/edit?usp=sharing&ouid=105945877951113459382&rtpof=true&sd=true

and this https://docs.google.com/spreadsheets/d/1zTuYivT6k7-2C_zhXl6WJ_1sjblzjqJm/edit?usp=sharing&ouid=105945877951113459382&rtpof=true&sd=true

One of the donors was not available at the time of the original submission, so it had to be added afterwards in a new submission.

ami-day commented 2 years ago

Assigned E-HCAD-id: E-HCAD-44 Pre-converted files are here: https://drive.google.com/drive/folders/1tvIBJa_hyvcMHVYO9RIRkk1AFb-ONpNT (both above sheets were combined before running the tool)
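Combining two workbooks sheet-by-sheet before running a conversion tool can be done with pandas; a hypothetical sketch (it assumes sheet names match between the two workbooks, and operates on the dicts returned by `pd.read_excel(path, sheet_name=None)`):

```python
import pandas as pd

def combine_sheets(a: dict, b: dict) -> dict:
    """Concatenate two workbooks sheet-by-sheet, dropping duplicate rows
    so entities present in both submissions appear only once."""
    names = list(a) + [n for n in b if n not in a]
    return {
        n: pd.concat([a.get(n, pd.DataFrame()), b.get(n, pd.DataFrame())],
                     ignore_index=True).drop_duplicates().reset_index(drop=True)
        for n in names
    }

# Toy example: the second submission re-lists one donor and adds a new one
a = {"Donor": pd.DataFrame({"biomaterial_id": ["donor1", "donor2"]})}
b = {"Donor": pd.DataFrame({"biomaterial_id": ["donor2", "donor3"]})}
merged = combine_sheets(a, b)
```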

ami-day commented 2 years ago

Manually curated the files. Have uploaded them to Gitlab, merge request is here: https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/292

Note: the E-HCAD ID has changed to E-HCAD-45

idazucchi commented 12 months ago

The update with donor2 never made it to the DCP - can we push this for R33?

rachadele commented 12 months ago

I am working on it; there were 2 misnamed files in the staging area, but I think I figured out how to fix it.

rachadele commented 12 months ago

Someone deleted the submission with the updates, so now we will need to start from scratch, unfortunately.

idazucchi commented 12 months ago

I've manually added the missing donor and analysis files to the first submission to avoid export issues with multiple submissions. Exported data & metadata and filled in the import form.

idazucchi commented 11 months ago

The new donor made it to the browser! I'm checking whether we can drop the matrix that was added to the project prior to the analysis schema, so we don't have a duplicate of GSE149689_matrix.mtx.gz.

idazucchi commented 10 months ago

Removing duplicate file from Data Portal

The same file was described twice:

  1. GSE149689_matrix.mtx.gz was originally added to the project as a supplementary file, as the MVP implementation --> so matrix file + supplementary file metadata + descriptor + links.json
  2. GSE149689_matrix.mtx.gz was added as part of the update as an analysis file

Solution: After some discussion on Slack (see here and here) we've agreed with Nate that the best solution to remove the extra files that come with point 1 is to soft delete and re-import the whole project. All the UUIDs will remain the same, so it will be a small change for the users.
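A duplicate like this can be spotted by grouping the project's file metadata entries by file name; a hypothetical sketch (the flat entry structure below is an assumption for illustration, not the actual ingest schema):

```python
from collections import defaultdict

def find_duplicate_files(entries: list) -> dict:
    """Group metadata entries by file name and return the names that are
    described more than once, with the schema type of each description."""
    by_name = defaultdict(list)
    for entry in entries:
        by_name[entry["file_name"]].append(entry["described_by"])
    return {name: types for name, types in by_name.items() if len(types) > 1}

# Toy example mirroring this project: one matrix described twice
entries = [
    {"file_name": "GSE149689_matrix.mtx.gz", "described_by": "supplementary_file"},
    {"file_name": "GSE149689_matrix.mtx.gz", "described_by": "analysis_file"},
    {"file_name": "SRR11680226_R1.fastq.gz", "described_by": "sequence_file"},
]
dups = find_duplicate_files(entries)
```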

I'm filling out the import form and requesting the soft deletion.

idazucchi commented 9 months ago

The proposed solution could impact the DCP-generated matrices for this project - this issue is discussed here along with a similar case, so I'll remove the ticket from the operations board.

idazucchi commented 6 months ago

Issue resolved.