ESapenaVentura closed this issue 6 months ago
Metadata is valid: https://ui.ingest.archive.data.humancellatlas.org/submissions/detail?id=5f180039fe9c934c8b8343ef
Data files are being uploaded. The s3 bucket has been recorded in a google doc within the folder.
@ebi-ait/hca-wranglers Anyone wants to be the secondary wrangler and review this dataset?
This dataset is blocked due to some (6/40) data files not being available.
I tried downloading those files manually with fasterq-dump, but it gives the following error for one of the files after an hour and a half:
2020-07-23T09:41:34 fasterq-dump.2.10.8 err: cmn_iter.c cmn_read_String( #694054913 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-07-23T09:41:35 fasterq-dump.2.10.8 err: cmn_iter.c cmn_read_uint8_array( #152199169 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-07-23T09:41:35 fasterq-dump.2.10.8 err: row #152199169 : READ.len(119) != QUALITY.len(0) (F)
2020-07-23T09:41:35 fasterq-dump.2.10.8 fatal: SIGNAL - Segmentation fault
fasterq-dump quit with error code 1
This leads me to think the data files have not been replicated to the ENA warehouse or made directly accessible through SRA because they are corrupt, and this is a pretty new dataset.
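For anyone retrying these downloads, capturing the tool's stderr makes it easier to tell a transient `rcTimeout` from a genuine READ/QUALITY length mismatch, which points at bad data. A minimal sketch (the command is passed in, so it isn't tied to fasterq-dump; the heuristic strings come from the log above):

```python
import subprocess

def run_and_capture(cmd):
    """Run a download command and return (exit_code, stderr_text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr

def looks_corrupt(stderr_text):
    """Heuristic: a READ.len/QUALITY.len mismatch suggests corrupt data,
    while a timeout alone may just be a flaky connection worth retrying."""
    return "READ.len" in stderr_text and "QUALITY.len" in stderr_text
```

For example, `code, err = run_and_capture(["fasterq-dump", "SRR11680208"])`, then retry only when `not looks_corrupt(err)`.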
@ESapenaVentura did you try and get the files from the NCBI or just the ENA?
@lauraclarke I always try to get the files from the ENA warehouse (easier/more consistent to parse metadata).
From what I've seen so far, there are files that don't make it to ENA (like the one you're pointing out), usually displaying SRR as the only viable option for download (all the others require certified AWS/GCP accounts).
I haven't tried many files with fastq-dump (and derivatives), but the times I've tried with this dataset fastq-dump has given errors, so I'm assuming the files are somehow invalid.
Not sure if at this point we should contact SRA
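One quick way to check whether a run's fastqs have actually made it to ENA is the ENA Portal filereport API: an empty `fastq_ftp` field suggests the run hasn't been mirrored yet. A minimal sketch (the parsing assumes the API's default tab-separated layout):

```python
def ena_filereport_url(accession):
    """Build the ENA Portal API filereport URL for a run accession."""
    return (
        "https://www.ebi.ac.uk/ena/portal/api/filereport"
        f"?accession={accession}&result=read_run&fields=fastq_ftp"
    )

def fastq_links(tsv_text):
    """Parse the TSV response: a header line, then one row per run.
    Returns the list of fastq FTP URLs (empty if nothing is replicated)."""
    lines = tsv_text.strip().splitlines()
    if len(lines) < 2:
        return []
    header = lines[0].split("\t")
    idx = header.index("fastq_ftp")
    row = lines[1].split("\t")
    field = row[idx] if idx < len(row) else ""
    return [url for url in field.split(";") if url]
```

Fetch the URL (e.g. with `urllib.request`) and pass the response text to `fastq_links`.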
Thanks, unless it seems like an uber-valuable dataset it probably isn't worth it.
It's only missing 2 out of 40 files now. I have deleted those from the spreadsheet and created a new one:
_without_SRR11680208
I think it's worth pushing it even if it's missing the files from one donor.
If @lauraclarke could confirm whether we should do it or not and a @ebi-ait/hca-wranglers could do the secondary wrangle, it would be awesome!
Do we know what we have lost, is it more of the same or is it something unique?
As stated in the operations meeting, we will proceed with this dataset and flag somehow that it's missing a donor.
Any @ebi-ait/hca-wranglers want to secondary review it?
I have just checked and, over the weekend, they have added the last files for the last donor, so the dataset is actually complete. I will update the spreadsheet
I have uploaded the spreadsheet with the full metadata under the name GSE149689_ontologies_full_20200824.xlsx
The last file is still uploading to the hca-util area
Hi @ESapenaVentura, Would you like me to do the secondary wrangling for this dataset?
Hi @rays22, @mshadbolt agreed to do the secondary review; it's worth chatting with her to see if she has started, but if not and you both agree, I'm happy for you to take it.
Spreadsheet version GSE149689_ontologies_full_20200824.xlsx passed all the validation tests by the ingest-graph-validator.
I have created an experimental design graph of GSE149689_ontologies_full_20200824.xlsx using the ingest-graph-validator. It looks all right to me:
https://drive.google.com/file/d/1a0QAwVy_qQ4nPCkktthxdcLd9E-4oDTn/view?usp=sharing
I can confirm that the correct ontologies have been used in the metadata spreadsheet.
I have not checked the data files.
I have not checked the Ingest UI submission:
Aug 18, 2020, 7:20:30 AM ee694c49-3099-4c46-a5ec-3c5472b307f3 Draft
Hi @rays22 thanks for the review.
I am going to proceed without the donor as I am not able to retrieve the R2 file by any means.
The spreadsheet is the same, just without the biomaterials/files associated with run SRR11680208.
I'll start uploading files now.
Leaving the ticket open as we have a missing donor.
Once we have the ability to update, we should look back into this dataset.
This ticket is no longer blocked as in the last sprint @Wkt8 was able to download the last 3 files from the last donor.
When making the update please see this ticket! https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/260 Also make the change to project.project_core.project_description.
Done with the update
I also added analysis files
Will need further updates on the library preparation protocols (outdated labels).
This dataset is suitable for SCEA. @ESapenaVentura I think you submitted this to HCA DCP last release in January? Is this the latest version: https://docs.google.com/spreadsheets/d/1zgoOn7fdNt2niSTiLt97vz4NdSMTCvdd/edit#gid=1827791073, or is it a different Google sheet? I will do the pre-conversion via the hca2scea tool and upload the output files to the dataset folder.
@ami-day it's a mix of this https://docs.google.com/spreadsheets/d/1C6zoBjmkyGbXV8RUR5l1FML_qzZs_GCL/edit?usp=sharing&ouid=105945877951113459382&rtpof=true&sd=true
one of the donors was not available at the time of the original submission, so it had to be added afterwards in a new submission
Assigned E-HCAD-id: E-HCAD-44 Pre-converted files are here: https://drive.google.com/drive/folders/1tvIBJa_hyvcMHVYO9RIRkk1AFb-ONpNT (both above sheets were combined before running the tool)
Manually curated the files. Have uploaded them to Gitlab, merge request is here: https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/292
Note: E-HCAD has changed to E-HCAD-45
the update with donor2 never made it to the DCP - can we push this for R33?
I am working on it; there were 2 misnamed files in the staging area, but I think I figured out how to fix it.
Someone deleted the submission with the updates, so unfortunately we will need to start from scratch.
I've manually added the missing donor and analysis files to the first submission to avoid export issues with multiple submissions. Exported data & metadata and filled in the import form.
The new donor made it to the browser! I'm checking if we can drop the matrix that was added to the project prior to the analysis schema, so we don't have a duplicate of GSE149689_matrix.mtx.gz.
The same file was described twice:
Solution: after some discussion on Slack (see here and here), we've agreed with Nate that the best solution to remove the extra files from point 1 is to soft delete and re-import the whole project. All the UUIDs will remain the same, so it will be a small change for the users.
I'm filling out the import form and requesting the soft deletion
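As a sanity check before the re-import, file entries described more than once can be spotted with a simple group-by on filename. A sketch over a hypothetical list of file records (the `file_name`/`uuid` keys are assumptions for illustration, not the real ingest schema):

```python
from collections import defaultdict

def duplicated_files(file_records):
    """Group file records by filename and return the names seen more
    than once, mapped to the UUIDs that describe them."""
    by_name = defaultdict(list)
    for record in file_records:
        by_name[record["file_name"]].append(record["uuid"])
    return {name: uuids for name, uuids in by_name.items() if len(uuids) > 1}
```

Any entry in the result (e.g. GSE149689_matrix.mtx.gz described under two UUIDs) is a candidate for the soft delete.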
The proposed solution could impact the DCP-generated matrices for this project. This issue is discussed here along with a similar case, so I'll remove the ticket from the operations board.
issue resolved
Short name: pbmcCov19Flu
Primary Wrangler: Enrique
Secondary Wrangler: Ray
Associated files:
* Google Drive: https://drive.google.com/drive/folders/1pPN9F9YGuw9bMHQBUi7wl18x53V0V_sO
* Ingest
Key Events
Please track the below as well as the key events: