ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE145926 - Covid19BALFLandscape #1111

Open arschat opened 1 year ago

arschat commented 1 year ago

Project short name: Covid19BALFLandscape

Primary Wrangler:

Arsenios

Secondary Wrangler:

Ida

Associated files

Published study links

Key Events

arschat commented 1 year ago

The RDS analysis file includes data from another publicly available dataset (here), which has not been wrangled in the DCP and is not eligible.

I will try to find metadata for that specific donor and include it in the spreadsheet.

arschat commented 1 year ago

No metadata available for the extra donor. Used a dummy donor/specimen & CS for this.

There is ambiguity about the sequence machine used.

The constructed library was sequenced on a BGI MGISEQ-2000 or Illumina platform.

In GEO only BGI is mentioned; however, it is not in the EFO ontology for high-throughput sequencers. It does exist in the GENEPIO ontology: http://purl.obolibrary.org/obo/GENEPIO_0100144.

An ontology request should be made.
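For future term checks, the OLS search API can quickly confirm whether a label is already in EFO before filing a request. A minimal sketch, assuming the OLS4 search endpoint and its Solr-style response shape (both worth verifying against the current OLS docs; the canned payloads below are illustrative):

```python
from urllib.parse import urlencode

# OLS4 search endpoint (assumed; check the current OLS documentation)
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def ols_search_url(label, ontology="efo"):
    """Build an OLS search URL that checks whether a label exists in an ontology."""
    return OLS_SEARCH + "?" + urlencode({"q": label, "ontology": ontology, "exact": "true"})

def extract_obo_ids(payload):
    """Pull OBO ids out of a Solr-style OLS search response."""
    return [doc.get("obo_id") for doc in payload.get("response", {}).get("docs", [])]

# Canned responses for illustration: term missing from EFO vs present in GENEPIO
missing = {"response": {"numFound": 0, "docs": []}}
present = {"response": {"numFound": 1, "docs": [{"obo_id": "GENEPIO:0100144"}]}}
```

An empty `docs` list is the "file an ontology request" signal; a hit gives the OBO id to reuse in the spreadsheet.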

arschat commented 1 year ago

CellxGene wrangling requirements

(genes are in gene symbol format in the count matrices)
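Since the CELLxGENE schema expects Ensembl gene ids rather than symbols, the matrices will need remapping at some point. A minimal sketch of the bookkeeping, keeping unmapped symbols visible for review (the toy mapping is illustrative only, not real annotation):

```python
def symbols_to_ensembl(genes, mapping):
    """Split a list of gene symbols into (symbol, ensembl_id) pairs and
    a list of symbols with no mapping, preserving matrix row order."""
    mapped, unmapped = [], []
    for gene in genes:
        if gene in mapping:
            mapped.append((gene, mapping[gene]))
        else:
            unmapped.append(gene)
    return mapped, unmapped

# Toy mapping for illustration only -- real wrangling would use a full
# GENCODE/Ensembl annotation table, not a hand-written dict.
toy_mapping = {"CD8A": "ENSG00000153563"}
mapped, unmapped = symbols_to_ensembl(["CD8A", "NOT_A_GENE"], toy_mapping)
```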

arschat commented 1 year ago

Ontology request needed, leave for next release.

arschat commented 1 year ago

Ontology request made, but even once the new term is added, an OLS update will be needed before we can proceed.

arschat commented 1 year ago

Ontology term added, will be available in EFO in next release. BGI MGISEQ-2000 -> EFO:0700018

idazucchi commented 1 year ago

this dataset is potentially affected by the deletion of data from ncbi-cloud-data bucket - can you check @arschat ?

arschat commented 1 year ago

Files had already been uploaded in the hca-util area 9d41ab58-c57d-4804-926f-3d63275ed913

arschat commented 1 year ago

Will try to push the project forward despite the stalled ontology request.

arschat commented 1 year ago

Authors replied:

We use BGISEQ-500 to sequence.

Therefore we should proceed with the other option.

arschat commented 1 year ago

Uploaded the file with the insdc_project_accession and insdc_study_accession values swapped, but fixed that through the API with the following code:

```python
from hca_ingest.api.ingestapi import IngestApi

token = ""
api = IngestApi(url="https://api.ingest.archive.data.humancellatlas.org/")
api.set_token(token)

project_url = "https://api.ingest.archive.data.humancellatlas.org/projects/60068ceec9762f5f0de9f719"
headers_json = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token}

results = api.get(project_url).json()
results['content']['insdc_project_accessions'] = ['SRP281979']
results['content']['insdc_study_accessions'] = ['PRJNA662785']
api.put(project_url, headers=headers_json, json=results)
```

Seems correct in ingest now.

Waiting for the file transfer; if the graph is valid, it will be ready for secondary review.

arschat commented 1 year ago

Some files in the upload area seem to be invalid: the size of some fastq.gz files does not match the size listed in NCBI, probably because R1 was duplicated over R2 of the same library. I requested a new NCBI cloud transfer.
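A quick local check for the suspected R1-over-R2 duplication is to compare checksums within each read pair; identical contents in both mates almost certainly mean one file was copied over the other. A sketch, assuming the `*_R1.fastq.gz` / `*_R2.fastq.gz` naming pattern (function names are mine):

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk=1 << 20):
    """Checksum a file in fixed-size chunks so large fastqs never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()

def find_duplicated_pairs(fastq_dir):
    """Return (R1, R2) file name pairs whose contents are byte-identical."""
    pairs = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1.fastq.gz")):
        r2 = r1.with_name(r1.name.replace("_R1", "_R2"))
        if r2.exists() and md5sum(r1) == md5sum(r2):
            pairs.append((r1.name, r2.name))
    return pairs
```

Comparing these checksums against the md5s published in the SRA run browser would also catch the size mismatches mentioned above.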

arschat commented 1 year ago

Downloaded SRA Lite fastqs instead of the original files. If I redownload the correct files, I will exceed my monthly limit. Will ask another wrangler to download these SRA files through their account.

arschat commented 12 months ago

Thanks to Ida, correct files have been downloaded.

arschat commented 12 months ago

Re-triggered validation using the script here; it seems stuck on the same file (C51_R2.fastq.gz).

arschat commented 12 months ago

Deleted the submission & re-submitted.

idazucchi commented 11 months ago

Nice job! I have a few suggestions for information you can add

Project

you can add the visualisation portal to the supplementary links

Donor

Specimen

CS

Sequence file

Analysis protocol

Analysis files

arschat commented 11 months ago

About fastq compression: the data files were very large (~4 TB) and were transferred between s3 buckets (ncbi to hca-util to upload-prod), so to compress them they would have to be downloaded locally or to EC2, compressed, and re-uploaded, which would take a lot of time. Since there were also previous problems with file validation, I did not want to touch those files more than necessary. For those reasons I skipped the fastq compression.
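For the record, if compression ever becomes unavoidable, it can at least be done in a streaming fashion on an EC2 box so memory use stays flat regardless of file size; the bytes still have to transit the machine, which is the cost described above. A minimal sketch (function name is mine):

```python
import gzip
import shutil

def gzip_stream(src, dst, chunk=1 << 20):
    """Compress src into dst in fixed-size chunks, so memory use stays
    flat even for multi-GB fastq files."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=chunk)
```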

All other changes were submitted (Donor timecourse -> Symptoms to Outcome / Specimen timecourse -> Symptoms to Sampling date). Did not find another term for the TCR files.

The submission is now in Submitted state; once it is exported I will send the import form.

arschat commented 10 months ago

Verified in the browser; however, sequencing_protocol.instrument_manufacturer_model needs an update once a new OLS release is available. BGI MGISEQ-2000 -> EFO:0700018
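Once the new EFO release reaches OLS, the protocol can be patched through the ingest API with the same GET/PUT pattern used earlier for the accession swap. A sketch of the content change only (the field shape follows the HCA metadata ontology module as I understand it -- text/ontology/ontology_label -- and the helper name is mine):

```python
def set_instrument(protocol_content, label, ontology_id):
    """Return a copy of the sequencing protocol content with the
    instrument_manufacturer_model ontology module filled in."""
    updated = dict(protocol_content)
    updated["instrument_manufacturer_model"] = {
        "text": label,
        "ontology": ontology_id,
        "ontology_label": label,
    }
    return updated

content = {"instrument_manufacturer_model": {"text": "BGI MGISEQ-2000"}}
patched = set_instrument(content, "BGI MGISEQ-2000", "EFO:0700018")
```

The `patched` dict would then go back to ingest via a PUT on the protocol URL, as in the accession-fix snippet above.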

arschat commented 5 days ago

Since the OLS update is complete, we can now add the correct sequencer information.

arschat commented 5 days ago