ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE145926 - Covid19BALFLandscape #1111

Open arschat opened 1 year ago

arschat commented 1 year ago

Project short name: Covid19BALFLandscape

Primary Wrangler:

Arsenios

Secondary Wrangler:

Ida

Associated files

Published study links

Key Events

arschat commented 1 year ago

The RDS analysis file includes data from another publicly available dataset (here), which was not wrangled in the DCP and is not eligible.

I will try to find metadata for the specific donor and include them in the spreadsheet.

arschat commented 1 year ago

No metadata available for the extra donor. Used a dummy donor/specimen & CS for this.

There is ambiguity about the sequence machine used.

The constructed library was sequenced on a BGI MGISEQ-2000 or Illumina platform.

In GEO, only BGI is mentioned; however, it is not in the EFO ontology under high throughput sequencers. It does exist in the GENEPIO ontology: http://purl.obolibrary.org/obo/GENEPIO_0100144.
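Checks like this can be scripted against the OLS search API. A minimal sketch, assuming the public OLS4 search endpoint and its usual response shape (`response.numFound` / `response.docs`); the canned empty response below stands in for what an MGISEQ-2000 query against EFO returned at the time:

```python
# Sketch of checking whether a sequencer term exists in EFO via the OLS
# search API (endpoint and response shape assumed from public OLS docs).
from urllib.parse import urlencode

OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def ols_search_url(query: str, ontology: str = "efo") -> str:
    """Build an OLS search URL restricted to one ontology."""
    return OLS_SEARCH + "?" + urlencode({"q": query, "ontology": ontology})

def found_in_ontology(search_response: dict) -> bool:
    """True if an (already fetched) OLS search response has any hits."""
    return search_response.get("response", {}).get("numFound", 0) > 0

# Canned response standing in for a live query with zero EFO hits:
empty = {"response": {"numFound": 0, "docs": []}}
print(ols_search_url("MGISEQ-2000"))
print(found_in_ontology(empty))  # False -> an ontology request is needed
```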

An ontology request should be made.

arschat commented 1 year ago

CellxGene wrangling requirements

(genes are in gene symbol format in the count matrices)
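Since CellxGene expects Ensembl gene IDs rather than symbols, the matrix index has to be remapped. A hedged sketch of that step; the two-entry mapping dict is a hypothetical stand-in for a real GTF/BioMart-derived table:

```python
# Sketch: remap gene symbols to Ensembl IDs for CellxGene requirements.
# symbol_to_ensembl is a tiny hypothetical stand-in for the full mapping.
symbol_to_ensembl = {
    "CD4": "ENSG00000010610",
    "CD8A": "ENSG00000153563",
}

def remap_genes(symbols):
    """Map symbols to Ensembl IDs, tracking symbols that fail to map."""
    mapped, unmapped = [], []
    for s in symbols:
        if s in symbol_to_ensembl:
            mapped.append(symbol_to_ensembl[s])
        else:
            unmapped.append(s)
    return mapped, unmapped

mapped, unmapped = remap_genes(["CD4", "CD8A", "NOT-A-GENE"])
print(mapped)    # ['ENSG00000010610', 'ENSG00000153563']
print(unmapped)  # ['NOT-A-GENE']
```

Tracking the unmapped symbols separately makes it easy to decide whether to drop them or look them up by alias.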

arschat commented 1 year ago

Ontology request needed, leave for next release.

arschat commented 1 year ago

Ontology request made, but even if the new term is added, an OLS update is needed before we can proceed.

arschat commented 1 year ago

Ontology term added, will be available in EFO in next release. BGI MGISEQ-2000 -> EFO:0700018
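For reference, once the term lands in EFO the CURIE expands to a standard PURL; a minimal sketch of that expansion, assuming the usual EFO IRI convention:

```python
def efo_curie_to_iri(curie: str) -> str:
    """Expand an EFO CURIE (e.g. 'EFO:0700018') to its standard PURL."""
    prefix, local = curie.split(":")
    assert prefix == "EFO", "sketch only handles EFO terms"
    return f"http://www.ebi.ac.uk/efo/EFO_{local}"

print(efo_curie_to_iri("EFO:0700018"))
# http://www.ebi.ac.uk/efo/EFO_0700018
```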

idazucchi commented 1 year ago

this dataset is potentially affected by the deletion of data from ncbi-cloud-data bucket - can you check @arschat ?

arschat commented 1 year ago

Files had already been uploaded to the hca-util area 9d41ab58-c57d-4804-926f-3d63275ed913

arschat commented 10 months ago

Will try to push the project forward despite the stalled ontology request.

arschat commented 10 months ago

Authors replied:

We used BGISEQ-500 to sequence.

Therefore we should proceed with the other option.

arschat commented 10 months ago

Uploaded the file with insdc_project_accession and insdc_study_accession swapped, but fixed that with the following code through the API:

```python
from hca_ingest.api.ingestapi import IngestApi

token = ""
api = IngestApi(url="https://api.ingest.archive.data.humancellatlas.org/")
api.set_token(token)

project_url = "https://api.ingest.archive.data.humancellatlas.org/projects/60068ceec9762f5f0de9f719"
headers_json = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token}

results = api.get(project_url).json()
results['content']['insdc_project_accessions'] = ['SRP281979']
results['content']['insdc_study_accessions'] = ['PRJNA662785']
api.put(project_url, headers=headers_json, json=results)
```

Seems correct in ingest now.
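A swap like this can also be caught mechanically, since SRA project accessions start with SRP/ERP/DRP while BioProject study accessions start with PRJ. A hypothetical checker over the project content dict (field names as in the ingest payload above):

```python
# Sketch: sanity-check that insdc project/study accessions are not swapped,
# using the standard INSDC accession prefixes.
def accessions_look_swapped(content: dict) -> bool:
    projects = content.get("insdc_project_accessions", [])
    studies = content.get("insdc_study_accessions", [])
    bad_project = any(not p.startswith(("SRP", "ERP", "DRP")) for p in projects)
    bad_study = any(not s.startswith("PRJ") for s in studies)
    return bad_project or bad_study

fixed = {"insdc_project_accessions": ["SRP281979"],
         "insdc_study_accessions": ["PRJNA662785"]}
swapped = {"insdc_project_accessions": ["PRJNA662785"],
           "insdc_study_accessions": ["SRP281979"]}
print(accessions_look_swapped(fixed))    # False
print(accessions_look_swapped(swapped))  # True
```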

Waiting for the file transfer; if the graph is valid, it will be ready for secondary review.

arschat commented 10 months ago

Some files in the upload area seem to be invalid: the size of some fastq.gz files does not match the size listed in NCBI, probably R1 duplicated to R2 of the same library. I requested a new NCBI cloud transfer.
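A duplicated R1/R2 pair can be confirmed by comparing checksums rather than just sizes (a size match alone can be a coincidence). A small sketch; the throwaway demo files stand in for the real fastq.gz pair:

```python
# Sketch: detect an R1 accidentally duplicated as R2 by comparing checksums.
import hashlib

def md5sum(path: str, chunk: int = 1 << 20) -> str:
    """Stream an MD5 checksum so large fastq files never load into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def pair_is_duplicated(r1_path: str, r2_path: str) -> bool:
    return md5sum(r1_path) == md5sum(r2_path)

# Demo with throwaway files standing in for the real R1/R2 pair:
with open("r1.fq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n")
with open("r2.fq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n")
print(pair_is_duplicated("r1.fq", "r2.fq"))  # True -> bad transfer
```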

arschat commented 10 months ago

Downloaded SRA Lite fastqs instead of the original files. If I re-download the correct files I will exceed my monthly limit, so I will ask another wrangler to download these SRA files through their account.

arschat commented 9 months ago

Thanks to Ida, correct files have been downloaded.

arschat commented 9 months ago

Re-triggered validation using the script here; it seems stuck on the same file (C51_R2.fastq.gz).

arschat commented 9 months ago

Deleted the submission & re-submitted.

idazucchi commented 9 months ago

Nice job! I have a few suggestions for information you can add

Project

you can add the visualisation portal to the supplementary links

Donor

Specimen

CS

Sequence file

Analysis protocol

Analysis files

arschat commented 9 months ago

About fastq compression. The data files were very large (~4 TB) and were transferred between s3 buckets (ncbi to hca-util to upload-prod); to compress them they would have to be downloaded locally or onto EC2, compressed and re-uploaded, which would take a lot of time. Since there were also previous problems with file validation, I didn't want to play too much with those files. For those reasons I skipped the fastq compression.
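For future reference, the compression step itself is cheap once the data is staged; what dominates is the S3 round trip. A minimal sketch of streaming gzip compression in constant memory (the demo fastq is a stand-in for a real staged file):

```python
# Sketch: gzip a fastq in streaming fashion so even very large files never
# load into memory; the S3 download/upload around this step is what costs time.
import gzip
import shutil

def compress_fastq(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, constant memory

# Demo with a tiny stand-in file:
with open("demo.fastq", "w") as f:
    f.write("@read1\nACGT\n+\nIIII\n")
compress_fastq("demo.fastq", "demo.fastq.gz")
print(gzip.open("demo.fastq.gz", "rt").readline().strip())  # @read1
```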

All other changes have been submitted (Donor timecourse -> Symptoms to Outcome; Specimen timecourse -> Symptoms to Sampling date). Did not find another term for the TCR files.

The submission is now in Submitted state; when it is exported I will send the import form.

arschat commented 8 months ago

Verified in the browser; however, sequencing_protocol.instrument_manufacturer_model needs an update when a new OLS release is available. BGI MGISEQ-2000 -> EFO:0700018