ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE145926 - Covid19BALFLandscape #1111

Open arschat opened 1 year ago

arschat commented 1 year ago

Project short name: Covid19BALFLandscape

Primary Wrangler:

Arsenios

Secondary Wrangler:

Ida

Associated files

Published study links

Key Events

arschat commented 1 year ago

The RDS analysis file includes data from another publicly available dataset (here), which was not wrangled in the DCP and is not eligible.

I will try to find metadata for the specific donor and include them in the spreadsheet.

arschat commented 1 year ago

No metadata available for the extra donor. Used a dummy donor/specimen & CS for this.

There is ambiguity about the sequence machine used.

The constructed library was sequenced on a BGI MGISEQ-2000 or Illumina platform.

In GEO, only BGI is mentioned; however, it is not in the EFO ontology under high throughput sequencers. It does exist in the GENEPIO ontology: http://purl.obolibrary.org/obo/GENEPIO_0100144.
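Checks like this can be scripted against the OLS search API. A minimal sketch, assuming the public OLS4 search endpoint and its usual response shape (`response.numFound` / `response.docs`); the canned empty response below stands in for what an MGISEQ-2000 query against EFO returned at the time:

```python
# Sketch of checking whether a sequencer term exists in EFO via the OLS
# search API (endpoint and response shape assumed from public OLS docs).
from urllib.parse import urlencode

OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def ols_search_url(query: str, ontology: str = "efo") -> str:
    """Build an OLS search URL restricted to one ontology."""
    return OLS_SEARCH + "?" + urlencode({"q": query, "ontology": ontology})

def found_in_ontology(search_response: dict) -> bool:
    """True if an (already fetched) OLS search response has any hits."""
    return search_response.get("response", {}).get("numFound", 0) > 0

# Canned response standing in for a live query with zero EFO hits:
empty = {"response": {"numFound": 0, "docs": []}}
print(ols_search_url("MGISEQ-2000"))
print(found_in_ontology(empty))  # False -> an ontology request is needed
```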

An ontology request should be made.

arschat commented 1 year ago

CellxGene wrangling requirements

(genes are in gene symbol format in the count matrices)
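Since CellxGene expects Ensembl gene IDs rather than symbols, the matrix index has to be remapped. A hedged sketch of that step; the two-entry mapping dict is a hypothetical stand-in for a real GTF/BioMart-derived table:

```python
# Sketch: remap gene symbols to Ensembl IDs for CellxGene requirements.
# symbol_to_ensembl is a tiny hypothetical stand-in for the full mapping.
symbol_to_ensembl = {
    "CD4": "ENSG00000010610",
    "CD8A": "ENSG00000153563",
}

def remap_genes(symbols):
    """Map symbols to Ensembl IDs, tracking symbols that fail to map."""
    mapped, unmapped = [], []
    for s in symbols:
        if s in symbol_to_ensembl:
            mapped.append(symbol_to_ensembl[s])
        else:
            unmapped.append(s)
    return mapped, unmapped

mapped, unmapped = remap_genes(["CD4", "CD8A", "NOT-A-GENE"])
print(mapped)    # ['ENSG00000010610', 'ENSG00000153563']
print(unmapped)  # ['NOT-A-GENE']
```

Tracking the unmapped symbols separately makes it easy to decide whether to drop them or look them up by alias.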

arschat commented 1 year ago

Ontology request needed, leave for next release.

arschat commented 1 year ago

Ontology request made, but even if the new term is added, an OLS update is needed before we can proceed.

arschat commented 1 year ago

Ontology term added, will be available in EFO in next release. BGI MGISEQ-2000 -> EFO:0700018
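For reference, once the term lands in EFO the CURIE expands to a standard PURL; a minimal sketch of that expansion, assuming the usual EFO IRI convention:

```python
def efo_curie_to_iri(curie: str) -> str:
    """Expand an EFO CURIE (e.g. 'EFO:0700018') to its standard PURL."""
    prefix, local = curie.split(":")
    assert prefix == "EFO", "sketch only handles EFO terms"
    return f"http://www.ebi.ac.uk/efo/EFO_{local}"

print(efo_curie_to_iri("EFO:0700018"))
# http://www.ebi.ac.uk/efo/EFO_0700018
```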

idazucchi commented 1 year ago

this dataset is potentially affected by the deletion of data from ncbi-cloud-data bucket - can you check @arschat ?

arschat commented 1 year ago

Files had already been uploaded to the hca-util area 9d41ab58-c57d-4804-926f-3d63275ed913

arschat commented 10 months ago

Will try to push the project forward despite the stalled ontology request.

arschat commented 10 months ago

Authors replied:

We used BGISEQ-500 to sequence.

Therefore we should proceed with the other option.

arschat commented 10 months ago

Uploaded the file with insdc_project_accession and insdc_study_accession swapped, but fixed that with the following code through the API:

```python
from hca_ingest.api.ingestapi import IngestApi

token = ""
api = IngestApi(url="https://api.ingest.archive.data.humancellatlas.org/")
api.set_token(token)

project_url = "https://api.ingest.archive.data.humancellatlas.org/projects/60068ceec9762f5f0de9f719"
headers_json = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token}

results = api.get(project_url).json()
results['content']['insdc_project_accessions'] = ['SRP281979']
results['content']['insdc_study_accessions'] = ['PRJNA662785']
api.put(project_url, headers=headers_json, json=results)
```

Seems correct in ingest now.
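A swap like this can also be caught mechanically, since SRA project accessions start with SRP/ERP/DRP while BioProject study accessions start with PRJ. A hypothetical checker over the project content dict (field names as in the ingest payload above):

```python
# Sketch: sanity-check that insdc project/study accessions are not swapped,
# using the standard INSDC accession prefixes.
def accessions_look_swapped(content: dict) -> bool:
    projects = content.get("insdc_project_accessions", [])
    studies = content.get("insdc_study_accessions", [])
    bad_project = any(not p.startswith(("SRP", "ERP", "DRP")) for p in projects)
    bad_study = any(not s.startswith("PRJ") for s in studies)
    return bad_project or bad_study

fixed = {"insdc_project_accessions": ["SRP281979"],
         "insdc_study_accessions": ["PRJNA662785"]}
swapped = {"insdc_project_accessions": ["PRJNA662785"],
           "insdc_study_accessions": ["SRP281979"]}
print(accessions_look_swapped(fixed))    # False
print(accessions_look_swapped(swapped))  # True
```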

Waiting for the file transfer; if the graph is valid, it will be ready for secondary review.

arschat commented 10 months ago

Some files in the upload area seem to be invalid: the size of some fastq.gz files does not match the size listed in NCBI, probably R1 duplicated to R2 of the same library. I requested a new NCBI cloud transfer.
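A duplicated R1/R2 pair can be confirmed by comparing checksums rather than just sizes (a size match alone can be a coincidence). A small sketch; the throwaway demo files stand in for the real fastq.gz pair:

```python
# Sketch: detect an R1 accidentally duplicated as R2 by comparing checksums.
import hashlib

def md5sum(path: str, chunk: int = 1 << 20) -> str:
    """Stream an MD5 checksum so large fastq files never load into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def pair_is_duplicated(r1_path: str, r2_path: str) -> bool:
    return md5sum(r1_path) == md5sum(r2_path)

# Demo with throwaway files standing in for the real R1/R2 pair:
with open("r1.fq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n")
with open("r2.fq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n")
print(pair_is_duplicated("r1.fq", "r2.fq"))  # True -> bad transfer
```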

arschat commented 10 months ago

Downloaded SRA Lite fastqs instead of the original files. If I re-download the correct files I will exceed my monthly limit, so I will ask another wrangler to download these SRA files through their account.

arschat commented 9 months ago

Thanks to Ida, correct files have been downloaded.

arschat commented 9 months ago

Re-triggered validation using the script here; it seems stuck on the same file (C51_R2.fastq.gz).

arschat commented 9 months ago

Deleted the submission & re-submitted.

idazucchi commented 9 months ago

Nice job! I have a few suggestions for information you can add

Project

you can add the visualisation portal to the supplementary links

Donor

Specimen

CS

Sequence file

Analysis protocol

Analysis files

arschat commented 9 months ago

About fastq compression. The data files were very large (~4 TB) and were transferred between s3 buckets (ncbi to hca-util to upload-prod); to compress them they would have to be downloaded locally or onto EC2, compressed and re-uploaded, which would take a lot of time. Since there were also previous problems with file validation, I didn't want to play too much with those files. For those reasons I skipped the fastq compression.
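For future reference, the compression step itself is cheap once the data is staged; what dominates is the S3 round trip. A minimal sketch of streaming gzip compression in constant memory (the demo fastq is a stand-in for a real staged file):

```python
# Sketch: gzip a fastq in streaming fashion so even very large files never
# load into memory; the S3 download/upload around this step is what costs time.
import gzip
import shutil

def compress_fastq(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, constant memory

# Demo with a tiny stand-in file:
with open("demo.fastq", "w") as f:
    f.write("@read1\nACGT\n+\nIIII\n")
compress_fastq("demo.fastq", "demo.fastq.gz")
print(gzip.open("demo.fastq.gz", "rt").readline().strip())  # @read1
```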

All other changes have been submitted (Donor timecourse -> Symptoms to Outcome; Specimen timecourse -> Symptoms to Sampling date). Did not find another term for the TCR files.

The submission is now in Submitted state; when it is exported I will send the import form.

arschat commented 8 months ago

Verified in the browser; however, sequencing_protocol.instrument_manufacturer_model needs an update when a new OLS release is available. BGI MGISEQ-2000 -> EFO:0700018