Prototype process to FAIRify datasets in ArrayExpress and GEO

lauraclarke commented 4 years ago

The goal of this ticket is to prototype a FAIRification process for datasets that are in ArrayExpress and GEO but don't exist in either DCP or SCEA.

Proposed datasets

- https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5060/ - Already in SCEA (E-MTAB-5060)
- Chosen - https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6149/ - Already in SCEA (E-MTAB-6653)
- Chosen - https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8581/ (Added later by @rays22)
- https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6653/
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114156
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109564
- Chosen - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131685
- Chosen - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136103
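The links above follow two URL patterns, so a landing page can be derived from an accession alone. A small sketch in Python, assuming ArrayExpress accessions have the shape `E-XXXX-n` with a four-letter code; `accession_url` is a hypothetical helper, and only the two patterns that appear in this ticket are handled.

```python
import re

# URL patterns copied from the links in this ticket.
ARRAYEXPRESS_URL = "https://www.ebi.ac.uk/arrayexpress/experiments/{accession}/"
GEO_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


def accession_url(accession):
    """Return the landing page URL for an ArrayExpress or GEO series accession."""
    # Assumption: ArrayExpress accessions look like E-MTAB-6149.
    if re.fullmatch(r"E-[A-Z]{4}-\d+", accession):
        return ARRAYEXPRESS_URL.format(accession=accession)
    # GEO series accessions look like GSE136103.
    if re.fullmatch(r"GSE\d+", accession):
        return GEO_URL.format(accession=accession)
    raise ValueError(f"Unrecognised accession: {accession}")


for accession in ("E-MTAB-6149", "GSE136103"):
    print(accession, accession_url(accession))
```

Raising on anything unrecognised keeps typos in accession lists from silently producing broken links.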

Tasks

- [x] Check datasets against suitability criteria for DCP https://data.humancellatlas.org/contribute/contributing-data-suitability
- [x] Check datasets against suitability criteria for SCEA (see email from Silvie Fexova 'Criteria for putting a single cell dataset into the SCEA')
- [x] Define which 4 datasets will be targeted by this ticket
- [x] Convert AE datasets to HCA DCP metadata and data standards
- [x] Convert GEO datasets to HCA DCP metadata and data standards
- [x] Validate newly converted GEO -> HCA metadata
- [x] Validate newly converted AE -> HCA metadata
- [ ] Convert newly converted HCA metadata to SCEA metadata standards
- [ ] Validate newly converted SCEA metadata
- [ ] Submit validated metadata and data to SCEA

Acceptance criteria

- [x] Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards
- [ ] Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards
- [ ] Process definition for the conversion
- [ ] Proposals for increasing the automation of the process

lauraclarke commented 4 years ago

@ami-day @rolando-ebi @javfg @rays22 These are the initial tasks and acceptance criteria for this issue. We will discuss them further in sprint planning, but please let me know here or on Slack if you have any other questions.

lauraclarke commented 4 years ago

We should avoid taking anything the SCEA already have in hand

https://docs.google.com/spreadsheets/d/1n0Ou-8w_CjpBVQ_j2z1yfAjdBIZkxXR4Vsrqik5o81A/edit?usp=sharing

ami-day commented 4 years ago

@javfg @clairerye @rolando-ebi @rays22 when is a good time this afternoon to meet to discuss? how about 3pm?

javfg commented 4 years ago

> We should avoid taking anything the SCEA already have in hand
>
> https://docs.google.com/spreadsheets/d/1n0Ou-8w_CjpBVQ_j2z1yfAjdBIZkxXR4Vsrqik5o81A/edit?usp=sharing

First two proposed datasets are already in SCEA. I updated the Issue with this info.

ami-day commented 4 years ago

Link to explanation of GEO metadata file format: https://www.ncbi.nlm.nih.gov/geo/info/soft.html#format
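For reference, a minimal sketch of reading that format in Python. It relies only on the line prefixes described on the linked page (`^` starts an entity such as a series or sample, `!` introduces an attribute). The function name `parse_soft` and the sample text are made up for illustration, and data table rows are ignored.

```python
from collections import defaultdict


def parse_soft(text):
    """Parse SOFT text into {(entity_type, entity_id): {attribute: [values]}}."""
    entities = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("^"):
            # Entity indicator line, e.g. "^SAMPLE = GSM0000001"
            entity_type, _, entity_id = line[1:].partition("=")
            current = defaultdict(list)
            entities[(entity_type.strip(), entity_id.strip())] = current
        elif line.startswith("!") and current is not None:
            # Attribute line, e.g. "!Sample_organism_ch1 = Homo sapiens".
            # Attributes can repeat, so values are collected in a list.
            key, _, value = line[1:].partition("=")
            current[key.strip()].append(value.strip())
        # Anything else (data table rows, "#" column descriptions) is skipped.
    return entities


# Hypothetical SOFT snippet, not taken from a real accession.
example = """\
^SERIES = GSE0000000
!Series_title = Hypothetical example series
!Series_sample_id = GSM0000001
!Series_sample_id = GSM0000002
^SAMPLE = GSM0000001
!Sample_organism_ch1 = Homo sapiens
"""

parsed = parse_soft(example)
print(parsed[("SERIES", "GSE0000000")]["Series_sample_id"])
```

Keeping every attribute as a list avoids losing repeated keys such as the sample IDs of a series.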

justincc commented 4 years ago

Whiteboard full and closeups from discussion at 2020-02-19:1530

[Images: IMG_20200219_162707, IMG_20200219_162726, IMG_20200219_162728, IMG_20200219_162731]

ami-day commented 4 years ago

@lauraclarke, can we assume the above GEO datasets are fully open and not managed access, given they have a GEO accession?

ami-day commented 4 years ago

Check each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards: GEO datasets

GSE136103

dataset is fully open (green)
healthy liver samples (green) & disease (cirrhotic) samples (amber) - OK to convert all, as it is 1 project
primary tissue (green)
human samples (green) & mouse samples (amber) - OK to convert all, as it is 1 project
data processing pipeline supported - yes (10x v2)

GSE109564 N.B. same publication as GSE114156 and already converted prior to epic

dataset is fully open (green)
healthy kidney donor biopsy (green)
primary tissue (green)
human sample (green)
data processing pipeline supported - no (InDrops)

GSE114156 N.B. same publication as GSE109564 and already converted prior to epic

dataset is fully open (green)
inflamed kidney transplant biopsy (amber) - ok to convert all as GSE109564 & GSE114156 are 1 project
primary tissue (green)
human sample (green)
data processing pipeline supported - no (InDrops)

GSE131685 - already converted prior to epic

dataset is fully open (green)
normal kidney samples (green)
primary tissue (green)
human samples (green)
data processing pipeline supported - yes (10xv2)

Should we convert GSE114156 and GSE109564 into 1 HCA file since they are derived from the same publication/project?
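The traffic-light checks above could be recorded in a structured form, which would make it easier to summarise datasets side by side. A minimal sketch in Python: the `Rating` enum, `DcpCheck` class and `worst_rating` helper are hypothetical names, and recording a supported pipeline as green is an assumption (the summary above only says "yes").

```python
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    """Traffic-light rating; a higher value means more severe."""
    GREEN = 0
    AMBER = 1
    RED = 2


@dataclass
class DcpCheck:
    criterion: str
    rating: Rating
    note: str = ""


def worst_rating(checks):
    """Return the most severe rating among the checks."""
    return max(check.rating for check in checks)


# Ratings transcribed from the GSE136103 summary above.
gse136103 = [
    DcpCheck("access", Rating.GREEN, "fully open"),
    DcpCheck("health status", Rating.AMBER, "healthy and cirrhotic liver"),
    DcpCheck("sample type", Rating.GREEN, "primary tissue"),
    DcpCheck("organism", Rating.AMBER, "human and mouse"),
    # Assumption: "pipeline supported - yes" is recorded as GREEN.
    DcpCheck("pipeline", Rating.GREEN, "10x v2"),
]

print(worst_rating(gse136103).name)  # AMBER
```

Using the worst rating as the headline value means a single amber item is never hidden by several green ones.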

ami-day commented 4 years ago

Check ArrayExpress datasets against suitability criteria for DCP

- [x] E-MTAB-5060 -> E-MTAB-5061 - Single-cell RNA-seq analysis of human pancreas from healthy individuals and type 2 diabetes patients is already in SCEA: https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-5061/results/tsne
  - [x] Consent: GREEN?: open access data: ENA - ERP017126, ArrayExpress - E-MTAB-5060
  - [x] Sample Type: GREEN: primary endocrine and exocrine cell types
  - [x] Health Status: GREEN: healthy individuals and type 2 diabetes patients
  - [x] Organism: GREEN: Homo sapiens
  - [x] Data Processing Pipeline Support: YELLOW: Smart-seq2; single-end reads

The following two datasets are related by citation: Phenotype molding of stromal cells in the lung tumor microenvironment. Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, Bassez A, Decaluwé H, Pircher A, Van den Eynde K, Weynand B, Verbeken E, De Leyn P, Liston A, Vansteenkiste J, Carmeliet P, Aerts S, Thienpont B. Nature Medicine (2018), PMID:29988129

- [x] E-MTAB-6149 - Single cell sequencing of lung carcinoma
  - [x] Consent: GREEN?
  - [x] Sample Type: GREEN: primary tissue samples
  - [x] Health Status: YELLOW: lung carcinoma and normal tissue adjacent to tumour
  - [x] Organism: GREEN: Homo sapiens
  - [x] Data Processing Pipeline Support: YELLOW: Drop-seq
- [x] E-MTAB-6653 - Single cell sequencing of 3 lung carcinomas is already in SCEA: https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6653/results/tsne
  - [x] Consent: GREEN?: open access data: ENA - ERP110453, ArrayExpress - E-MTAB-6149
  - [x] Sample Type: GREEN: primary tissue
  - [x] Health Status: YELLOW: diseased samples and normal tissue adjacent to tumour
  - [x] Organism: GREEN: Homo sapiens
  - [x] Data Processing Pipeline Support: YELLOW: Drop-seq

ami-day commented 4 years ago

Copying and pasting Silvie's email about SCEA requirements here:

"For now we have analysis pipelines for droplet-based (not just 10x, can process drop-seq as well) and smart-like technologies, so that’s the main selection hurdle.

Secondly, the organism must have a reference genome in Ensembl, which obviously is not gonna be an issue for HCA experiments.

Otherwise, we tend to only take experiments that investigate a biological question (not a technical proof-of-principle, comparison of technologies sort of studies, although we may take a subset from those as a baseline description study if the dataset is nice or using an interesting sample).

As to biological replicates, we are not strictly enforcing any limit at the moment but at least three independent biological samples/replicates are preferable (although really not a hard and fast rule).

There is no hard limit for the number of cells per experiment either (I mean the analysis pipeline does have some quality filters of course, including minimal number of cells but it is set very low to be permissive).

Overall, I’d say we’re really quite flexible and inclusive at the moment, so if the technology is right (and the ref genome is there) it’s unlikely the dataset would get rejected. Some of the above is more about prioritising and using our curation time effectively rather than absolute red lines."

ami-day commented 4 years ago

Check each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards (see Silvie email above): GEO datasets

GSE136103

The technology is 10x, drop-seq or a smart-like technology: YES
The organism has a reference genome in ensembl: YES
The experiment investigates a biological question: YES
The study has >= 3 independent biological samples/replicates: YES

GSE109564 - already converted prior to epic

The technology is 10x, drop-seq or a smart-like technology: NO (InDrops)
The organism has a reference genome in ensembl: YES
The experiment investigates a biological question: YES
The study has >= 3 independent biological samples/replicates: NO

GSE114156 - already converted prior to epic

The technology is 10x, drop-seq or a smart-like technology: NO (InDrops)
The organism has a reference genome in ensembl: YES
The experiment investigates a biological question: YES
The study has >= 3 independent biological samples/replicates: NO

GSE131685 - already converted prior to epic

The technology is 10x, drop-seq or a smart-like technology: YES
The organism has a reference genome in ensembl: YES
The experiment investigates a biological question: YES
The study has >= 3 independent biological samples/replicates: YES
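The four yes/no questions used above could be turned into a small screening function. A sketch in Python, assuming the criteria as paraphrased from Silvie's email: `SUPPORTED_TECHNOLOGIES`, `Dataset` and `scea_screen` are hypothetical names, the technology labels are illustrative, and the biological-question and replicate criteria are reported as warnings only because the email describes them as preferences rather than hard rules.

```python
from dataclasses import dataclass

# Assumption: technology labels are free text compared in lower case;
# "smart-seq2" stands in for the "smart-like" family mentioned in the email.
SUPPORTED_TECHNOLOGIES = {"10x", "drop-seq", "smart-seq2"}


@dataclass
class Dataset:
    accession: str
    technology: str
    has_ensembl_reference: bool
    biological_question: bool
    biological_replicates: int


def scea_screen(dataset):
    """Return (blockers, warnings) for a dataset against the SCEA criteria."""
    blockers, warnings = [], []
    if dataset.technology.lower() not in SUPPORTED_TECHNOLOGIES:
        blockers.append(f"technology not supported: {dataset.technology}")
    if not dataset.has_ensembl_reference:
        blockers.append("no reference genome in Ensembl")
    if not dataset.biological_question:
        warnings.append("does not investigate a biological question")
    if dataset.biological_replicates < 3:
        warnings.append("fewer than 3 independent biological replicates")
    return blockers, warnings


# Hypothetical dataset; the values are not taken from a real accession.
example = Dataset("GSE0000000", "InDrops", True, True, 1)
blockers, warnings = scea_screen(example)
print(blockers)
print(warnings)
```

Separating blockers from warnings mirrors the distinction in the email between the technology hurdle and the softer prioritisation criteria.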

lauraclarke commented 4 years ago

@ami-day @rays22 @javfg @rolando-ebi where are we with the datasets which should be targeted by this list? Can all of the top tasks be completed?

rolando-ebi commented 4 years ago

@lauraclarke for "Proposals for increasing the automation of the process", I'm still working on mapping utils, currently testing on a sample dataset

lauraclarke commented 4 years ago

Thanks for the update, @rolando-ebi. I am most interested in the following tasks right now, as it feels like they should be done:

Check datasets against suitability criteria for SCEA (see email from Silvie Fexova 'Criteria for putting a single cell dataset into the SCEA')
Define which 4 datasets will be targeted by this ticket

javfg commented 4 years ago

I am spending three days fixing up as much as possible on the HCA->SCEA side, as we agreed. So far I'm trying to improve how the spreadsheets are extracted.

ami-day commented 4 years ago

@lauraclarke, I checked the 4 GEO datasets against the DCP and SCEA suitability criteria, a summary is shown in my comments above. They are all suitable if there are no "absolute red lines" as Silvie said. The only GEO dataset left to convert to HCA format is GSE136103, so that should be in our list of accessions to convert.

I didn't tick the boxes in the task list above, as I am not sure of the progress regarding the AE accessions. I think we all agreed @rays22 would be working on those, so I will leave him to comment on that, while I continue converting GSE136103.

lauraclarke commented 4 years ago

Thanks @ami-day will GSE109564 and GSE114156 be pushed to SCEA given they can't be analysed?

rays22 commented 4 years ago

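These two ArrayExpress datasets meet the suitability criteria for SCEA:

[x] E-MTAB-8581 - A cell atlas of human thymic development defines T cell repertoire formation --> github #180
[x] E-MTAB-6149 - Single cell sequencing of lung carcinoma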

Define which 4 datasets will be targeted by this ticket

I have selected only 3 datasets, because the other candidates are already in SCEA.

@ami-day please, check my selection of datasets to be targeted:

[x] GSE136103
[x] E-MTAB-8581 - A cell atlas of human thymic development defines T cell repertoire formation --> github #180
[x] E-MTAB-6149 - Single cell sequencing of lung carcinoma

lauraclarke commented 4 years ago

So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue?

Is cell-type annotation available for them? That would be a fantastic stretch goal

javfg commented 4 years ago

The two datasets Ami worked on before the epic are running through the HCA - SCEA converter correctly. But they are marked as "awaiting review", so, do they have to be reviewed before starting that process?

Also, I have some doubts about the protocols in one of them that I want to discuss with you tomorrow, @ami-day.

ami-day commented 4 years ago

> Thanks @ami-day will GSE109564 and GSE114156 be pushed to SCEA given they can't be analysed?

That is something I will need to ask Silvie about. I'll ask and get back. It may be that they will soon be ready to accept other formats, and may still be worth converting but low priority.

lauraclarke commented 4 years ago

Can someone update the ticket with the chosen datasets?

Was any automatic extraction of these datasets from ArrayExpress or GEO tried?

These two tasks have been checked as done, is that true? Are there validation reports both from DCP and SCEA?

- Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards
- Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards

ami-day commented 4 years ago

> So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue?
>
> Is cell-type annotation available for them? That would be a fantastic stretch goal

I think there might have been a misunderstanding here.

"Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards": yes, 3 GEO and 2 AE are

"Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards": I haven't converted any of the above GEO datasets to SCEA format (only HCA format), and I understand @javfg is in the process of doing this. @rays22, are the AE datasets you specified for conversion already converted to the HCA format and SCEA format?

lauraclarke commented 4 years ago

So the GEO and ArrayExpress datasets have been validated using ingest against the HCA metadata standards and passed?

javfg commented 4 years ago

> > So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue? Is cell-type annotation available for them? That would be a fantastic stretch goal
>
> I think there might have been a misunderstanding here.
>
> "Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards": yes, 3 GEO and 2 AE are
>
> "Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards": I haven't converted any of the above GEO datasets to SCEA format (only HCA format), and I understand @javfg is in the process of doing this. @rays22, are the AE datasets you specified for conversion already converted to the HCA format and SCEA format?

I am not yet working on converting them. I am trying to fix the converter so it benefits from what we learned on converting the previous 5 datasets we did for SCEA. I've made some tests and it looks good.

But as I grabbed the xlsx from the "Ami finished - awaiting to be reviewed" folder in the drive, I am not sure if they are ready to start converting right away. I don't think it is a good idea to convert something that might need changes.

ami-day commented 4 years ago

> These two ArrayExpress datasets meet the suitability criteria for SCEA:
>
> [x] E-MTAB-8581 - A cell atlas of human thymic development defines T cell repertoire formation --> github #180 (HumanCellAtlas/hca-data-wrangling#180)
>
> [x] E-MTAB-6149 - Single cell sequencing of lung carcinoma
>
> Define which 4 datasets will be targeted by this ticket
>
> I have selected only 3 datasets, because the other candidates are already in SCEA.
>
> @ami-day please, check my selection of datasets to be targeted:
>
> [x] GSE136103
>
> [x] E-MTAB-8581 - A cell atlas of human thymic development defines T cell repertoire formation --> github #180
>
> [x] E-MTAB-6149 - Single cell sequencing of lung carcinoma

@rays22 are the AE accessions that are not included in this list already available in both HCA format and SCEA format (if suitable according to the selection criteria)?

lauraclarke commented 4 years ago

The boxes

- Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards
- Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards

should only be checked if the datasets exist in a format which is suitable for each service and have passed all automatic validation we have for submission to that service.

If all these datasets have been validated against the HCA DCP metadata standards using the ingest validation, that is fantastic, but I would like @rays22 and @ami-day to tell me what you actually did to validate against the HCA standards.

javfg commented 4 years ago

I think we might be confusing "suitable for conversion" with "available in a compliant format" here.

ami-day commented 4 years ago

> So the GEO and ArrayExpress datasets have been validated using ingest against the HCA metadata standards and passed?

No, they have not been formally validated, they are awaiting review.

I am going to uncheck some of the above boxes, because they are causing some confusion. Can we please discuss this tomorrow morning? I think how we interpret the wording here is leading to the confusion.

ami-day commented 4 years ago

> The boxes
>
> - Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards
> - Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards
>
> should only be checked if the datasets exist in a format which is suitable for each service and have passed all automatic validation we have for submission to that service.
>
> If all these datasets have been validated against the HCA DCP metadata standards using the ingest validation, that is fantastic, but I would like @rays22 and @ami-day to tell me what you actually did to validate against the HCA standards.

We haven't completed this. I think there was confusion between compliance and actual conversion, as @javfg said.

lauraclarke commented 4 years ago

Very happy to discuss tomorrow what I meant by the statement

Is compliant with [HCA DCP/SCEA] metadata and data standards

It definitely seems to have created confusion

ami-day commented 4 years ago

Update since I wasn't around for stand-up today: @ESapenaVentura is going to review and go through his feedback for the converted GEO datasets with me later today. Once edited, I will speak to @javfg about running the HCA validator on these.

rays22 commented 4 years ago

I have updated my tasks in Map ArrayExpress format metadata fields to HCA fields #421

ami-day commented 4 years ago

Update:

@rolando-ebi and I have been through a couple of trial and error approaches to automate the GEO -> HCA format process. We found our initial process did not add very much value over manual conversion, due to limitations of the GEO metadata .soft file format.

Our new approach is outlined in diagrams in the following slides and in pseudocode/code in the following notebook. They illustrate a foundation which we will build upon with further automation and refinement of processes. As it is a generic framework, it could also apply to dataset types other than GEO.

Slides with conceptual overview: https://drive.google.com/drive/folders/1x7oduxUKfs6x3grWla58tsMNEi1hL_EY

Notebook with pseudocode/code: https://colab.research.google.com/drive/1FBa7r1ed5pW0DVjgzAKN7Nqpt09sgR27#scrollTo=aDUkKDmFXAeQ

ami-day commented 4 years ago

Update:

I have added a new repository geo_to_hca here: https://github.com/HumanCellAtlas/geo_to_hca

This repository contains a Python script and input files which can be used for semi-automated conversion of GEO metadata to the HCA metadata standard; the output is a partly filled HCA Excel spreadsheet.
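To illustrate the idea only (this is not the code in the repository), a partial fill could look like the sketch below: known GEO attributes are copied into HCA-style columns and everything else is left blank for a wrangler to complete. The `GEO_TO_HCA` mapping, the column names and the helper names are assumptions for illustration, and the output is CSV rather than Excel to keep the example free of third-party dependencies.

```python
import csv
import io

# Hypothetical mapping from GEO series attributes to HCA-style column names.
GEO_TO_HCA = {
    "Series_title": "project.project_core.project_title",
    "Series_summary": "project.project_core.project_description",
}

# Columns a wrangler would still need to fill in by hand (illustrative).
MANUAL_COLUMNS = ["project.project_core.project_short_name"]


def partial_fill(geo_attributes):
    """Map known GEO attributes to HCA-style columns; leave the rest blank."""
    row = {column: "" for column in list(GEO_TO_HCA.values()) + MANUAL_COLUMNS}
    for geo_key, hca_column in GEO_TO_HCA.items():
        if geo_key in geo_attributes:
            row[hca_column] = geo_attributes[geo_key]
    return row


def to_csv(row):
    """Serialise one row as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


# Hypothetical input, not taken from a real accession.
row = partial_fill({"Series_title": "Hypothetical example series"})
print(to_csv(row))
```

Blank cells make it obvious which fields still need manual curation, which matches the "semi-automated" scope described above.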

The script is a 1st version of the prototype process which works for some GEO accessions but not all; @ESapenaVentura and I are working to resolve this. I will create tickets for issues which need to be resolved and/or improvements which could be made for Version 2 and encourage others to do so too if interested.

@rolando-ebi also contributed to this code; I copied over code from the shared notebook into the python script seen here.

Of course all Wranglers and Developers are welcome to suggest feedback and upgrade the code to make it better.

@lauraclarke I am not sure if this or the above linked slide deck is a reasonable outcome for the tasks: "Process definition for the conversion" and "Proposals for increasing the automation of the process". What would be an ideal outcome?

lauraclarke commented 4 years ago

@rays22 @ami-day can this be considered finished now?

rays22 commented 4 years ago

I think we have not completed the Process definition for the conversion step. We need to document the conversion steps and tools in the process.

ami-day commented 4 years ago

@lauraclarke @rays22 I have written up the documentation for the GEO -> HCA and HCA -> SCEA conversion processes. I am not sure if there is documentation for AE -> HCA conversion? It would be good to have the links to all the documents, and possibly the documents themselves (as copies if needed), in one place. As I have mentioned, it might be good to link this to the processes diagram @Marion has created.

ami-day commented 4 years ago

Another link to add would be the instructions on how to upload and validate new HCA metadata using the HCA ingest portal (shared by Justin&Alegria)

rays22 commented 4 years ago

> I am not sure if there is documentation for AE -> HCA conversion?

@ami-day : The steps of the AE -> HCA conversion are the same as the ones you do when you fill in the HCA spreadsheet. They are covered in the documents:

How to use the HCA spreadsheet template generator and
HCA Metadata Spreadsheet Guide for Wranglers. These documents might need to be updated though.

As I understand it, the AE -> HCA data flow will be at a very low volume. However, if the AE -> HCA data flow were expected to be higher than I think, partial automation of the conversion would require further development time.

> It would be good to have the links to all the documents, and possibly the documents themselves (as copies if needed), in one place. As I have mentioned, it might be good to link this to the processes diagram @marion has created.

Yes, that would be nice.

lauraclarke commented 4 years ago

@rays22 the volume of AE to HCA submissions will be driven by how much people submit to AE rather than submitting to HCA DCP directly. We also need to do an assessment via AE of how many datasets suitable for DCP2 are in AE so we can prioritize their import.

Do you think we could close this epic and add a ticket to the backlog which reflected the outstanding assessment/documentation work that is needed for AE to HCA imports?

rays22 commented 4 years ago

> Do you think we could close this epic and add a ticket to the backlog which reflected the outstanding assessment/documentation work that is needed for AE to HCA imports?

@lauraclarke , OK, I will make a ticket for the outstanding work so we can close the ticket.

lauraclarke commented 4 years ago

Great, if you can point me to your new ticket when it is done and then close this one

thanks

rays22 commented 4 years ago

@lauraclarke , I have made a new ticket for the outstanding tasks: Assessment of how many suitable AE datasets are available for DCP2 and process definition for AE->HCA metadata conversion #1272
