HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Prototype process to FAIRify datasets in ArrayExpress and GEO #1242

Closed lauraclarke closed 4 years ago

lauraclarke commented 4 years ago

The goal of this ticket is to prototype a FAIRification process for datasets that are in ArrayExpress and GEO but don't exist in either DCP or SCEA

Proposed datasets

https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5060/ - Already in SCEA (E-MTAB-5060)

Chosen - https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6149/ - Already in SCEA (E-MTAB-6653)

Chosen - https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8581/ (Added later by @rays22)

https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6653/

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114156

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109564

Chosen - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131685

Chosen - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136103

Tasks

Acceptance criteria

lauraclarke commented 4 years ago

@ami-day @rolando-ebi @javfg @rays22 These are the initial tasks and acceptance criteria for this issue. We will discuss them further in sprint planning, but please let me know here or on Slack if you have any other questions

lauraclarke commented 4 years ago

We should avoid taking anything the SCEA already have in hand

https://docs.google.com/spreadsheets/d/1n0Ou-8w_CjpBVQ_j2z1yfAjdBIZkxXR4Vsrqik5o81A/edit?usp=sharing

ami-day commented 4 years ago

@javfg @clairerye @rolando-ebi @rays22 When is a good time this afternoon to meet to discuss? How about 3pm?

javfg commented 4 years ago

We should avoid taking anything the SCEA already have in hand

https://docs.google.com/spreadsheets/d/1n0Ou-8w_CjpBVQ_j2z1yfAjdBIZkxXR4Vsrqik5o81A/edit?usp=sharing

First two proposed datasets are already in SCEA. I updated the Issue with this info.

ami-day commented 4 years ago

Link to explanation of GEO metadata file format: https://www.ncbi.nlm.nih.gov/geo/info/soft.html#format
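For anyone who has not worked with SOFT before, here is a minimal sketch of how the layout described at that link could be read in Python. It relies only on the line prefixes from the GEO documentation: `^` starts an entity (series, sample, platform), `!` is an attribute of the current entity, and `#` lines and table rows describe data tables. The function name `parse_soft` and the snippet in `EXAMPLE_SOFT` (including its accessions and values) are made up for illustration; this is not taken from any existing HCA tool.

```python
from collections import defaultdict

# Made-up SOFT snippet; the accessions and values are placeholders.
EXAMPLE_SOFT = """\
^SERIES = GSE0000000
!Series_title = Example series
^SAMPLE = GSM0000001
!Sample_title = liver sample 1
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = tissue: liver
!Sample_characteristics_ch1 = age: 45
"""


def parse_soft(text):
    """Group SOFT attribute lines under the entity they belong to.

    Returns {(entity_type, entity_name): {label: [values, ...]}}.
    """
    entities = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("^"):
            # Entity indicator line, e.g. "^SAMPLE = GSM0000001".
            kind, _, name = line[1:].partition("=")
            current = (kind.strip(), name.strip())
            entities[current] = defaultdict(list)
        elif line.startswith("!") and current is not None:
            # Attribute line; a label may repeat, so values are kept in a list.
            label, _, value = line[1:].partition("=")
            label = label.strip()
            if label.endswith(("_table_begin", "_table_end")):
                continue  # table delimiters carry no metadata value
            entities[current][label].append(value.strip())
        # "#" header lines and data table rows are ignored in this sketch.
    return entities


parsed = parse_soft(EXAMPLE_SOFT)
sample = parsed[("SAMPLE", "GSM0000001")]
print(sample["Sample_characteristics_ch1"])
```

Repeated labels such as `Sample_characteristics_ch1` are kept as lists because a sample usually carries several `key: value` characteristics, and those are the fields most likely to need mapping by hand.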

justincc commented 4 years ago

Whiteboard full and closeups from discussion at 2020-02-19:1530

[Images: four whiteboard photos, IMG_20200219_162707, IMG_20200219_162726, IMG_20200219_162728, IMG_20200219_162731]

ami-day commented 4 years ago

@lauraclarke, can we assume the above GEO datasets are fully open and not managed access, given they have a GEO accession?

ami-day commented 4 years ago

Check each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards: GEO datasets

GSE136103

GSE109564 N.B. same publication as GSE114156 and already converted prior to epic

GSE114156 N.B. same publication as GSE109564 and already converted prior to epic

GSE131685 - already converted prior to epic

Should we convert GSE114156 and GSE109564 into 1 HCA file since they are derived from the same publication/project?

ami-day commented 4 years ago

Check ArrayExpress datasets against suitability criteria for DCP

The following two sets are related by citation: Phenotype molding of stromal cells in the lung tumor microenvironment. Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, Bassez A, Decaluwé H, Pircher A, Van den Eynde K, Weynand B, Verbeken E, De Leyn P, Liston A, Vansteenkiste J, Carmeliet P, Aerts S, Thienpont B. Nature Medicine (2018), PMID:29988129

ami-day commented 4 years ago

Copying and pasting Silvie's email about SCEA requirements here:

"For now we have analysis pipelines for droplet-based (not just 10x, can process drop-seq as well) and smart-like technologies, so that’s the main selection hurdle.

Secondly, the organism must have a reference genome in Ensembl, which obviously is not gonna be an issue for HCA experiments.

Otherwise, we tend to only take experiments that investigate a biological question (not a technical proof-of-principle, comparison of technologies sort of studies, although we may take a subset from those as a baseline description study if the dataset is nice or using an interesting sample).

As to biological replicates, we are not strictly enforcing any limit at the moment but at least three independent biological samples/replicates are preferable (although really not a hard and fast rule).

There is no hard limit for the number of cells per experiment either (I mean the analysis pipeline does have some quality filters of course, including minimal number of cells but it is set very low to be permissive).

Overall, I’d say we’re really quite flexible and inclusive at the moment, so if the technology is right (and the ref genome is there) it’s unlikely the dataset would get rejected. Some of the above is more about prioritising and using our curation time effectively rather than absolute red lines."
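To make the email easier to apply dataset by dataset, here is a rough sketch of the criteria as a Python check. It is only my reading of the email, not an SCEA tool: the two "hurdles" (technology and reference genome) are treated as blockers, and everything Silvie describes as prioritisation is treated as advisory. The class `Candidate`, the function `assess`, the technology labels and the example accession are all hypothetical.

```python
from dataclasses import dataclass

# Technology families the email says SCEA has analysis pipelines for.
# The label strings are assumptions made for this sketch.
SUPPORTED_TECHNOLOGY_TYPES = {"droplet", "smart-like"}


@dataclass
class Candidate:
    accession: str
    technology_type: str          # "droplet" or "smart-like" in this sketch
    has_ensembl_reference: bool   # organism has a reference genome in Ensembl
    biological_question: bool     # False for purely technical studies
    biological_replicates: int


def assess(candidate):
    """Return (blockers, advisories) for one candidate dataset."""
    blockers, advisories = [], []
    if candidate.technology_type not in SUPPORTED_TECHNOLOGY_TYPES:
        blockers.append("no analysis pipeline for this technology")
    if not candidate.has_ensembl_reference:
        blockers.append("no reference genome in Ensembl")
    if not candidate.biological_question:
        advisories.append("technical study; may only be taken as a baseline subset")
    if candidate.biological_replicates < 3:
        # Preferred, but explicitly not a hard and fast rule.
        advisories.append("fewer than three biological replicates")
    return blockers, advisories


# Placeholder dataset, not one of the accessions in this ticket.
example = Candidate("EXAMPLE-1", "droplet", True, True, 2)
blockers, advisories = assess(example)
print(blockers, advisories)
```

Keeping blockers and advisories in separate lists mirrors the point in the email that most of these criteria are about prioritising curation time rather than absolute red lines.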

ami-day commented 4 years ago

Check each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards (see Silvie email above): GEO datasets

GSE136103

GSE109564 - already converted prior to epic

GSE114156 - already converted prior to epic

GSE131685 - already converted prior to epic

lauraclarke commented 4 years ago

@ami-day @rays22 @javfg @rolando-ebi Where are we with the datasets which should be targeted by this list? Can all of the top tasks be completed?

rolando-ebi commented 4 years ago

@lauraclarke for "Proposals for increasing the automation of the process", I'm still working on mapping utils, currently testing on a sample dataset

lauraclarke commented 4 years ago

Thanks for the update, @rolando-ebi. I am most interested in the following tasks right now, as it feels like they should be done

javfg commented 4 years ago

I am spending three days fixing up as much as possible on the HCA->SCEA side, as we agreed. So far I'm trying to improve how the spreadsheets are extracted.

ami-day commented 4 years ago

@lauraclarke, I checked the 4 GEO datasets against the DCP and SCEA suitability criteria; a summary is shown in my comments above. They are all suitable if there are no "absolute red lines", as Silvie said. The only GEO dataset left to convert to HCA format is GSE136103, so that should be in our list of accessions to convert.

I didn't tick the boxes in the task list above, as I am not sure of the progress regarding the AE accessions. I think we all agreed @rays22 would be working on those, so I will leave him to comment on that, while I continue converting GSE136103.

lauraclarke commented 4 years ago

Thanks @ami-day. Will GSE109564 and GSE114156 be pushed to SCEA given they can't be analysed?

rays22 commented 4 years ago

These two ArrayExpress datasets meet the suitability criteria for SCEA:

Define which 4 datasets will be targeted by this ticket

I have selected only 3 datasets, because the other candidates are already in SCEA.

@ami-day please, check my selection of datasets to be targeted:

lauraclarke commented 4 years ago

So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue?

Is cell-type annotation available for them? That would be a fantastic stretch goal

javfg commented 4 years ago

The two datasets Ami worked on before the epic are running through the HCA - SCEA converter correctly. But they are marked as "awaiting review", so, do they have to be reviewed before starting that process?

Also I have some doubts about the protocols in one of them I want to discuss with you tomorrow, @ami-day.

ami-day commented 4 years ago

Thanks @ami-day. Will GSE109564 and GSE114156 be pushed to SCEA given they can't be analysed?

That is something I will need to ask Silvie about. I'll ask and get back. It may be that they will soon be ready to accept other formats, and may still be worth converting but low priority.

lauraclarke commented 4 years ago

Can someone update the ticket with the chosen datasets?

Was any automatic extraction of these datasets from ArrayExpress or GEO tried?

These two tasks have been checked as done, is that true? Are there validation reports both from DCP and SCEA?

Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards

Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards

ami-day commented 4 years ago

So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue?

Is cell-type annotation available for them? That would be a fantastic stretch goal

I think there might have been a misunderstanding here.

"Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards": yes, 3 GEO and 2 AE are

"Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards": I haven't converted any of the above GEO datasets to SCEA format (only HCA format), and I understand @javfg is in the process of doing this. @rays22, are the AE datasets you specified for conversion already converted to the HCA format and SCEA format?

lauraclarke commented 4 years ago

So the GEO and ArrayExpress datasets have been validated using ingest against the HCA metadata standards and passed?

javfg commented 4 years ago

So we already have all three datasets in formats which are suitable for DCP and SCEA, that is fantastic. Is someone working on pushing them into the SCEA queue? Is cell-type annotation available for them? That would be a fantastic stretch goal

I think there might have been a misunderstanding here.

"Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards": yes, 3 GEO and 2 AE are

"Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards": I haven't converted any of the above GEO datasets to SCEA format (only HCA format), and I understand @javfg is in the process of doing this. @rays22, are the AE datasets you specified for conversion already converted to the HCA format and SCEA format?

I am not yet working on converting them. I am trying to fix the converter so it benefits from what we learned from converting the previous 5 datasets we did for SCEA. I've run some tests and it looks good.

But as I grabbed the xlsx from the "Ami finished - awaiting to be reviewed" folder in the drive, I am not sure if they are ready to start converting right away. I don't think it is a good idea to convert something that might need changes.

ami-day commented 4 years ago

These two ArrayExpress datasets meet the suitability criteria for SCEA:

Define which 4 datasets will be targeted by this ticket

I have selected only 3 datasets, because the other candidates are already in SCEA.

@ami-day please, check my selection of datasets to be targeted:

@rays22 are the AE accessions that are not included in this list available in both HCA format and SCEA format already (if suitable according to the selection criteria)?

lauraclarke commented 4 years ago

The boxes

Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards

Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards

should only be checked if the datasets exist in a format which is suitable for each service and have passed all automatic validation we have for submission to that service.

If all these datasets have been validated against the HCA DCP metadata standards using the ingest validation, that is fantastic, but I would like @rays22 and @ami-day to actually tell me what you did to validate against the HCA standards

javfg commented 4 years ago

I think we might be confusing "suitable for conversion" with "available in a compliant format" here.
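One way to keep the two meanings apart would be to track an explicit status per dataset and per service, and only tick a box once every dataset has passed that service's automatic validation. The sketch below is just an illustration of that rule; the `Status` names, the function `box_can_be_checked` and the placeholder dataset names are all hypothetical, and nothing like this exists in our tooling as far as I know.

```python
from enum import Enum


class Status(Enum):
    SUITABLE_FOR_CONVERSION = "meets the service's selection criteria"
    CONVERTED = "exists in the service's format, not yet validated"
    VALIDATED = "passed the service's automatic validation"


def box_can_be_checked(statuses):
    """True only if there is at least one dataset and all are VALIDATED."""
    return bool(statuses) and all(
        status is Status.VALIDATED for status in statuses.values()
    )


# Placeholder datasets; these are not statuses of the real accessions.
hca_statuses = {
    "dataset-A": Status.CONVERTED,
    "dataset-B": Status.VALIDATED,
}
scea_statuses = {
    "dataset-A": Status.SUITABLE_FOR_CONVERSION,
    "dataset-B": Status.SUITABLE_FOR_CONVERSION,
}

print("HCA box:", box_can_be_checked(hca_statuses))
print("SCEA box:", box_can_be_checked(scea_statuses))
```

With the placeholder values above neither box would be ticked, because a dataset that is converted but still awaiting review does not count as validated.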

ami-day commented 4 years ago

So the GEO and ArrayExpress datasets have been validated using ingest against the HCA metadata standards and passed?

No, they have not been formally validated, they are awaiting review.

I am going to go and uncheck some of the above boxes, because they are causing some confusion. Please can we discuss tomorrow morning? I think how we interpret the wording here may be leading to confusion

ami-day commented 4 years ago

The boxes

Each of the specified datasets is available in a form that is compliant with HCA DCP metadata and data standards

Each of the specified datasets is available in a form that is compliant with SCEA metadata and data standards

should only be checked if the datasets exist in a format which is suitable for each service and have passed all automatic validation we have for submission to that service.

If all these datasets have been validated against the HCA DCP metadata standards using the ingest validation, that is fantastic, but I would like @rays22 and @ami-day to actually tell me what you did to validate against the HCA standards

We haven't completed this. I think there was confusion between compliance and actual conversion, as @javfg said

lauraclarke commented 4 years ago

Very happy to discuss tomorrow what I meant by the statement

Is compliant with [HCA DCP/SCEA] metadata and data standards

It definitely seems to have created confusion

ami-day commented 4 years ago

Update since I wasn't around for stand-up today: @ESapenaVentura is going to review and go through his feedback for the converted GEO datasets with me later today. Once edited, I will speak to @javfg about running the HCA validator on these.

rays22 commented 4 years ago

I have updated my tasks in Map ArrayExpress format metadata fields to HCA fields #421

ami-day commented 4 years ago

Update:

@rolando-ebi and I have been through a couple of trial and error approaches to automate the GEO -> HCA format process. We found our initial process did not add very much value over manual conversion, due to limitations of the GEO metadata .soft file format.

Our new approach is outlined in diagrams in the following slides and in pseudocode/code in the following notebook. They illustrate a foundation which we will build upon with further automation and refinement of processes. As it is a generic framework, it could also apply to dataset types other than GEO.

Slides with conceptual overview: https://drive.google.com/drive/folders/1x7oduxUKfs6x3grWla58tsMNEi1hL_EY

Notebook with pseudocode/code: https://colab.research.google.com/drive/1FBa7r1ed5pW0DVjgzAKN7Nqpt09sgR27#scrollTo=aDUkKDmFXAeQ

ami-day commented 4 years ago

Update:

I have added a new repository geo_to_hca here: https://github.com/HumanCellAtlas/geo_to_hca

This repository contains a Python script and input files which can be used for semi-automated conversion of GEO metadata to the HCA metadata standard; the output is a part-filled HCA Excel spreadsheet.
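To give a feel for what "semi-automated" and "part-filled" mean here, below is a small illustrative sketch; it is not the code in the repository. It assumes GEO sample characteristics arrive as `key: value` strings, maps the keys it recognises onto HCA-style column names, leaves every other column blank, and reports what still needs a wrangler. The column names and the `CHARACTERISTIC_TO_COLUMN` table are assumptions that should be checked against the current HCA schema, the sample values are made up, and CSV is used instead of Excel only to keep the sketch dependency-free.

```python
import csv
import io

# Assumed HCA-style column names; verify against the current schema.
HCA_COLUMNS = [
    "specimen_from_organism.biomaterial_core.biomaterial_id",
    "specimen_from_organism.biomaterial_core.biomaterial_name",
    "specimen_from_organism.organ.text",
    "donor_organism.organism_age",
]

# Hypothetical mapping from GEO characteristic keys to HCA columns.
CHARACTERISTIC_TO_COLUMN = {
    "tissue": "specimen_from_organism.organ.text",
    "age": "donor_organism.organism_age",
}


def sample_to_row(sample_id, title, characteristics):
    """Build one part-filled row and list the characteristics left unmapped."""
    row = dict.fromkeys(HCA_COLUMNS, "")
    row["specimen_from_organism.biomaterial_core.biomaterial_id"] = sample_id
    row["specimen_from_organism.biomaterial_core.biomaterial_name"] = title
    unmapped = []
    for item in characteristics:
        key, separator, value = item.partition(":")
        column = CHARACTERISTIC_TO_COLUMN.get(key.strip().lower())
        if separator and column:
            row[column] = value.strip()
        else:
            unmapped.append(item)  # left for manual curation
    return row, unmapped


def rows_to_csv(rows):
    """Serialise rows with the HCA-style header as the first line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HCA_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample; the accession and values are placeholders.
row, unmapped = sample_to_row(
    "GSM0000001", "liver sample 1", ["tissue: liver", "age: 45", "batch: 2"]
)
csv_text = rows_to_csv([row])
print(csv_text)
print("Needs manual curation:", unmapped)
```

Returning the unmapped characteristics alongside the row keeps the manual part of the process visible, which matters because free-text GEO characteristics rarely line up one-to-one with HCA fields.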

The script is a 1st version of the prototype process which works for some GEO accessions but not all; @ESapenaVentura and I are working to resolve this. I will create tickets for issues which need to be resolved and/or improvements which could be made for Version 2 and encourage others to do so too if interested.

@rolando-ebi also contributed to this code; I copied over code from the shared notebook into the python script seen here.

Of course all Wranglers and Developers are welcome to suggest feedback and upgrade the code to make it better.

@lauraclarke I am not sure if this or the above linked slide deck is a reasonable outcome for the tasks: "Process definition for the conversion" and "Proposals for increasing the automation of the process". What would be an ideal outcome?

lauraclarke commented 4 years ago

@rays22 @ami-day can this be considered finished now?

rays22 commented 4 years ago

I think we have not completed the Process definition for the conversion step. We need to document the conversion steps and tools in the process.

ami-day commented 4 years ago

@lauraclarke @rays22 I have written up the documentation for the GEO -> HCA and HCA -> SCEA conversion processes. I am not sure if there is documentation for AE -> HCA conversion? It would be good to have links to all the documents, and possibly the documents themselves (as copies if needed), in 1 place. As I have mentioned, it might be good to link them to the processes diagram @Marion has created.

ami-day commented 4 years ago

Another link to add would be the instructions on how to upload and validate new HCA metadata using the HCA ingest portal (shared by Justin & Alegria)

rays22 commented 4 years ago

I am not sure if there is documentation for AE -> HCA conversion?

@ami-day : The steps of the AE -> HCA conversion are the same as the ones you do when you fill in the HCA spreadsheet. They are covered in the documents:

As I understand it, the AE -> HCA data flow will be at a very low volume. However, if the AE to HCA data flow were expected to be higher than I think, then partial automation of the conversion would require further development time.

It would be good to have links to all the documents, and possibly the documents themselves (as copies if needed), in 1 place. As I have mentioned, it might be good to link them to the processes diagram @marion has created.

Yes, that would be nice.

lauraclarke commented 4 years ago

@rays22 the volume of AE to HCA submissions will be driven by how much people submit to AE rather than submitting to HCA DCP directly. We also need to do an assessment via AE of how many datasets suitable for DCP2 are in AE so we can prioritize their import.

Do you think we could close this epic and add a ticket to the backlog which reflected the outstanding assessment/documentation work that is needed for AE to HCA imports?

rays22 commented 4 years ago

Do you think we could close this epic and add a ticket to the backlog which reflected the outstanding assessment/documentation work that is needed for AE to HCA imports?

@lauraclarke , OK, I will make a ticket for the outstanding work so we can close the ticket.

lauraclarke commented 4 years ago

Great, if you can point me to your new ticket when it is done and then close this one

thanks

rays22 commented 4 years ago

@lauraclarke , I have made a new ticket for the outstanding tasks: Assessment of how many suitable AE datasets are available for DCP2 and process definition for AE->HCA metadata conversion #1272