HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Map HCA metadata entities and attributes to the Terra Core Data Model [KR1] #1267

Closed lauraclarke closed 3 years ago

lauraclarke commented 4 years ago

Map HCA metadata entities and attributes to the Terra Core Data Model [KR1]

This is a forward-looking ticket with two aims.

A lightweight mapping of the HCA schema entities (https://github.com/HumanCellAtlas/metadata-schema) to the Terra Core data model (https://github.com/DataBiosphere/terra-core-data-model/blob/master/documents/TerraCoreDataModel-Overview_Apr2020.jpg), primarily concerning the required attributes and fields used by the pipelines and browser rather than a full deep dive.

A gap analysis which gives us understanding about what is present in either the HCA metadata schema or the Terra Data Model but absent from the other.

The information generated here will be combined with lessons learned during the MVP phase so we can make plans for how to evolve both data models after the MVP phase is over.

Background info

Tasks

Acceptance criteria

NB

The bottom acceptance criterion has been removed, as that list of challenges won't be finalised until the MVP process is complete and this ticket shouldn't need to exist for that length of time.

mshadbolt commented 4 years ago

Questions for @lauraclarke :

  1. What format would the 'map' be? Would it be acceptable to fill in the 'HCA' columns within Kathy Reinold's spreadsheet above?

  2. How do we capture fields that are part of HCA metadata schema that aren't within the TerraCore model?

Proposed plan:

lauraclarke commented 4 years ago

Thinking about this more, I wonder if this task and https://github.com/HumanCellAtlas/metadata-schema/issues/1269 are very tightly coupled and can't be considered independent of each other.

Happy to merge them

On the second question, my understanding right now is that there can be BigQuery tables for entities which aren't represented in the TDR, but this would be an important question to ask the folks who support the authoring of the BigQuery table schema.

kreinold commented 4 years ago

If I may weigh in, I would suggest that yes, please do add the HCA column data. Please note this is still a work-in-progress on our side too! We will be updating some of the Terra Core and Terra BigQuery columns as well. This will be a great way to identify issues.

For fields that don't appear in the Terra Core model, it will complicate things on our side to add these, since we're trying to make a clean separation between searchable fields supported across datasets and other fields. However, this can be easily accommodated if we agree on some way to differentiate those rows. Perhaps color-coding them and putting them at the bottom?

Laura, your thoughts?

Kathy Reinold Principal Data Modeler, Data Sciences Platform The Broad Institute of Harvard and MIT 105 Broadway - 359M Cambridge, MA 02142

lauraclarke commented 4 years ago

@kreinold On the mapping, I think the current proposal has every HCA entity getting its own table in the TDR schema: https://docs.google.com/document/d/1NsibP8g-NeLnksxlcBWQsSj5Zg_uimCDaAF1qB_qkjg/edit#heading=h.aee8p3q0sp0m

kreinold commented 4 years ago

Yes, but is this ticket intended to identify the mapping for the future as opposed to the MVP?

lauraclarke commented 4 years ago

We have a lot to do here. I am happy to do a certain level of look-ahead mapping, but I don't want to invest a lot of time in advanced mapping when we don't yet know what lessons we will learn from the first import, which will influence future decisions.

Understanding the gaps is important but making sure it works for the MVP is more valuable than doing a detailed mapping for a system that hasn't been designed yet.

kreinold commented 4 years ago

Do we want to add HCA Schema fields that are not yet in the Terra Core DM at all then? I'd be happy to learn what key fields are missing from Terra Core, but I understand this is a lower priority. The Terra BigQuery columns that I added were meant to capture the next iteration rather than the MVP BigQuery tables; they may agree, but we will need to confirm with the Ingest team.

lauraclarke commented 4 years ago

Sounds like a discussion with @ESapenaVentura and @ami-day about what the goal is. I think understanding the differences between the HCA schema and the standard Terra Core Data Model is a great plan

I want to be able to draw a line under the comparison during this sprint, though, so we don't plan how to resolve these differences in the absence of lessons learnt from the MVP, if that makes sense.

Maybe the best outcome is a broad understanding of the gaps (e.g. that protocol isn't represented as an entity, and covering all our required and commonly used fields but not necessarily a rarely used field such as nutritional_state), plus a list of decisions to make and questions to ask in light of the lessons from the MVP and an improved understanding of the requirements.

How does that sound? I can update the ticket to reflect this.

kreinold commented 4 years ago

Excellent.

lauraclarke commented 4 years ago

@ami-day @zperova @kreinold I have updated the ticket description, please let me know if anything is unclear

zperova commented 4 years ago

@ami-day please let me know once you progress on any chunk of work towards this in the spreadsheet, and I will gladly review.

kreinold commented 4 years ago

I'd like to remove columns for EBI Biosamples and NCBI Biosamples. These played a role in our thinking about the data model and I thought they might be useful, but I don't see any value. I'll make those changes. Additionally, at this time the Terra BigQuery Schema mapping is not useful and is incomplete; I'd like to remove these columns to avoid confusion. I will wait 24 hours on this last change in case there are objections.

lauraclarke commented 4 years ago

@kreinold Moving forward, all newly contributed data will always have EBI BioSamples IDs (worth noting that the EBI and NCBI BioSample databases are peers of each other, so the accessions are interoperable), so I don't know if that influences your desire to remove them.

kreinold commented 4 years ago

Thank you; good to know. I don't think we lose anything by removing them here.

ami-day commented 4 years ago

@lauraclarke @kreinold is there an example HCA project file available in the Terra Data Model format? To be very honest, the model definitions, where provided, are not always clear to me; they appear a bit general or lacking in context (which I am sure is for good reason). The mapping is therefore proving difficult at times. An example template would really help.

lauraclarke commented 4 years ago

@ami-day the HCA data hasn't been indexed in Terra at all so there won't be any examples

Can you let us know what type of mapping you are finding difficult so we can help?

ami-day commented 4 years ago

It might be easier if I send my first iteration of the sheet once done, as there are too many examples to list here. It is not a 'type of mapping' as such; it is more that I don't understand what a lot of the Terra data model fields are referring to. In those cases, I have entered an HCA field which I think might be appropriate, or entered "N/A". I'll also discuss with @zperova tomorrow, but just thought a template would help.

ami-day commented 4 years ago

Here is my first draft that @zperova will review (we'll have a call tomorrow afternoon): https://docs.google.com/spreadsheets/d/14rMAmiDpqjDWMEl6dm_P8tTyZz9nxy4sjMOv4GrCLXA/edit#gid=1582592642

It looks as though a direct mapping may not be possible due to differences in how each schema is structured and represents relationships between entities (we have experienced this when converting between different formats in the past, so it is not a surprise).

@kreinold if we are not able to get an example single cell Terra Data Model template, would it be possible to get a relevant contact/team email or what resources would be a good place to start in order to find examples of filled-in schema, for example: whole genomes metadata; bulk RNA sequencing metadata; microarray data; any biological experimental context which is available? If we were able to search the project/publication and compare alongside the model schema that would leave us in a better place to understand what the schema fields represent in a biological (or better) sequencing context.

kreinold commented 4 years ago

Sorry about not keeping up with the conversation today. Any chance we can meet tomorrow? I'm available on Thursday, 30 Apr from 1:30-3:30 and 4-5pm BST. I'm not quite sure what you mean by a template. But I'll be happy to put together an example.

ami-day commented 4 years ago

@zperova and I have a meeting today to discuss the mapping at 4-5pm BST, so it would be great to meet then. @lauraclarke if you would like to join too, is 4-5pm today ok for you?

lauraclarke commented 4 years ago

I have another meeting at 4pm today but I shouldn't be needed. It will be great to hear an update tomorrow

kreinold commented 4 years ago

4pm BST today is perfect. Would you kindly email an invitation?

ami-day commented 4 years ago

Great, I think @zperova has sent a meeting invite, please let us know if you don't receive it. Thanks!

zperova commented 4 years ago

Outcomes of the meeting:

  1. @kreinold will send a template to facilitate mapping
  2. @zperova will send suitable times for HCA team to have a meeting on TDR/BQ with Kathy and Dan

ami-day commented 4 years ago

@zperova I saw we had a meeting set up with the TDR schema team but now I can't find it in my calendar, is this correct?

kreinold commented 4 years ago

I will be scheduling a meeting to discuss the Terra Core Data Model to Terra Data Repository BigQuery Schema Transition: objectives and strategies. The dates suggested by Zina are from May 13-20. I have not yet scheduled this (if this is the meeting you're thinking about).

lauraclarke commented 4 years ago

@ami-day @zperova are the documents which are being created by this process living in a shared space? Such as https://drive.google.com/drive/u/0/folders/1EkYghcIUC_SOHkE74PsFLYPW0dWP4MWf

zperova commented 4 years ago

@lauraclarke they are in the shared space but not in the DCP2.0 folder. We have continued to use the Brokering folder. Should everything be saved in the DCP2.0 folder now? In this case, I would create a Brokering folder there.

lauraclarke commented 4 years ago

So the best arrangement of documents for DCP2 planning hasn't been discussed.

That said the brokering folder doesn't feel like the right location for documents about the data model and schema

zperova commented 4 years ago

You are right. There are many things in Brokering that would be better suited elsewhere since maintenance mode. It would be good to reorganize once the plan for DCP2.0 documentation is in place. @ami-day could you please move/copy the file here to Metadata/Mapping: https://drive.google.com/drive/folders/1pUqOsDTSgZDnyymX3lxcNAT8uMsUw8Ll
I think this is no longer considered draft now. Thank you!

ami-day commented 4 years ago

Ok, what do you mean it is no longer considered draft? I just created a new draft version which gives a better idea for gap analysis, will add both drafts to the above location

zperova commented 4 years ago

@ami-day I was referring to the fact that it is in the Drafts folder - no longer a draft in that sense. Thank you!

ami-day commented 4 years ago

Oh I see, ok. I created a sub-folder in the mapping folder, 'Terra Data Model - HCA', as I wasn't sure how general this mapping folder is/will be. Here's the link: https://drive.google.com/drive/folders/1UaCt3Gxtb77aXPe_v5NfrzYzPvz3vElO

ami-day commented 4 years ago

I have completed all that I could but I would say it needs a review by both an HCA and Terra Data Model set of eyes.

lauraclarke commented 4 years ago

@ami-day is it accurate to say there are two outstanding tasks? The first is for you and Zina/Marion to meet with the TDR BigQuery schema team.

The second is for someone else from the EBI team and @kreinold or someone else from the Broad side to review your two documents:

Data model map https://drive.google.com/open?id=14rMAmiDpqjDWMEl6dm_P8tTyZz9nxy4sjMOv4GrCLXA

and gap analysis https://docs.google.com/spreadsheets/d/1GqyMbgHz9NixaNZfzNEwpaSg3pWCXNuIyZrAuvK8zYI/edit#gid=1827185139

We should discuss with @kreinold @zperova @mshadbolt and others how we turn your list of present/absence fields into an assessment if we can meet the user goals with the metadata and if any changes are needed on either side to support the user needs. This feels like it might be better as a new ticket. what do you think?

mshadbolt commented 4 years ago

Was the gap analysis supposed to cover all our schema or only a subset?

zperova commented 4 years ago

@mshadbolt I believe the gap analysis should cover the whole schema, in which case it is not yet finished.

lauraclarke commented 4 years ago

@mshadbolt @zperova The whole schema is a big task.

Do you think it might be better to consider dividing up the work and targeting high priority by use case or some other sectioning so we can move forward on high-value metadata attributes before getting down into the depths of the full schema?

mshadbolt commented 4 years ago

I don't think I understand the goal of the gap analysis or what it would be used for so wouldn't know how to determine a use case, sectioning or what we are moving forward towards.

zperova commented 4 years ago

The division I can think of now is probably to exclude everything related to organoid, cell line, and imaging.

lauraclarke commented 4 years ago

I have underestimated the complexity of this task and think we are likely best doing a use-case-driven review of how the HCA and TDR data models and schemas line up, and using that to drive the transition.

An obvious use case would be to support analysis pipeline queries to find projects of a given type to run their pipelines on, and how they need the metadata presented rather than having to parse the individual JSON documents.

In general, we need to understand how to bridge the different approaches for the TDR data model and schema and the HCA standards and schema, the crux of that being Terra focuses on discoverability and HCA standards focus on interoperability and re-usability.

Ultimately, the HCA Terra schema needs to enable the data to be discoverable in the Terra interfaces and the metadata to be accessible in a form that works for the pipelines and browser. Not all the HCA metadata necessarily needs to be part of a Terra schema if it isn't needed for query, but any that isn't available through the core schema needs to be accessible to those users who do need it.

A quick scan of @ami-day's document indicates existing differences, and we need to decide when they should be resolved. For example, looking at the specimen_from_organism tab, the Terra data model doesn't seem to have a disease field for biosamples, only for donors. We added a disease field for specimens because, while a donor either does or doesn't have a disease, not every tissue or organ in a body is affected by the disease (most obviously with cancer), and this field is meant to help draw that distinction.

@kreinold do you have thoughts on the best way to proceed here?

kreinold commented 4 years ago

Yes, I think a use-case driven approach will help to focus our efforts. And the Terra Core DM is still a work-in-progress. Specifically, we came to the same conclusion you did (albeit later than you) that we need to capture the disease state of a BioSample (and its subclass SingleCell). We're driving rapidly to finalize our formal V1 very soon. I have collected some use cases from other single cell projects and will be happy to share. However, they may be higher level than what you'd like to use. I'd like to collaborate on this.

ami-day commented 4 years ago

Just to reply to various earlier comments from @lauraclarke and @mshadbolt: this was quite a big task as Laura said, and so I focused on the HCA metadata tabs that I have consistently filled out when creating HCA metadata. I avoided tabs such as "Cell Line" and "Aggregation Protocol" which I have never or rarely used.

A meeting about next steps would be great; I think addressing "how we turn [your] list of present/absence fields into an assessment if we can meet the user goals with the metadata" and making assessments about the mapping from the perspective of defined use cases would be valuable.

lauraclarke commented 4 years ago

I propose we close this issue now and then create a new task which involves

  1. Identifying and prioritizing use cases where metadata being directly queryable from the TDR schema is important
  2. Using @ami-day's analysis as a starting point for deciding what changes, if any, are needed to either data model/schema to meet these use cases

Please thumbs up if that seems like a reasonable set of next steps. If so, I can close this issue and create the new tasks in the https://github.com/ebi-ait/hca-ebi-wrangler-central repo, as this isn't a schema modification task and so needs to be tracked elsewhere.

ESapenaVentura commented 3 years ago

I don't think we still need this, do we @clairerye ?

clairerye commented 3 years ago

Correct, let's close it. We can re-open if we ever come back to it.