HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Update metadata in loom files for March release #1219

Closed zperova closed 4 years ago

zperova commented 4 years ago

Description

As a member of the HCA Metadata & Wrangler team, responsible for the accuracy of the data and metadata presented to data consumers on the Data Portal, I would like to update all sample and experimental metadata for the 13 human datasets chosen for the March release, in a form suitable for downstream use.

The March release will present clusters per organ per project with cell annotations (ideally ontologized) for the chosen 13 human projects currently in the Data Portal. This document lists the projects and tracks progress on each of them.

The format for updates is loom, obtained from the Data Portal by following the instructions.

Deadline: 7 FEB 2020

Acceptance Criteria

zperova commented 4 years ago

All single-organ looms are in the S3 bucket.

dosumis commented 4 years ago

Do you have a proposal for how clusters will be represented and annotated in Loom files?

zperova commented 4 years ago

All looms except the one for the Fetal-Maternal interface are in the S3 bucket. I am having problems getting matrices for this dataset that are split by organ but not by technology. Need to confirm with pipelines about splitting per technology.

zperova commented 4 years ago

@dosumis The current plan is to add column attributes for the cluster ID, annotated cell type and ontology to the loom files. What do you think about this?
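A minimal sketch of what such column attributes could look like, with one value per cell. The attribute names (`cluster_id`, `cell_type`, `cell_type_ontology`) and the example values are assumptions made for illustration; the thread does not fix the actual keys.

```python
# Hypothetical attribute names and example values; not the agreed HCA keys.
def build_column_attrs(cluster_ids, cell_types, ontology_ids):
    """Return per-cell column attributes, one value per cell (column)."""
    n_cells = len(cluster_ids)
    if not (len(cell_types) == len(ontology_ids) == n_cells):
        raise ValueError("every column attribute needs one value per cell")
    return {
        "cluster_id": list(cluster_ids),
        "cell_type": list(cell_types),
        "cell_type_ontology": list(ontology_ids),
    }


# Three cells: two in cluster 0, one in cluster 1.
column_attrs = build_column_attrs(
    cluster_ids=[0, 0, 1],
    cell_types=["T cell", "T cell", "B cell"],
    ontology_ids=["CL:0000084", "CL:0000084", "CL:0000236"],
)
```

With loompy, each of these lists would typically be converted to a NumPy array and assigned as a column attribute on an open file (for example through `ds.ca`), so that every cell carries its cluster, label and ontology term.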

KrisDavie commented 4 years ago

Hey @zperova, we came across this issue and thought that we may be able to provide some input here.

In the context of the Fly Cell Atlas and as one of the lead developers of the single cell analysis viewer SCope (http://scope.aertslab.org), we also use the loom format for our main data store. We came across the same problem of storing this kind of metadata in the loom files, and have developed a JSON-based schema which we now embed in the loom file. Initially, we ran into issues due to the small space available for global attributes in the loom file, but after some discussion with Sten Linnarsson, the loom API was updated to provide unlimited space here (as of loompy v3). As it stands, this means that we are now able to store the following in the Metadata JSON:

We are also in the process of adding support for collaborative annotation to SCope, which would extend this schema to include:

This annotation will be supported by the EBI's Ontology Lookup Service API, embedding a widget which allows searching for an ontologized term (required for the annotation).

We are very open to suggestions on how to modify or extend this schema to make it as applicable and usable as possible for the rest of the single cell community, including the HCA. We have floated ideas about also storing analysis-level metadata, including the software/pipelines and versions used in the analysis. We think there are a lot of opportunities to store information here that would be extremely useful to have alongside the data and analysis.

The one other issue we currently have is that we store some non-standard data types in the loom file (such as named 2D arrays as attributes). We are planning on moving away from this and back to the standard loom format, further developing the same JSON metadata object to store links between numeric indexes and the names of columns/rows of these matrices.

For now, the developers of ASAP (who already provide an interface for ingesting looms from the HCA) are working on incorporating this metadata for their own use, which would eventually allow bidirectional transfer of loom files (with both data and rich metadata) with SCope, and we at the Fly Cell Atlas are adopting this schema (and SCope) for storing and visualising our datasets.

We have a JSON schema and some test data on our GitHub page which you could check out.

Let us know what you think and if you think we could come together to form a real standard usable for the entire single cell community.

lauraclarke commented 4 years ago

This sounds like a great plan to investigate in the future. If adding all that info using standard loom metadata practices is difficult now I would advocate for this release to use the ontology labels and keep the label to ontology term mapping (maybe release it as an ancillary file) but we are unlikely to have time to investigate something more complex right now.
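As a rough sketch of the ancillary-file idea, the label-to-ontology-term mapping could be written as a small tab-separated file. The column names, the helper `write_mapping` and the two example rows are assumptions for illustration, not an agreed release format.

```python
import csv
import io

# Illustrative mapping only; the real labels would come from each project.
label_to_term = {"T cell": "CL:0000084", "B cell": "CL:0000236"}


def write_mapping(mapping, handle):
    """Write label -> ontology term pairs as TSV, sorted by label."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["label", "ontology_term"])  # hypothetical header names
    for label, term in sorted(mapping.items()):
        writer.writerow([label, term])


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_mapping(label_to_term, buffer)
mapping_tsv = buffer.getvalue()
```

Keeping the mapping outside the loom files means the looms only need the plain labels, while the ontology terms can be corrected later without touching the matrices.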

KrisDavie commented 4 years ago

Currently, the metadata that we store is completely within the standard loom conventions. Loompy v3 increases the amount of data that can be stored, but v2 is perfectly able to store this too (with a 16 kB limit).

The extra information in the loom that we store specific to SCope is outside of the standard format at the moment, but is completely separate from the metadata.

For reference, we store this JSON string under attrs with the key 'MetaData'.
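A minimal sketch of that storage pattern, assuming a dict-like `attrs` object. A plain dict stands in for the global attributes of an open loom file here, the metadata fields are hypothetical rather than the actual SCope schema, and the byte limit is one reading of the 16 kB figure mentioned above.

```python
import json

V2_LIMIT_BYTES = 16 * 1024  # assumption: "16 kB" read as 16 * 1024 bytes


def embed_metadata(attrs, metadata, limit=None):
    """Serialize metadata to JSON and store it under the 'MetaData' key."""
    payload = json.dumps(metadata)
    if limit is not None and len(payload.encode("utf-8")) > limit:
        raise ValueError("serialized metadata exceeds the attribute size limit")
    attrs["MetaData"] = payload
    return payload


def read_metadata(attrs):
    """Parse the JSON string stored under 'MetaData' back into Python objects."""
    return json.loads(attrs["MetaData"])


# Hypothetical content; field names are illustrative, not the SCope schema.
metadata = {
    "annotations": [
        {"cluster_id": 0, "cell_type": "T cell", "ontology_term": "CL:0000084"}
    ]
}

file_attrs = {}  # stand-in for the global attributes of a loom file
embed_metadata(file_attrs, metadata, limit=V2_LIMIT_BYTES)
restored = read_metadata(file_attrs)
```

Because the whole object travels as a single string attribute, any tool that can read global attributes can recover it with one JSON parse.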

lauraclarke commented 4 years ago

@KrisDavie For context, we are producing a data release on a very tight timeframe (6-8 weeks), so for now I would like to limit the addition of new steps that might cause unexpected delays.

In the longer term, if we can work together to drive this sort of standardization in the community, that would be a great idea.

KrisDavie commented 4 years ago

I completely understand, don't worry, I just wanted to make sure that I was clear.

We are all for driving this kind of standardisation, so would love to keep in contact as things develop and as both Atlases move forwards!

ESapenaVentura commented 4 years ago

All loom files have been updated with the latest metadata.

I will update the spreadsheet with the locations when possible, but they are in the indicated S3 buckets, in folders whose names end in _update.

ESapenaVentura commented 4 years ago

The only thing left is to communicate with the Broad team.

zperova commented 4 years ago

Closing as this is done. Thank you @ESapenaVentura