Store link from CAS JSON to matrix file

dosumis commented 8 months ago

This was a critical part of the original specification - CAS JSON files must link to the data they annotate - but somehow got lost prior to appearing on the open CAS. The specification must not be tied to a specific resource. In some cases, CellXGene is an obvious source. CAP may choose to use its own hosting. BICAN may need some way to point to AWS-hosted files.

Option 1. URL

Proposed field: dataset_link: "A URL that resolves to a cell by gene matrix file containing expression data for all of the cells annotated in this file and no additional cells." # Do we specify AnnData format here?

The danger of this is that (a) the URL may not be reliably persistent and (b) it does not work for cases where retrieval needs to be via an API.

Option 2. A CURIE (or at least a CURIE-style ID)

Proposed field: dataset_id: "A namespaced ID that can be used to download a cell by gene matrix file containing expression data for all of the cells annotated in this file and no additional cells."

This will require an accompanying document with details of how to resolve the ID to a URL or use it in an API call. The former can be specified via a JSON-LD context. @hkir-dev do you know if it is possible to store details of API resolution as text in a JSON-LD context doc?
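To make the URL-resolution case concrete, here is a minimal sketch of expanding a CURIE-style ID with a prefix map of the kind a JSON-LD context provides. The prefix name `cxg_collection` and the function `expand_curie` are assumptions made for illustration; neither is part of the CAS specification. The base IRI follows the CELLxGENE collection URL pattern quoted later in this thread.

```python
# Hypothetical prefix map, shaped like a JSON-LD context.
# The prefix name "cxg_collection" is an assumption, not part of CAS.
CONTEXT = {
    "@context": {
        "cxg_collection": "https://cellxgene.cziscience.com/collections/",
    }
}


def expand_curie(curie: str, context: dict) -> str:
    """Expand 'prefix:local_id' to a full URL using the prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"Not a CURIE-style ID: {curie!r}")
    prefixes = context.get("@context", {})
    if prefix not in prefixes:
        raise KeyError(f"Unknown prefix: {prefix!r}")
    # Simple concatenation: base IRI followed by the local ID.
    return prefixes[prefix] + local_id


url = expand_curie(
    "cxg_collection:283d65eb-dd53-496d-adb7-7570c7caa443", CONTEXT
)
```

Plain concatenation only covers the case where the ID is the last part of the URL. Retrieval through an API call would still need separate instructions.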

Example: cellxgene_census supports retrieval of dataset IDs

https://chanzuckerberg.github.io/cellxgene-census/notebooks/api_demo/census_datasets.html#Fetching-the-datasets-table

The API can be used to retrieve a dataset directly in the form of an AnnData file with CxG schema fields only, or the original dataset with author annotations, or a URL pointing to that original dataset.

TBD - which of these options should we support? Should we support both?

dosumis commented 8 months ago

@hkir-dev do you know if there is any standard way to store details of API resolution as text in a JSON-LD context doc?

hkir-dev commented 7 months ago

Some initial research results related to this topic:

Context keys should be simple strings and cannot contain a ":" [1]. PURLs should be represented with underscores in this case. Compact IRIs are supported as keys, but keys of the form of a compact IRI MUST NOT expand to an IRI other than the expansion of the key itself [2].

A context value can be a map composed of zero or more keys from @id, @reverse, @type, @language, @container, @context, @prefix, @propagate, or @protected. An expanded term definition (context value) SHOULD NOT contain any other keys [2]. So we cannot add our own custom keys to the context for API resolution; we should use the existing constructs.
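As an illustration of that constraint, a minimal sketch that flags keys in an expanded term definition that fall outside the list quoted above. The key set is copied from this comment rather than derived from a JSON-LD library, and the names `unexpected_keys` and `api_resolution` are hypothetical.

```python
# Keys permitted in an expanded term definition, as listed in the comment above.
ALLOWED_TERM_KEYS = frozenset({
    "@id", "@reverse", "@type", "@language", "@container",
    "@context", "@prefix", "@propagate", "@protected",
})


def unexpected_keys(term_definition: dict) -> set:
    """Return the keys of a term definition that are not in the allowed list."""
    return set(term_definition) - ALLOWED_TERM_KEYS


# A definition using only existing constructs passes the check.
ok = unexpected_keys({"@id": "https://example.org/datasets/", "@prefix": True})

# A custom key for API resolution (hypothetical name) is flagged.
bad = unexpected_keys({
    "@id": "https://example.org/datasets/",
    "api_resolution": "call the download endpoint with the dataset ID",
})
```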

The only way I can think of for API resolution is providing RESTful URLs. Similarly, Amazon S3 URL-style access [3] can be used to resolve remote files:

https://bucket-name.s3.region-code.amazonaws.com/key-name
example: https://DOC-EXAMPLE-BUCKET1.s3.us-west-2.amazonaws.com/puppy.png
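A minimal sketch of assembling a URL in this form from its parts, assuming a publicly readable object. The function name `s3_object_url` is hypothetical; the bucket, region, and key values are the documentation placeholders from the example above.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, region: str, key: str) -> str:
    """Build an S3 URL of the form https://bucket.s3.region.amazonaws.com/key."""
    # Percent-encode the key but keep "/" so nested keys stay readable.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


url = s3_object_url("DOC-EXAMPLE-BUCKET1", "us-west-2", "puppy.png")
```

Because the region is part of the host name, a resolver built this way needs the region alongside the bucket and key.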

If authentication is needed, there is no official way to do it via URL. But authenticated access to these files is probably against FAIRness anyway.

[1] https://www.w3.org/TR/json-ld11/#the-context
[2] https://www.w3.org/TR/json-ld11/#expanded-term-definition
[3] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html#path-style-url-ex

dosumis commented 7 months ago

Communication from Jason Hilton on CxG IDs:

There are two main entities in CELLxGENE:

- Dataset: a single submitted .h5ad that is then converted to a single visualization, Seurat object, etc.
- Collection: a ‘study’ that holds one or more Datasets, typically 1-to-1 with a Publication.

Every Dataset needs to belong to a Collection.

Each Collection or Dataset has an ID (uuid) which is established at the time that the object is created and will be stable. The act of Publishing does not change the ID. So for Collections, all we need is a Title, Description, and Contact to create, so we know those IDs/URLs early & they almost never change from creation-to-Publish (would probably only happen if we mistakenly created two Collections for a single study and needed to wipe one of them). Datasets, on the other hand, are commonly removed/added prior to Publication, so it’s more difficult to trust those IDs until Publish. Each entity also has a version ID (uuid) which will be updated with any post-Publish updates.

I know there are multiple APIs plus the Census API, so depending on which you’re using, I’m sure we can track which IDs are available in each and how to locate them.

Using the Siletti et al. submission as an example:

Collection URL: https://cellxgene.cziscience.com/collections/283d65eb-dd53-496d-adb7-7570c7caa443
collection_id: 283d65eb-dd53-496d-adb7-7570c7caa443

There are 138 Datasets in this Collection. Some are based on the cell type groupings and some on the tissue sampled, so there is observation overlap among them. The top one is “All neurons”:

Explore visualization URL: https://cellxgene.cziscience.com/e/8e10f1c4-8e98-41e5-b65f-8cd89a887122.cxg/
dataset_id: 8e10f1c4-8e98-41e5-b65f-8cd89a887122
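A minimal sketch of going from these IDs back to the URLs, assuming the two URL patterns seen in the example above hold in general. That is an assumption based on this one example, not a documented CELLxGENE guarantee, and the function names are hypothetical.

```python
import uuid


def _normalize(identifier: str) -> str:
    """Validate that the ID is a UUID and return its canonical string form."""
    return str(uuid.UUID(identifier))  # raises ValueError if not a UUID


def collection_url(collection_id: str) -> str:
    # Pattern taken from the Collection URL in the example above.
    return f"https://cellxgene.cziscience.com/collections/{_normalize(collection_id)}"


def explorer_url(dataset_id: str) -> str:
    # Pattern taken from the Explore visualization URL in the example above.
    return f"https://cellxgene.cziscience.com/e/{_normalize(dataset_id)}.cxg/"


c_url = collection_url("283d65eb-dd53-496d-adb7-7570c7caa443")
d_url = explorer_url("8e10f1c4-8e98-41e5-b65f-8cd89a887122")
```

The Explore URL wraps the ID in a prefix and a `.cxg/` suffix, so it cannot be produced by prefix expansion alone.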

dosumis commented 7 months ago

Proposal:

Store as a CURIE-like string
@hkir-dev to look further into whether we can have a single IRI, including region, that we can specify in the context
An additional field provides text and links for how to resolve.
TBD - how to support version IDs

dosumis commented 7 months ago

@lydiang

- Concerned that we might need to link to multiple AnnData files, e.g. the ABC atlas has 20 separate AnnData files. This may require moving the reference down to the cell set level.
- Also, the same cell can live in multiple files (we need this in a matrix store, e.g. CellXGene).
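One way to picture a cell-set-level reference is sketched below. The class `CellSet`, the field `dataset_ids`, and the example IDs are all hypothetical and not part of the CAS schema; the sketch only shows that each cell set could carry its own list of matrix references and that the same file could be referenced by more than one cell set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CellSet:
    """Hypothetical cell set carrying its own matrix file references."""
    label: str
    dataset_ids: List[str] = field(default_factory=list)


def all_dataset_ids(cell_sets: List[CellSet]) -> List[str]:
    """Collect the distinct dataset IDs across cell sets, in first-seen order."""
    seen = {}
    for cell_set in cell_sets:
        for dataset_id in cell_set.dataset_ids:
            seen.setdefault(dataset_id, None)
    return list(seen)


cell_sets = [
    CellSet("neurons", ["example:matrix_01", "example:matrix_02"]),
    CellSet("glia", ["example:matrix_02", "example:matrix_03"]),
]
files_needed = all_dataset_ids(cell_sets)
```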
