HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

targets.json should be replaced by probes.json #813

Closed hewgreen closed 5 years ago

hewgreen commented 5 years ago

What is the proposed name of the new schema?

  1. I suggest changing the name of the module target.json to probe.json.
  2. Rename 'reagent_name' to 'probe_label'; this field should be required.
  3. Rename 'molecule_id' to 'target_molecule_label' and add multiple ontologies to support ChEBI plus proteins and gene products. An ontology code is currently suggested, but this is not an ontology field.
  4. Remove 'molecule_name'.
  5. 'channel_id' should become 'channel_label'.
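The renames proposed above could be applied to an existing record roughly as follows. This is only a sketch: `migrate_target_to_probe` and the sample record are hypothetical and not part of the metadata-schema repo; only the quoted field names come from this issue.

```python
# Hypothetical migration helper illustrating points 2-5 of the proposal.
RENAMES = {
    "reagent_name": "probe_label",
    "molecule_id": "target_molecule_label",
    "channel_id": "channel_label",
}
REMOVED = {"molecule_name"}


def migrate_target_to_probe(record):
    """Return a copy of a target.json-style record using the proposed probe.json names."""
    migrated = {}
    for key, value in record.items():
        if key in REMOVED:
            continue  # point 4: 'molecule_name' is dropped
        migrated[RENAMES.get(key, key)] = value
    if not migrated.get("probe_label"):
        # point 2: 'probe_label' should be required
        raise ValueError("probe_label is required")
    return migrated


# Made-up sample values, for illustration only.
old = {
    "reagent_name": "cFos_probe1",
    "molecule_id": "cFos",
    "molecule_name": "FBJ osteosarcoma oncogene",
    "channel_id": "1",
}
new = migrate_target_to_probe(old)
```

Fields that are not mentioned in the proposal are copied through unchanged, so the helper would not silently drop anything beyond 'molecule_name'.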

Why is the new schema needed?

1. There are often multiple probes per target, so the granularity of the entity here should be the probe rather than the target. Users have naturally added multiple targets per row.

2-5. '_id', when used as a primary key, will be replaced with '_label' in other places in the schema. The primary key will no longer be 'molecule_id' but 'probe_label'.

hewgreen commented 5 years ago

NB this may be a good opportunity to nest this array in the imaging_protocol.json rather than having this as a module. Waiting for decisions on this from @simonjupp and @diekhans

zperova commented 5 years ago

@hewgreen I have further comments/questions.

First, I think we should leave the name of the schema as is, because the information we are collecting is about the target; the probe is the implementation in this case.

Second, I propose to have the following fields:

  target label: cFos_ex1
  target name: FBJ osteosarcoma oncogene
  target specification: part of exon 1 of cFos
  NCBI accession (or Ensembl): NM_010234.2
  target subcellular structure: nuclear
  probe label: cFos_probe1
  probe sequence: GATGTTCTCGGGTTTCAACGCCGACTACG
  Manufacturer (or in house): in house
  assay type: MERFISH
  channel label: 1
  fluorophore (if non-multiplexed):
  multiplexed?: yes

What do you think?

hewgreen commented 5 years ago

Your first point is not true, as per point 1. As these experiments use multiple probes per target, contributors hacked the sheet to reflect their assay. They were forced to provide probes as entities rather than targets. When the probe:target mapping is 1:1, either name works, but since the majority of cases will have multiple probes per target, the rename is required. One entity per row is consistent with the rest of the schema. (A second implementation would be to have another nested array so that we can provide lists of probes per target, but then we are nesting too deeply.)
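To illustrate the one-probe-per-row argument, here is a small sketch in which each row is a probe that names its target, and the probes-per-target view is derived by grouping rather than by nesting another array. The rows and the `probes_by_target` helper are hypothetical; apart from cFos_probe1 and cFos_ex1, which appear earlier in this thread, the values are invented.

```python
from collections import defaultdict

# Illustrative rows: one probe per row, each naming its target.
probe_rows = [
    {"probe_label": "cFos_probe1", "target_molecule_label": "cFos_ex1"},
    {"probe_label": "cFos_probe2", "target_molecule_label": "cFos_ex1"},
    {"probe_label": "Actb_probe1", "target_molecule_label": "Actb_ex2"},
]


def probes_by_target(rows):
    """Group probe labels under the target each probe binds."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["target_molecule_label"]].append(row["probe_label"])
    return dict(grouped)


grouped = probes_by_target(probe_rows)
```

With the probe as the row entity, a many-to-one probe:target mapping needs no extra nesting level, and a 1:1 mapping is simply the case where every group has one member.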

Regarding your list of attributes I don't see much difference. Although I wonder if we could combine target subcellular structure, target label, target name and target specification to a single ontology field. This would add a tremendous amount of clarity to the metadata. I think hiding the flexibility and complexity is a good idea here.

zperova commented 5 years ago

We want target to be searchable so it needs to be indexed, however, we do not want the probe sequence to be indexed. We can make a target label more specific to incorporate target name and target specification - sure. But it is possible to have a different subcellular structure for the same target - so I don't see a way of combining these two.

hewgreen commented 5 years ago

Indexing is not a problem here. This is dictated by other components based on their requirements rather than the metadata. We can add multiple valid ontology terms to one ontology field.

zperova commented 5 years ago

To summarize the discussion and the decision we have arrived at: the module should be renamed probe.json and should contain the following fields:

  probe_label
  probe_sequence (includes only the sequence corresponding to the target, no adapters)
  target name in the codebook
  target ontology text + rest of the fields (this will include subcellular localization, either in the ontology itself or by adding another ontology)
  reagent module (for now, to be consistent with the rest of the schema; might reconsider later on)
  Assay type (required)
  fluorophore
  channel_label (the same as in the channel module)
  codeword (= in which channel you expect to see a probe over sequencing cycles; this field is user-requested, and there is a discussion about whether the name is the most appropriate, but it will be added as is for now, until further feedback is received from other contributors)
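As a rough sketch of how the required fields agreed in this thread might be checked on a probe entry: the `missing_required` helper and the snake_case spelling `assay_type` are assumptions, not names taken from the schema. 'probe_label' is treated as required because the original proposal asks for it and names it as the primary key.

```python
# Required fields as stated in this thread; "assay_type" is an assumed
# snake_case spelling of "Assay type".
REQUIRED_FIELDS = ("probe_label", "assay_type")


def missing_required(probe):
    """Return the required field names that are absent or empty in a probe entry."""
    return [name for name in REQUIRED_FIELDS if not probe.get(name)]


# Example entry built from the values quoted earlier in this thread.
example_probe = {
    "probe_label": "cFos_probe1",
    "probe_sequence": "GATGTTCTCGGGTTTCAACGCCGACTACG",
    "assay_type": "MERFISH",
    "channel_label": "1",
}

problems = missing_required(example_probe)
```

In the schema itself this constraint would more naturally be expressed declaratively, so the function above is only meant to make the intended rule concrete.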

zperova commented 5 years ago

After discussion with @daniwelter, a separate field for the target's subcellular structure was kept.

hewgreen commented 5 years ago

Can we have the discussion here? Two fields make it confusing because we only want one or the other filled in.