Overhaul imaging_protocol.json

HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA

Apache License 2.0

Overhaul imaging_protocol.json #368

Closed dosumis closed 5 years ago

dosumis commented 6 years ago

For which schema is a change/update being suggested?

type/protocol/imaging/imaging_protocol.json

What should the change/update be?

This is an umbrella ticket covering a number of changes including

Updates to field names
Updates to field / schema descriptions
New schemas modules
New schema fields
Refactoring to move some schema fields

Assumptions:

Imaging experiments involve the collection of images of a sample in one or more channels, each capturing different signals with information about different features of the sample. Those signals can come from a variety of different types of probes including: a fluorescent dye that marks nuclei (e.g. DAPI); an antibody against a marker of some cell feature (actin cytoskeleton; presynapse); an antisense fluorescent nucleic acid probe that detects transcripts (smFISH). More recently, techniques have been developed that massively increase the number of features that can be probed. These typically involve multiple channels and multiple rounds of probe hybridisation. Signals are then decoded using a codebook that translates signals from channels + hybridisation rounds to detected genes.

These different approaches may be combined. For example, the SpaceTx project uses a variety of spatial transcriptomics techniques, each combined with imaging of nuclei (using the fluorescent dye DAPI) and a secondary background stain (antibody based?) that detects some other aspect of subcellular anatomy.

Modelling strategy

Based on the assumptions outlined in the previous section, it seems safe to conclude that we need a modular schema. It should collect general microscopy information that covers all channels (e.g. imaging type: confocal microscopy) and allow this general information to be combined with information about one or more simple channels (e.g. a DAPI channel) plus a multi-channel assay (e.g. MERFISH), in each case recording key information about what is being detected and how.
We have provisionally decided to move fields pertaining to imaging sample prep into a new imaging_sample_prep_protocol.json. Discussion of this decision would be most welcome.

Why is the change requested?

This is a first attempt at a general schema for imaging data. It is informed by the needs of the SpaceTx project, but intended to provide a general framework for capturing imaging experiment metadata.

What new field(s) need to be changed/added?

Fields in the current imaging_protocol.json require review. Some will be kept, some will be moved down to modules, others will be moved to a new imaging sample prep schema.

Fields to keep:

field_microns; field_resolution ? (These will need expert review in future).

General fields:

description: ""
required: [microscopy_type]
properties:
    microscopy_type:  # replaces 'microscope' field, which is currently a simple enum.
        description: >
            The type of microscopy used for imaging (e.g. confocal; two photon;
            super-resolution)
        type: object
        $ref: module/ontology/microscopy_ontology.json  # TBA - will use imports from FBbi
    simple_channel_assays:
        type: array
        items: { $ref: module/imaging/single_channel_assay.json }
    multi_channel_assay:
        # 1. Struggling to think of name that copes with this group's allergy to the term
        #    'assay' as understood by biologists
        # 2. It seems likely that other multi-channel, multiplexed techniques will follow
        #    that are not STx so keeping this open.
        type: object
        oneOf: [ { $ref: module/imaging/spatial_transcriptomics.json } ]
    microscope_model:
        type: string
    objective:
        type: string
    # ... Many other possible general fields here.

module/imaging/single_channel_assay.json

description: "A module for capturing information about a single channel used to collect a single signal type."
required: [protocol_type]
properties:
    channel_identifier:
        description: >
            An identifier used for this channel within the experiment,
            e.g. 'red channel' or channel 1. This should align with the way channels are
            referenced in file_names or file manifests.
    protocol_type:
        description: >
            The type of protocol used to assay for the signal in this channel,
            e.g. smFISH; antibody stain; fluorescent dye. This field is required.
        type: object
        $ref:
    reagent:
        description: >
            The specific reagent used to detect signal in this channel.
            Examples include a fluorescent probe used in smFISH,
            a commercial antibody, or a commercial fluorescent dye.
        type: object
        $ref: module/process/purchased_reagents.json
    target_gene_product:
        description: "Gene product targeted by reagent."
        type: string  # Questions: Should this be an array? Should we allow for collection of gene/transcript/protein identifiers?
    target_structure:
        ontology: GO-subcellular component  # Allow for anatomy here?

module/imaging/spatial_transcriptomics.json

description: >
    A module for capturing a general description of a spatial transcriptomics
    experiment.
required: [spatial_transcriptomics_protocol_type, target_gene_list]
properties:
    spatial_transcriptomics_protocol_type:
        description: >
            The type of spatial transcriptomics protocol used, e.g. MERFISH, seqFISH etc.
        type: object
        $ref: module/ontology/STx_ontology.json  # TBA
    target_gene_list:
        type: array
        items: { type: string }  # Should we allow for collection of gene/transcript/protein identifiers?
    # Many possible additional fields dealing with hyb conditions, microfluidics etc.

imaging_sample_prep_protocol.json # Consider keeping this all under microscopy ?

description: >
    A schema for collecting information on how a sample was prepared and mounted
    for imaging.
required: []
properties:
    fixative:
    mounting_medium:
    embedding:
    clearing_agent:
    overview_image:
        description: >
            Pointer/link to an overview image of the imaged sample.
            Where multiple fields of view are collected during an assay,
            it can be useful to have a lower resolution overview image
            covering the whole sample. This field is for capturing a pointer/link
            to such an image.
        type: string
    structures_within_sample:
        description: >
            Anatomical structures/boundaries that are present within the sample.
            These are potentially useful for indexed search and as landmarks for use
            in image analysis (e.g. registration and segmentation).
        type: array
        items: { $ref: /modules/ontology/organ_part.json }

image_files.json

# TBA: Generic schema and / core stuff.
description: >
    Pointers to one or more image files or a data manifest that aggregates them.
required: []
properties:
    image_file_set:
        type: object
        $ref: /modules/ ... image_file_set.json
    data_manifest:
        type: string
        description: >
            A JSON file that organises a set of image files from a single imaging experiment.
        # SpaceTx produces one of these.
    pipeline_recipe:  # Does this belong here?
        type: string
        description: >
            A JSON file that describes an analysis pipeline for images.

image_file_set.json

# Stub. We need a way to aggregate relatively simple image file data from a single experiment. Perhaps just collecting channel name and filename/pointer for each?

Consider what we might take from https://github.com/HumanCellAtlas/metadata-schema/blob/master/json_schema/type/file/sequence_file.json

joshmoore commented 6 years ago

A few initial comments:

We have provisionally decided to move fields pertaining to imaging sample prep into a new imaging_sample_prep_protocol.json. Discussion of this decision would be most welcome.

:+1:

TBA - will use imports from FBbi

See https://github.com/EBISPOT/efo/issues/47#issuecomment-342826257

Perhaps just collecting channel name and filename/pointer for each?

I’m not sure I follow, but in the general case, I don’t think this will work.

   Pointers to one or more image files or a data manifest that aggregates them.

Do you have an example of similar data_manifests in other domains? I’ve tried modelling this for an existing dataset in the IDR with all files listed. The json clocks in at 500MB. The alternative is a one-file-per-assay representation a la https://github.com/IDR/idr-metadata/blob/master/idr0016-wawer-bioactivecompoundprofiling/screenA/idr0016-screenA-plates.tsv assuming it’s ok to delegate the calculation of the fileset to a library.

pipeline_recipe: # Does this belong here?

My suggestion would be no since there is little agreement on standardized pipelines.

I was hoping to represent the IDR & OME metadata in json as a comparison here, but my (HCA-)jsonschema-fu still needs work. If it’s still of use, I can provide an abstract representation of the metadata we’re tracking (perhaps a simplified representation of the docs).

dosumis commented 6 years ago

Hi Josh,

Thanks for the comments

The first data we need to support is SpaceTx. They have their own data_manifest and pipeline_recipe in JSON. We need hooks for these somewhere. Our schema needs to be able to abstract this to point to other file aggregation systems (OME/IDR) and (probably) some simple generic specification that we define.

Whatever we do needs to allow mapping of channel metadata onto the corresponding (aggregations of) files or collections of files as appropriate where we have a single hook to some aggregation of other files. The plan right now is for this to work on channel identifier keys. This is almost certainly brittle, but I don't really see an alternative - at least for cases where we link to organized aggregations of files covering multiple channels as in SpaceTx.
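
A minimal sketch of the channel-identifier join described above, to make the brittleness concrete. The manifest entry layout (`file`, `channel`) and the example file names are assumptions made for illustration; this is not the SpaceTx manifest format.

```python
def link_channels_to_files(channels, manifest_entries):
    """Group manifest entries under the channel whose identifier they carry.

    Matching is by exact string equality on the channel identifier, so any
    spelling difference leaves a file unlinked.
    """
    by_id = {ch["channel_identifier"]: {"channel": ch, "files": []} for ch in channels}
    unmatched = []
    for entry in manifest_entries:
        slot = by_id.get(entry["channel"])
        if slot is None:
            unmatched.append(entry["file"])
        else:
            slot["files"].append(entry["file"])
    return by_id, unmatched


# Hypothetical channel metadata and manifest entries.
channels = [
    {"channel_identifier": "channel 1", "protocol_type": "fluorescent dye"},
    {"channel_identifier": "channel 2", "protocol_type": "smFISH"},
]
manifest_entries = [
    {"file": "fov0_c1.tiff", "channel": "channel 1"},
    {"file": "fov0_c2.tiff", "channel": "channel 2"},
    {"file": "fov0_c3.tiff", "channel": "Channel 3"},  # no matching metadata
]

linked, unmatched = link_channels_to_files(channels, manifest_entries)
print(unmatched)  # ['fov0_c3.tiff']
```

Reporting the unmatched files, rather than dropping them silently, is one way to surface key mismatches at ingest time.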

I’ve tried modelling this for an existing dataset in the IDR with all files listed. The json clocks in at 500MB.

Presumably depends very much on:

the type of dataset - I don't think we're expecting massive multi-well imaging data - although I could be wrong
the nature of the image files, e.g. multiple channels in a single file vs one channel per file

ambrosejcarr commented 6 years ago

David, this looks really great! I've added some comments, but mostly I wanted to continue our conversation about single- vs. multi-channel experiments and whether they require separation.

Single- vs. multi-channel experiments

I think coded and non-coded image-based transcriptomics and proteomics experiments are best described using a unified framework that revolves around the coding strategy used. In spaceTx we define a codebook object that specifies how channels and imaging rounds map to targets, and I think that could be nicely generalized here.

The example below describes a simple experiment with two imaging channels and two hybridization rounds. While it's appropriate for spaceTx to call these hybridization rounds, for the purpose of generality I will refer to them as imaging rounds for the rest of this response.

codebook.json
[
  {
    "codeword": [
      { "h": 0, "c": 0, "v": 1 },
      { "h": 0, "c": 1, "v": 1 }
    ],
    "gene_name": "SCUBE2"
  },
  {
    "codeword": [
      { "h": 0, "c": 0, "v": 1 },
      { "h": 1, "c": 1, "v": 1 }
    ],
    "gene_name": "BRCA"
  }
]
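
As a rough illustration of how such a codebook maps signals to targets, the sketch below builds a lookup from the set of "on" (round, channel) positions to a gene name. It assumes `v` is a simple on/off flag and that a spot is decoded by exact match; the function names are made up for this example and this is not how starfish decodes.

```python
import json

CODEBOOK_JSON = """
[
  {"codeword": [{"h": 0, "c": 0, "v": 1}, {"h": 0, "c": 1, "v": 1}], "gene_name": "SCUBE2"},
  {"codeword": [{"h": 0, "c": 0, "v": 1}, {"h": 1, "c": 1, "v": 1}], "gene_name": "BRCA"}
]
"""


def build_lookup(codebook):
    """Map each codeword, as a set of 'on' (round, channel) pairs, to its gene name."""
    lookup = {}
    for entry in codebook:
        key = frozenset((bit["h"], bit["c"]) for bit in entry["codeword"] if bit["v"])
        lookup[key] = entry["gene_name"]
    return lookup


def decode(lookup, on_positions):
    """Return the gene whose codeword exactly matches the observed positions, else None."""
    return lookup.get(frozenset(on_positions))


lookup = build_lookup(json.loads(CODEBOOK_JSON))
print(decode(lookup, [(0, 0), (0, 1)]))  # SCUBE2
print(decode(lookup, [(0, 0), (1, 1)]))  # BRCA
print(decode(lookup, [(1, 0)]))          # None
```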

I recall that two chief concerns about using this approach to specify non-coded experiments were the increased complexity for simple experiments like smFISH and a lack of generality for non-transcriptomic approaches. I wasn't able to fully articulate this at the time, but I don't believe either of these is a significant impediment, and I explain why below.

Single-molecule FISH

Here's what the coding scheme might look like for an smFISH experiment.

codebook.json
[
  {
    "codeword": [
      { "h": 0, "c": 0, "v": 1 }
    ],
    "gene_name": "SCUBE2"
  },
  {
    "codeword": [
      { "h": 0, "c": 1, "v": 1 }
    ],
    "gene_name": "BRCA"
  }
]

I would argue that the concept of an imaging round is a comfortable one for an experimentalist doing an smFISH experiment. If need be we could relax the requirements on the codebook such that when omitted, h = 0 and v = 1. Then the codebook becomes:

codebook.json
[
  {
    "codeword": [
      { "c": 0 }
    ],
    "gene_name": "SCUBE2"
  },
  {
    "codeword": [
      { "c": 1 }
    ],
    "gene_name": "BRCA"
  }
]
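
A small sketch of the relaxed form, assuming a reader simply fills in `h = 0` and `v = 1` wherever they are omitted. The helper names are hypothetical.

```python
def normalise_codeword(codeword):
    """Fill in the proposed defaults: h = 0 and v = 1 when omitted."""
    return [{"h": bit.get("h", 0), "c": bit["c"], "v": bit.get("v", 1)} for bit in codeword]


def normalise_codebook(codebook):
    """Return a copy of the codebook with every codeword written out in full."""
    return [{**entry, "codeword": normalise_codeword(entry["codeword"])} for entry in codebook]


short_form = [
    {"codeword": [{"c": 0}], "gene_name": "SCUBE2"},
    {"codeword": [{"c": 1}], "gene_name": "BRCA"},
]
full_form = normalise_codebook(short_form)
print(full_form[1])  # {'codeword': [{'h': 0, 'c': 1, 'v': 1}], 'gene_name': 'BRCA'}
```

With this normalisation the short smFISH codebook and the fully written one are equivalent, so downstream code only has to handle one shape.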

Generalizing to other imaging targets

To generalize this concept to any other strategy, all we would need to do is exchange gene_name for a more general term like target, and now it works for nuclear stains, the actin cytoskeleton, or proteomics experiments.

Other comments

single_channel_assay.properties

This field is great, and should be recorded for every channel in either smFISH, multiplex or non-transcriptomic approaches -- we always want to know what the channels are.

pipeline_recipe:  # Does this belong here?
    type: string
    description: >
        A JSON file that describes an analysis pipeline for images.

pipeline_recipe is pretty uncertain right now -- we have to figure out a lot of things on the secondary analysis side before we'll know if it's necessary. If it is, then we will probably want to include it. I would omit this for now but keep the idea around, as users will not be able to fill this value.

We need a way to aggregate relatively simple image file data from a single experiment. Perhaps just collecting channel name and filename/pointer for each?

If I'm understanding that you need a way to map images to their channels and hybridization rounds, you can extract this from the spaceTx manifest that the user will provide. See experiment.json and hybridization.json

hewgreen commented 6 years ago

I was thinking along the same lines as Ambrose regarding single vs multi channel. Inherently an array of one or more channels seems clear, with single_channel_assay.json nested into a channels field within imaging_protocol.json. If a channel is defined as the same frame, with differing targets and microscope settings, most microscopy is multichannel (multi target). It aids our reusability to think in these general terms if possible.

However, for smFISH multichannel takes on a different definition than the one I provided above, with the added complexity of sample manipulation between capture, hybridisation in this case. This adds the concept of 'round' to further granulate channels. If we accept that the codebook object completely captures the required metadata at this more granular hybridisation level, this object could be provided as an attribute in the single_channel_assay.json to cover that detail.

{
    "title": "imaging_protocol",
    "type": "object",
    "properties": {
        "channel": {
            "type": "array",
            "items": {
                "$ref": "single_channel_assay.json"
            }
        }
        # other attributes here relating to the whole imaging protocol
    }
}

{
    "title": "single_channel_assay",
    "type": "object",
    "properties": {
        "codebook": {
            "type": "string"
        }
        # other attributes here relating to a single channel
    }
}

(NB. Instead of a string we would actually now take the codebook as a supplementary file object and point it to a specific protocol or entity. So the above isn't quite right but demonstrates the point.)

This was my working model but I see three road bumps with this approach:

Firstly, a minor issue, the name single_channel_assay.json may be misleading if we have one of these 'channel objects' describing a codebook when the codebook itself describes more than one channel (two in the example Ambrose gave).
If a channel doesn't have or is not part of a codebook in cases where contributors are unfamiliar with it, the fields in the codebook, h, v and gene name would need promoting to attributes at the single_channel_assay.json level. Clearly duplicated fields or the same information located in different places depending on presence of a codebook should be avoided.
If we were to replace single_channel_assay.json with a codebook object (as Ambrose may be suggesting), we would need to be sure that the four attributes, h, c, v and gene name are the only attributes required to describe a channel for the imaging experiments we ingest. Even if these attributes are all that is required, this would tie HCA schema to the codebook format. If we had a strong use case to add/remove/edit one of these fields would we be able to? If we were, the codebook format would diverge and starfish analysis would break.

If the starfish analysis pipeline requires the codebook as input, there is a good reason not to edit it and break analysis pipelines just to cover other HCA metadata requirements. In this sense, the codebook is an input file format, not metadata. As such, if the downstream user does not require the data therein to be available directly via the HCA metadata, we should not extract the content as metadata. I would guess that as the codebook provides this information in a very usable format, extracting it has little added value. Relying on the codebook as a metadata standard for an attribute as fundamentally important as a target would not be great and would hinder flexibility.

Despite these issues I think the above is very close to a solution. Without further use cases, I think we should create the single_channel_assay.json only when required, to describe a single channel (not smFISH). For me the smFISH does not fit neatly into the channel way of thinking without fully extracting the codebook (which seems daft). If smFISH specific fields are required we should make an smFISH module that we can bolt onto the imaging_protocol.json with the extra fields. The codebook format should not be adapted for our metadata purposes but should be linked to imaging_protocol.json as a supplementary_file object. This way we can move forward with the codebook and a flexible imaging schema but we sacrifice the ability to model metadata at the most granular level, a single channel of a single hybridisation round because we are allowing the codebook to represent this granularity. I think this is a worthy sacrifice to save overhead for now.

Other comments

A general comment on the codebook format: 'gene name' seems to be a poor attribute. I would think this could be more precise and use a standardised transcript identifier to better represent the real target, which isn't a gene. This can be used to extrapolate a gene name, which would be required downstream.
A key part of the puzzle I'm missing is the output format. Is the codebook interpreted in any way or does it remain the only source to look up gene names once analysis has run? I need to run the pipeline myself to look into this.

ambrosejcarr commented 6 years ago

Firstly, a minor issue, the name single_channel_assay.json may be misleading if we have one of these 'channel objects' describing a codebook when the codebook itself describes more than one channel (two in the example Ambrose gave).

👍 We view the codebook as describing how a user would extract gene information from a set of one or more imaging rounds, and one or more color channels. As you've identified, it summarizes an experiment, rather than an individual channel.

If a channel doesn't have or is not part of a codebook in cases where contributors are unfamiliar with it, the fields in the codebook, h, v and gene name would need promoting to attributes at the single_channel_assay.json level.

I don't understand what's being said here. In the case of an image-based transcriptomics experiment, the input data would fail validation without a codebook. Are you talking about the more general case?

If we were to replace single_channel_assay.json with a codebook object (as Ambrose may be suggesting), we would need to be sure that the four attributes, h, c, v and gene name are the only attributes required to describe a channel for the imaging experiments we ingest. Even if these attributes are all that is required, this would tie HCA schema to the codebook format. If we had a strong use case to add/remove/edit one of these fields would we be able to? If we were, the codebook format would diverge and starfish analysis would break.

I think our longer term vision is to have these formats become community owned. There should be a path for the HCA to suggest changes to the format. I think there are two questions here:

Should there be a separation between single-channel and multi-channel experiments? I think the answer is no, and it sounds like maybe you agree.
Is the Codebook the right representation for both image-based transcriptomics and the HCA to represent the mapping between an imaging experiment that may or may not be multiplexed and the targets of the assay (which may or may not be genes)? We're pretty confident the answer is yes for image-based transcriptomics (and proteomics), but we're open to discussion here. If minor tweaks are needed for it to generalize, I could see us making them. We like reuse and simplicity.

If the starfish analysis pipeline requires the codebook as input, there is a good reason not to edit it and break analysis pipelines just to cover other HCA metadata requirements. In this sense, the codebook is an input file format, not metadata

We see it as an input file format. I think you should feel free to extract from it to support the schema if that's what you decide. Conversely, if you end up making suggestions that help generalize it that don't make it overly complicated, we'd love to hear your ideas.

Adding in @dganguli and @berl since they might have some opinions on how we want to tackle the idea of shared governance of the spaceTx formats.

ambrosejcarr commented 6 years ago

I'm not sure I totally understand this either:

However, for smFISH multichannel takes on a different definition than the one I provided above, with the added complexity of sample manipulation between capture, hybridisation in this case.

Are you using smFISH to describe both multiplex and non-multiplex assay types? (Some of the jargon is overloaded, sorry!). Our hope with a codebook is that it describes how you compose the channels and imaging rounds to extract gene information. It would augment the array of channels that describe the characteristics of the channel.

So, you have:

Imaging channels (array) : information on channel characteristics (reagent, fluorophore, etc)
Imaging rounds (array) : information on imaging rounds (make this optional, only needed in cases where the experiment is multiplex)
Codebook (array) : composition of imaging channels and hybridization rounds to identify genes (doesn't contain information on imaging rounds in the single-plex case, because an empty h value is inferred to be zero -- this is the final example in my original reply above).
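
A minimal sketch of how the three arrays could be cross-checked, assuming `c` and `h` are zero-based indices into the channel and round arrays and that an empty round array means a single implicit round 0. The field names on the channel records are hypothetical.

```python
def check_codebook_indices(channels, rounds, codebook):
    """Return a list of problems where a codeword points outside the channel or round arrays."""
    n_rounds = max(len(rounds), 1)  # no rounds listed means a single implicit round 0
    problems = []
    for entry in codebook:
        for bit in entry["codeword"]:
            c, h = bit["c"], bit.get("h", 0)
            if not 0 <= c < len(channels):
                problems.append(f"{entry['gene_name']}: channel index {c} out of range")
            if not 0 <= h < n_rounds:
                problems.append(f"{entry['gene_name']}: round index {h} out of range")
    return problems


# Hypothetical single-plex experiment: two channels, no explicit rounds.
channels = [
    {"name": "488", "reagent": "probe A"},
    {"name": "561", "reagent": "probe B"},
]
rounds = []
codebook = [
    {"codeword": [{"c": 0}], "gene_name": "SCUBE2"},
    {"codeword": [{"c": 2}], "gene_name": "BRCA"},  # deliberately wrong index
]

problems = check_codebook_indices(channels, rounds, codebook)
print(problems)  # ['BRCA: channel index 2 out of range']
```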

Really interesting thoughts overall, Matt! Thanks for the really comprehensive reply. 😄

hewgreen commented 6 years ago

Sorry this is all so long winded by the way. Unavoidable I suppose.

I don't understand what's being said here. In the case of an image-based transcriptomics experiment, the input data would fail validation without a codebook. Are you talking about the more general case?

Yes a general case.

Should there be a separation between single-channel and multi-channel experiments? I think the answer is no, and it sounds like maybe you agree.

I do agree, ideally we wouldn't have a separation. I think most experiments are multichannel anyway (by one definition) but the hybridisation multiplexing complicates things a bit. If we can get to some multi-channel specific fields that are required they can be in a side bolt on metadata module which would be a nice way to handle this.

Is the Codebook the right representation for both image-based transcriptomics and the HCA to represent the mapping between an imaging experiment that may or may not be multiplexed and the targets of the assay (which may or may not be genes)? We're pretty confident the answer is yes for image-based transcriptomics (and proteomics), but we're open to discussion here. If minor tweaks are needed for it to generalize, I could see us making them. We like reuse and simplicity.

Tweaks wise I'd be thinking gene_name to molecular_target (or similar) is a good idea anyway if you want to generalise. Gene is a very strange target for transcriptomics in humans. I'd also think about validation of this important field too and control it. Identifiers are your friend, especially ones resolvable through https://identifiers.org. Laura will have some great pointers on which databases are well maintained. The right identifier could theoretically be extrapolated out to an ontology term to find out what type of molecule it is if you didn't want to capture molecule type additionally as an enum but the latter would be easier.

If we need to index these attributes they would need to live in the HCA schema. To do this we can automatically map from the codebook into our schema. We'd add a few things like description, user friendly names and type which you don't need in the codebook. I think target is so important downstream, we need to index it. But if we don't need this information indexed, we can just take the file but then we'd need to be wary of duplicating information in the file in an ad hoc way.

Are you using smFISH to describe both multiplex and non-multiplex assay types?

Sorry, I'll state what I mean more specifically in future. Correct me where I'm wrong. Even with no rounds of hybridisation, as I understand smFISH can be multichannel i.e. have multiple probes and require multiple laser/filter sets to gather the image. Multiplexed would always imply that hybridisation rounds were performed. SpaceTx is mainly for MERFISH, which is multiplexed smFISH.

The point I was making here is that we can't strictly think in terms of channels because of the multiplexing. Maybe we could think about single channels and multiplexed channels in one channels.json. Maybe this is what Dave is trying to get at in the first place. Then, if channel_type is "multiplexed", we could also require the codebook. For these 'pseudo channels' where channel_type is "multiplexed", we could conditionally check that the target was not singular and extrapolate a list from the codebook for indexing, essentially mapping the important bits of the codebook into HCA metadata. I think this is the way forward.
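
A rough sketch of that conditional rule, assuming a channel record carries either a single `target` or, when `channel_type` is "multiplexed", an embedded `codebook`. The field names other than `channel_type` and the example values are assumptions for illustration only.

```python
def validate_channel(channel):
    """Apply the conditional rule: multiplexed channels need a codebook, others need a target."""
    errors = []
    if channel.get("channel_type") == "multiplexed":
        if not channel.get("codebook"):
            errors.append("channel_type 'multiplexed' requires a codebook")
        if "target" in channel:
            errors.append("a multiplexed channel should not declare a single target")
    elif "target" not in channel:
        errors.append("a single channel requires a target")
    return errors


def targets_for_indexing(channel):
    """Single channels index their own target; multiplexed ones take the list from the codebook."""
    if channel.get("channel_type") == "multiplexed":
        return sorted({entry["gene_name"] for entry in channel.get("codebook", [])})
    return [channel["target"]]


dapi = {"channel_type": "single", "target": "nucleus"}
multiplexed = {
    "channel_type": "multiplexed",
    "codebook": [
        {"codeword": [{"h": 0, "c": 0, "v": 1}, {"h": 0, "c": 1, "v": 1}], "gene_name": "SCUBE2"},
        {"codeword": [{"h": 0, "c": 0, "v": 1}, {"h": 1, "c": 1, "v": 1}], "gene_name": "BRCA"},
    ],
}
broken = {"channel_type": "multiplexed"}

print(validate_channel(broken))           # ["channel_type 'multiplexed' requires a codebook"]
print(targets_for_indexing(multiplexed))  # ['BRCA', 'SCUBE2']
```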

berl commented 6 years ago

great discussion so far.

For starfish, the codebook just needs to keep track of the relationship between image files and genes, but I think the codebook is a great way to efficiently represent the information we need and it's easily generalized for broader HCA use by adopting the target instead of gene and round instead of hybridization.
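
The renaming suggested above is mechanical. A minimal sketch, assuming the new keys would be `target` and `r` (the exact names have not been agreed in this thread):

```python
# Assumed key names; the thread only agrees on the idea of generalising them.
RENAMES = {"gene_name": "target", "h": "r"}


def rename_keys(obj):
    """Recursively rename codebook keys, leaving values and all other keys untouched."""
    if isinstance(obj, dict):
        return {RENAMES.get(key, key): rename_keys(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [rename_keys(item) for item in obj]
    return obj


codebook = [{"codeword": [{"h": 0, "c": 1, "v": 1}], "gene_name": "BRCA"}]
generalised = rename_keys(codebook)
print(generalised)  # [{'codeword': [{'r': 0, 'c': 1, 'v': 1}], 'target': 'BRCA'}]
```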

Additionally, the imaging metadata also needs to associate image collection information and sample prep information with image files. The codebook could in principle do this, but it would need to be significantly expanded to accommodate image collection information inside it. In particular, the fluorescence channels used should be associated with an entire imaging protocol, but we don't want to track that all the way down into the codebook.

The point I was making here is that we can't strictly think in terms of channels because of the multiplexing. Maybe we could think about single channels and multiplexed channels in one channels.json.

Would it help to treat the codebook as the mapping between explicitly listed targets and acquisition settings (channels)?

painful pseudojson below, likely inscrutable:


properties: {
    targets: [<list, e.g. flattened from codebook, indexed, searchable, etc>],
    acquisition_info: {<acquisition info common across all channels, e.g. microscope objective>},
    acquisition_channels: [
        {
            "index": 0,
            "name": "488",
            "microscopy_settings": {<channel-specific info>}
        },
        {
            "index": 1,
            "name": "561",
            "microscopy_settings": {<channel-specific info>}
        },
        ...
        {}
    ],
    codebook: {
        target1: [r,c,v],
        target2: [r,c,v]
    }
}
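
One way to read the pseudo-JSON above as concrete data, sketched under the assumption that each codebook entry maps a target to a list of `[r, c, v]` triples and that `targets` is derived from the codebook keys. The objective value and empty settings are placeholders, not real acquisition metadata.

```python
import json


def build_imaging_record(acquisition_info, acquisition_channels, codebook):
    """Assemble the record, flattening the codebook keys into a searchable target list."""
    return {
        "targets": sorted(codebook),
        "acquisition_info": acquisition_info,
        "acquisition_channels": acquisition_channels,
        "codebook": codebook,
    }


# Placeholder values for illustration only.
codebook = {
    "SCUBE2": [[0, 0, 1], [0, 1, 1]],
    "BRCA": [[0, 0, 1], [1, 1, 1]],
}
record = build_imaging_record(
    acquisition_info={"objective": "placeholder objective"},
    acquisition_channels=[
        {"index": 0, "name": "488", "microscopy_settings": {}},
        {"index": 1, "name": "561", "microscopy_settings": {}},
    ],
    codebook=codebook,
)
print(json.dumps(record["targets"]))  # ["BRCA", "SCUBE2"]
```

Keeping `targets` derived rather than hand-entered avoids the list and the codebook drifting apart.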

ambrosejcarr commented 6 years ago

If we can get to some multi-channel specific fields that are required they can be in a side bolt on metadata module which would be a nice way to handle this.

👍

We'd add a few things like description, user friendly names and type which you don't need in the codebook

Could you describe what these are? I want to make sure we really don't have them (for example, we do have some type control over what these can refer to. c is an integer that indexes into the list of channels to identify which channel is used). If we don't have them, I want to make sure we don't need them.

Tweaks wise I'd be thinking gene_name to molecular_target (or similar) is a good idea anyway if you want to generalise.

I'll drop an issue into our backlog and make sure others agree, but I think we can make this change 👍

I'd also think about validation of this important field too and control it. Identifiers are your friend, especially ones resolvable through https://identifiers.org. Laura will have some great pointers on which databases are well maintained. The right identifier could theoretically be extrapolated out to an ontology term to find out what type of molecule it is if you didn't want to capture molecule type additionally as an enum but the latter would be easier.

This is a cool idea, I'll drop a ticket in our backlog to look into it.

Sorry, I'll state what I mean more specifically in future. Correct me where I'm wrong. Even with no rounds of hybridisation, as I understand smFISH can be multichannel i.e. have multiple probes and require multiple laser/filter sets to gather the image. Multiplexed would always imply that hybridisation rounds were performed. SpaceTx is mainly for MERFISH, which is multiplexed smFISH.

The point I was making here is that we can't strictly think in terms of channels because of the multiplexing.

I'm not a microscopy expert so I'm probably missing something obvious here.

From the disagreement here, my best guess is that there is a question about whether a single channel will map directly to a specific molecular_target (my assumption above), or if there needs to be support for combinations of channels that map to specific molecular targets. I think a concrete example might clear up my confusion. Sorry!

berl commented 6 years ago

The point I was making here is that we can't strictly think in terms of channels because of the multiplexing.

Yeah, this is why I think the channels (and their associated imaging metadata) should be listed in their own property, and the targets should also be listed in their own property. If the targets and channels are each their own fields and the codebook stores the correspondence, I don't think there's a need for channel_type : "multiplexed". The smFISH example is just a diagonal codebook with round=0 (or absent) and one target for each channel.
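
To make the "diagonal codebook" idea concrete, here is a minimal sketch that builds one from an ordered list of targets, one per channel, and checks whether a given codebook has that shape. It assumes the target-to-`[r, c, v]`-triples layout from the pseudo-JSON earlier in this comment thread; the function names are hypothetical.

```python
def diagonal_codebook(targets_by_channel):
    """One target per channel, all in round 0: the smFISH case."""
    return {target: [[0, c, 1]] for c, target in enumerate(targets_by_channel)}


def is_diagonal(codebook):
    """True if every target uses exactly one channel in round 0 and no channel is shared."""
    channels = []
    for triples in codebook.values():
        if len(triples) != 1:
            return False
        r, c, _v = triples[0]
        if r != 0:
            return False
        channels.append(c)
    return len(channels) == len(set(channels))


smfish = diagonal_codebook(["SCUBE2", "BRCA"])
print(smfish)               # {'SCUBE2': [[0, 0, 1]], 'BRCA': [[0, 1, 1]]}
print(is_diagonal(smfish))  # True
```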

hewgreen commented 6 years ago

Could you describe what these are? I want to make sure we really don't have them

Here I'm talking about the json fields each attribute/field has in the schema following our style guide. so 'c' would have a description, user-friendly name etc. This isn't a big deal, I wasn't talking about new attributes.

@berl this is a neat idea. It certainly has legs from a first look. It may help overcome the point Ambrose and I are discussing because it reduces the ambiguity of the channel notion. We are simply saying a channel contains metadata specific to a target (or target list) which could apply across other techniques. At this point though we are extracting the whole codebook into HCA metadata (which isn't a bad thing).

Having targets as their own entity allows us to add extra metadata to them if required, link and reuse them across a project. I really like that.

I need to clarify something. Say we have 2 channels c1, c2 and we do 10 hybridisation rounds. I understand we can probe more than 20 targets. Is that right? Maybe this is where I need to know what v is. Further, will a codebook ever contain duplicate target entries? Vice versa would you ever see duplicates of c, v, h with different targets (i.e. dual target detection)?

berl commented 6 years ago

I need to clarify something. Say we have 2 channels c1, c2 and we do 10 hybridisation rounds. I understand we can probe more than 20 targets. Is that right?

Yes, but you've chosen a corner case for your example! An easier illustration: if you have 5 channels and 1 round, and each target is identified by two channels, you have 5 choose 2 = 10 unique targets.
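The counting argument can be checked with a few lines of Python. The channel names are placeholders. The last two lines assume that every (channel, round) slot is simply "on" or "off", which gives only an upper bound for the 2-channel, 10-round case, not a statement about any real codebook.

```python
from itertools import combinations
from math import comb

channels = ["c1", "c2", "c3", "c4", "c5"]  # placeholder channel names

# One round; each target is identified by a distinct pair of channels.
codewords = list(combinations(channels, 2))
print(len(codewords))  # 10
print(comb(5, 2))      # 10, the same count from the binomial coefficient

# Assumed binary on/off signal: 2 channels x 10 rounds = 20 slots.
slots = 2 * 10
max_binary_codes = 2**slots - 1  # every non-empty on/off pattern
print(max_binary_codes)  # 1048575
```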

Maybe this is where I need to know what v is.

v is supposed to be shorthand for the intensity value, with the idea that it's 1 for something labeled and implicitly zero otherwise. Ambrose allowed it to be any value because you can imagine a codebook where different ratios of channels encode different targets.

Further, will a codebook ever contain duplicate target entries? Vice versa would you ever see duplicates of c, v, h with different targets (i.e. dual target detection)?

A codebook should not contain exactly duplicate target names, but if an mRNA is probed twice in an experiment (this is OK), I believe it needs to have some kind of identifier appended in the codebook, so you can keep track of the results.

you should never see a unique c,v,h with multiple targets.

I realize that the duplicate entries in the codebook as specified above make a mess of the list of targets. Now there need to be 2 lists of targets: one list of unique biological targets and one list of unique probed-targets that includes targets probed multiple times as unique entries.

Is there a better/standard way to do this?
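One possible way to keep both lists consistent is to derive them from the codebook rather than maintain them by hand. The sketch below is only an illustration: the `#n` suffix for a gene probed more than once, the gene names and the (c, v, h) tuples are all assumed conventions, not anything defined by the schema.

```python
from collections import defaultdict

# Hypothetical codebook: probed-target name -> set of (c, v, h) triples.
# The "#n" suffix marking a repeated probe is an assumed naming convention.
codebook = {
    "ACTB#1": {("c1", 1, 0), ("c2", 1, 1)},
    "ACTB#2": {("c1", 1, 1), ("c2", 1, 2)},
    "GAPDH": {("c2", 1, 0), ("c1", 1, 2)},
}


def check_unique_codewords(codebook):
    """Raise if two probed targets share exactly the same codeword."""
    seen = defaultdict(list)
    for name, codeword in codebook.items():
        seen[frozenset(codeword)].append(name)
    clashes = [names for names in seen.values() if len(names) > 1]
    if clashes:
        raise ValueError(f"codeword shared by multiple targets: {clashes}")


def biological_targets(codebook, sep="#"):
    """Collapse probed-target names to the unique biological targets."""
    return sorted({name.split(sep)[0] for name in codebook})


check_unique_codewords(codebook)
probed_targets = sorted(codebook)
print(probed_targets)                # ['ACTB#1', 'ACTB#2', 'GAPDH']
print(biological_targets(codebook))  # ['ACTB', 'GAPDH']
```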

hewgreen commented 6 years ago

That's very useful. The penny has finally dropped for me. Thanks. And is v, the 'fluorescence output', an amplitude?

ambrosejcarr commented 6 years ago

Nailed it. Yes.

So far we're just using it as "on" (1) or "off" (0) but we wanted something general if someone does something weird. :-)

ambrosejcarr commented 6 years ago

Yeah, this is why I think the channels (and their associated imaging metadata) should be listed in their own property, and the targets should also be listed in their own property. If the targets and channels are each their own fields and the codebook stores the correspondence, I don't think there's a need for channel_type : "multiplexed". The smFISH example is just a diagonal codebook with round=0 (or absent) and one target for each channel.

I agree strongly with this. I'll add that we could also have an array for the imaging_rounds if we feel a compelling reason -- define all the entities separately, then the codebook just glues things together.
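A minimal sketch of "define the entities separately, then let the codebook glue them together" might look like this. All ids, labels and key names are hypothetical; the helper only checks that every codebook row points at an entity that has actually been defined.

```python
# Hypothetical entities, each defined once in its own list.
channels = [{"id": "c1", "label": "488"}, {"id": "c2", "label": "647"}]
imaging_rounds = [{"id": 0}, {"id": 1}]
targets = [{"id": "t1", "name": "GENE_A"}, {"id": "t2", "name": "GENE_B"}]

# The codebook only holds references (ids) that tie the entities together.
codebook = [
    {"target": "t1", "channel": "c1", "round": 0, "value": 1},
    {"target": "t1", "channel": "c2", "round": 1, "value": 1},
    {"target": "t2", "channel": "c2", "round": 0, "value": 1},
]


def dangling_references(codebook, channels, rounds, targets):
    """Return codebook rows that point at an undefined channel, round or target."""
    channel_ids = {c["id"] for c in channels}
    round_ids = {r["id"] for r in rounds}
    target_ids = {t["id"] for t in targets}
    return [
        row
        for row in codebook
        if row["channel"] not in channel_ids
        or row["round"] not in round_ids
        or row["target"] not in target_ids
    ]


dangling = dangling_references(codebook, channels, imaging_rounds, targets)
print(dangling)  # [] when every reference resolves
```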

hewgreen commented 6 years ago

I can see how sparse the codebook matrix is now. This is why flipping it so channel and/or hybridisation round is at a higher level than target may not be a good idea. In general I'm now put off extracting the codebook if possible. As glue it works well. Downstream users care about the targets and the target lists, but after the pipeline's deplexing and processing magic they won't care much about getting h and c from the metadata. If they ever needed it they would probably just go to the codebook directly.

So a "channel" (now a terrible word for this because it only works for non-multiplexed microscopy) describing a multiplexed experiment simply needs a codebook and a list of targets (these may be objects if there are any extra metadata attributes needed). Aux images of the same frame would be in a separate "channel". No codebook needed but the target and stain would still be e.g. DNA/DAPI.

berl commented 6 years ago

So a "channel" (now a terrible word for this because it only works for non-multiplexed microscopy) describing a multiplexed experiment simply needs a codebook and a list of targets (these may be objects if there is any extra metadata they need to have themselves). Aux images of the same frames would be in a separate "channel". No codebook needed but the target and stain would still be e.g. DNA/DAPI.

@hewgreen I agree the language around channel gets a bit muddy here. I think we still need something to list acquisition channels like "488" or "IR_DIC" or whatever and reference channel-specific acquisition information. Perhaps we can use acquisition_channels for that instead?
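To make the split concrete, here is one hypothetical layout written as a Python dict. Apart from `acquisition_channels`, every key name (`assays`, `targets`, `stain`, `codebook`, `note`) and the file name are invented for illustration and are not proposed schema fields.

```python
# Hypothetical layout; key names other than acquisition_channels are illustrative.
imaging_protocol = {
    "acquisition_channels": [
        {"name": "488", "note": "placeholder acquisition settings"},
        {"name": "IR_DIC", "note": "placeholder acquisition settings"},
    ],
    "assays": [
        {   # multiplexed experiment: a codebook plus a list of targets
            "targets": ["GENE_A", "GENE_B"],
            "codebook": "codebook.json",  # placeholder file name
        },
        {   # auxiliary image of the same frames: no codebook
            "targets": ["DNA"],
            "stain": "DAPI",
        },
    ],
}


def summarise(protocol):
    """Return (targets, has_codebook) for each assay in the protocol."""
    return [(a["targets"], "codebook" in a) for a in protocol["assays"]]


summary = summarise(imaging_protocol)
for assay_targets, has_codebook in summary:
    print(assay_targets, "codebook" if has_codebook else "no codebook")
```

Here the acquisition settings live in one place, while each assay entry only says what is being detected and, when needed, which codebook decodes it.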