Associating measurement data and sample descriptions

jakebeal commented 1 year ago

In LabOP, the SampleData object is used to represent the relationship between a set of containers and a set of measurements taken from those containers. This is not yet enough to support analysis, however. Typically, in an analysis, we will also need to relate those measurements to the independent variables of an experiment. There are several ways that we might approach this question, and a number of possible implementations.

Specifications to associate with the samples could come at various levels of detail, listed here from most complete to most minimal:

Full trace of the operations that have happened to each sample: not tractable for analysis
Model of the state of the samples at the time of measurement: often not possible to compute (e.g., behavior of cells during incubation in an experimental study)
Model of the contents of each sample at key time points (e.g., 10K cells in X ul media with Y nM inducer): key point identification may be difficult to automate
Independent variables only (e.g., inducer concentration): may omit important information

This information could be supplied in several ways, from most manual to most automated:

Human manually supplies the specifications for each sample in the dataset: puts a burden on the human, prevents dynamic choices
Explicit identification of "snapshot" points on model: can require complex model inference (e.g., serial pipetting)
Automated inference of appropriate snapshot points: may not be possible in general

Different combinations are likely to be appropriate for different models, so I propose that we address the issue by making the labeling of sample data a first class object.

Specifically:

Add an optional “labels” field to SampleCollection that takes an array of sbol:Component URIs pointing to sample descriptions.
- The array data must have the same shape as the base SampleArray for the SampleCollection
Add at least two primitives for labeling SampleCollection objects:
- ExcelToLabels: take an Excel spreadsheet and a SampleCollection and extract descriptions for the collection from a cell array in the spreadsheet
- ModelToLabels: snapshot a dynamic model of SampleCollection contents into labels
SampleData objects will then be implicitly labelled via the labels on the SampleCollection they reference.
If we are able to infer snapshot points, that would be LabOPed suggesting where to add “ModelToLabels” operations in a protocol

This combination will allow us to use manual specification initially and for situations where we can’t make a model, transitioning to greater levels of automation over time. We will also not have to modify any measurement operators, because the labels will be carried implicitly by the SampleCollection they are told to measure.
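A minimal sketch of the shape constraint on the proposed "labels" field, in plain Python. The class `LabeledCollection` and the helper `shape_of` are hypothetical stand-ins for illustration and are not part of LabOP; the example.org URIs are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


def shape_of(nested):
    """Return the shape of a regularly nested list, e.g. [[a, b], [c, d]] -> (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


@dataclass
class LabeledCollection:
    # Hypothetical stand-in for a SampleCollection, not the LabOP class.
    sample_shape: tuple
    labels: Optional[list] = None  # nested list of sbol:Component URIs

    def __post_init__(self):
        # The labels field is optional, but if present it must match the sample shape.
        if self.labels is not None and shape_of(self.labels) != self.sample_shape:
            raise ValueError(
                f"labels shape {shape_of(self.labels)} != sample shape {self.sample_shape}"
            )


# A 2x2 collection with one (placeholder) description URI per sample.
plate = LabeledCollection(
    sample_shape=(2, 2),
    labels=[
        ["https://example.org/desc/a1", "https://example.org/desc/a2"],
        ["https://example.org/desc/b1", "https://example.org/desc/b2"],
    ],
)
unlabeled = LabeledCollection(sample_shape=(2, 2))  # labels omitted: still valid
```

Checking the shape at construction time means a measurement operator never has to look at the labels itself, which matches the intent that labels ride along implicitly with the collection.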

jakebeal commented 1 year ago

... and it looks like my proposal for a "labels" field is already there in the form of the "contents" field that was recently added to SampleArray. My thoughts:

I think we need to have a careful discussion of the "contents" field to make sure that its usage is compatible with the proposed "labels" interpretation, since its documentation is currently quite terse and doesn't show up in the specification.
Can we change "contents" to be optional?

danbryce commented 1 year ago

Just to clarify how things currently work ...

We have SampleCollection (an abstract class), with subclasses: SampleMask and SampleArray.

SampleMask has attributes: SampleCollection source, string mask
SampleArray has attributes: ContainerSpec container_type, string contents

The SampleData class has attributes: string values, SampleCollection from_samples

In test/test_samplemap.py, the SampleData.values are serialized xarray.Dataset objects that map aliquot ids (e.g., A1, B2, etc.) to a scalar. The SampleArray.contents are serialized xarray.DataArray objects that map aliquot ids (e.g., A1, B2, etc.) and reagents (sbol component URIs) to volumes. Joining these on the aliquot id would provide a description of the contents and measurement for each aliquot.

--

I think the point @jakebeal is making is that it's not clear what SampleData.from_samples is describing with respect to SampleData.values.

danbryce commented 1 year ago

Some more detail from test/test_samplemap.py:

The SampleArray.contents for the target is:

'{"dims": ["array", "aliquot", "contents"], "attrs": {"units": "uL"}, "data": [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]], "coords": {"array": {"dims": ["array"], "attrs": {}, "data": ["target"]}, "aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}, "contents": {"dims": ["contents"], "attrs": {}, "data": ["https://bbn.com/scratch/ddH2Oa", "https://bbn.com/scratch/ddH2Ob"]}}, "name": null}'

and the absorbance SampleData.values is:

'{"coords": {"aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}}, "attrs": {}, "dims": {"aliquot": 4}, "data_vars": {"absorbance": {"dims": ["aliquot"], "attrs": {}, "data": [null, null, null, null]}}}'

These are the values after the execution engine executes the protocol. It is somewhat confusing for the target array to refer to its initial contents, seeing how the absorbance is not measured over the initial target contents (rather, the dynamic contents).
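As a rough illustration of the join on aliquot id mentioned earlier, the two serialized strings above can be combined with only the standard `json` module. This is a sketch, not the code in test/test_samplemap.py: the function `join_on_aliquot` is hypothetical, and it assumes a single array in the contents and data variables indexed only by `aliquot`.

```python
import json

# The two serialized values quoted above, copied verbatim.
contents_json = '{"dims": ["array", "aliquot", "contents"], "attrs": {"units": "uL"}, "data": [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]], "coords": {"array": {"dims": ["array"], "attrs": {}, "data": ["target"]}, "aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}, "contents": {"dims": ["contents"], "attrs": {}, "data": ["https://bbn.com/scratch/ddH2Oa", "https://bbn.com/scratch/ddH2Ob"]}}, "name": null}'
values_json = '{"coords": {"aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}}, "attrs": {}, "dims": {"aliquot": 4}, "data_vars": {"absorbance": {"dims": ["aliquot"], "attrs": {}, "data": [null, null, null, null]}}}'


def join_on_aliquot(contents, values):
    """Hypothetical helper: one row per aliquot with its contents and measurements."""
    reagents = contents["coords"]["contents"]["data"]
    aliquots = contents["coords"]["aliquot"]["data"]
    volumes = contents["data"][0]  # assumption: only one array ("target")
    rows = {
        aliquot: {"contents": dict(zip(reagents, volumes[i]))}
        for i, aliquot in enumerate(aliquots)
    }
    measured_ids = values["coords"]["aliquot"]["data"]
    for name, var in values["data_vars"].items():
        # Assumption: every data variable is indexed only by "aliquot".
        for aliquot, value in zip(measured_ids, var["data"]):
            rows.setdefault(aliquot, {"contents": {}})[name] = value
    return rows


rows = join_on_aliquot(json.loads(contents_json), json.loads(values_json))
```

Each row pairs the reagent volumes (in uL, per the `units` attribute) with the absorbance for the same aliquot. With the values above, every volume is 0.0 and every absorbance is null, which is exactly the initial-versus-dynamic contents mismatch under discussion.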

If we made the SampleArray.contents optional, then SampleData.values would not have a known dimensionality. (The execution engine creates the SampleData object in terms of the MeasureAbsorbance.samples which is bound to the target SampleArray.)

Using an ExcelToLabels primitive would give us the dimensionality of the SampleData, assuming we make SampleArray.contents optional.

I can see this working like this:

EmptyContainer.samples --> MeasureAbsorbance.samples
MeasureAbsorbance.absorbance --> ExcelToLabels.data
"" --> ExcelToLabels.labels (value pin)

If EmptyContainer.samples.contents is not specified, then MeasureAbsorbance.samples.values will be unspecified. However, ExcelToLabels.labels will impose the shape of MeasureAbsorbance.samples.values.

jakebeal commented 1 year ago

I think that I am still struggling to understand the current semantics of SampleArray.contents. Let me try to work this through with an example.

Let's say we have a protocol that executes with two sequential operations:

EmptyContainer produces a SampleArray OneTube containing a single empty 5mL tube.
Provision then puts 2 mL of M9 media into OneTube.

In a ProtocolExecution, these would be recorded with two instances of ActivityNodeExecution and one instance of ActivityEdgeFlow that connects them --- there is no outgoing edge from Provision. The OneTube object is referred to by the ActivityEdgeFlow, so presumably OneTube.contents should be empty (no media).

Is that correct?

jakebeal commented 1 year ago

Per discussion on Zoom, we are recognizing that SampleArray.contents cannot be used to represent a potentially changing value, since one would not be able to record separate values for the contents property in a serialized execution trace. Currently, it is being used to mean the initial contents of a sample array, and we will thus rename it SampleArray.initial_contents. My initial labels field proposal is also unsuitable for recording a trace.

Instead, in order to associate metadata, we will change the direction of the pointer and have a SampleMetaData object that is analogous to SampleData, except that it has sampleDescriptions instead of sampleDataValues. A DataSet object will then associate a SampleData and a SampleMetaData (which must have equal fromSamples properties).

To follow the example above, measuring the OD of the tube would be a five-operation protocol:

EmptyContainer produces a SampleArray oneTube containing a single empty 5mL tube.
Provision then puts 2 mL of M9 media into oneTube.
ModelToMetadata then takes in oneTube and produces oneMetadatum that says the tube contains 2 mL of M9 media.
MeasureAbsorbance then takes in oneTube and produces oneDatum that says the tube had OD600 = 0.7
LabelData takes in oneMetadatum and oneDatum and produces oneDataset, a ready-for-analysis dataset.
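The five operations can be sketched in plain Python as follows. This is an illustrative model only: the `Engine` class, its method names, and the dataclass fields are hypothetical stand-ins rather than the LabOP API, and the OD600 reading is passed in by hand because there is no instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleArray:
    name: str
    container_type: str


@dataclass
class SampleMetadata:
    from_samples: SampleArray
    descriptions: dict


@dataclass
class SampleData:
    from_samples: SampleArray
    values: dict


@dataclass
class DataSet:
    data: SampleData
    metadata: SampleMetadata


class Engine:
    """Hypothetical execution engine that tracks contents internally."""

    def __init__(self):
        self._state = {}  # SampleArray -> {reagent: volume in mL}

    def empty_container(self, name, container_type):
        samples = SampleArray(name, container_type)
        self._state[samples] = {}
        return samples

    def provision(self, samples, reagent, volume_ml):
        contents = self._state[samples]
        contents[reagent] = contents.get(reagent, 0.0) + volume_ml

    def model_to_metadata(self, samples):
        # Copy the internal model so later operations do not alter the snapshot.
        return SampleMetadata(samples, dict(self._state[samples]))

    def measure_absorbance(self, samples, reading):
        return SampleData(samples, {"OD600": reading})

    def label_data(self, metadata, data):
        # The two inputs must describe the same samples.
        if metadata.from_samples != data.from_samples:
            raise ValueError("metadata and data refer to different samples")
        return DataSet(data, metadata)


engine = Engine()
one_tube = engine.empty_container("oneTube", "5 mL tube")
engine.provision(one_tube, "M9 media", 2.0)
one_metadatum = engine.model_to_metadata(one_tube)
one_datum = engine.measure_absorbance(one_tube, 0.7)
one_dataset = engine.label_data(one_metadatum, one_datum)
```

Note that `SampleArray` itself never changes here: the evolving contents live in the engine, and only the snapshot taken by `model_to_metadata` is written into an object that could be serialized in a trace.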

jakebeal commented 1 year ago

I've set up a pull request containing the initial model (#184). I have not changed any of the contents references in code to initial_contents, so this probably won't yet work.

jakebeal commented 1 year ago

I've now done a search-and-replace on the code, which should take care of the errors, but there are some semantic issues exposed by the change that need to be addressed, notably in markdown_specialization.py.

jakebeal commented 1 year ago

Pull request is ready for review and merging.

bbartley commented 1 year ago

Clarifying question: If I understand the example above correctly, the ModelToMetadata action is taking an execution snapshot of the state of a SampleArray at a given point. So, the point is just to infer what the contents of the container are at a given step in the protocol?

jakebeal commented 1 year ago

With ModelToMetadata, I am assuming that the execution engine has been inferring the contents of the SampleArray on its own in some internal way. The ModelToMetadata operation grabs that model and copies it into a SampleMetadata for access.

bbartley commented 1 year ago

Is it also true, according to this proposal, that the contents of that SampleArray are not dynamically updated to track the state of the contents? So, the execution engine could infer the contents model as it executes, but, practically speaking, that may not actually be necessary until the call to ModelToMetadata executes?

jakebeal commented 1 year ago

In this proposal, contents is changed to initial_contents and is NOT updated, since updating it would invalidate trace recordings of the initial state (in the future we might even remove it, but it's being kept for backward compatibility with the code at this time).

Thus, practically speaking, the LabOP representation per se has no way to record when the execution makes its inferences. The execution engine is thus free to calculate as it goes, at the call to ModelToMetadata, or in any other combination that works for it.
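To make the "free to calculate as it goes or at the call" point concrete, here is a small sketch of two interchangeable engine-side strategies. Both classes and their `record`/`snapshot` methods are hypothetical and not part of LabOP; volumes are assumed to be in uL.

```python
from collections import defaultdict


class EagerContentsModel:
    """Hypothetical: update the contents model on every operation."""

    def __init__(self):
        self._state = defaultdict(lambda: defaultdict(float))

    def record(self, sample_id, reagent, volume_ul):
        self._state[sample_id][reagent] += volume_ul

    def snapshot(self, sample_id):
        return dict(self._state[sample_id])


class LazyContentsModel:
    """Hypothetical: only log operations, and infer contents when a snapshot is requested."""

    def __init__(self):
        self._log = []  # (sample_id, reagent, volume_ul), in execution order

    def record(self, sample_id, reagent, volume_ul):
        self._log.append((sample_id, reagent, volume_ul))

    def snapshot(self, sample_id):
        contents = defaultdict(float)
        for logged_id, reagent, volume_ul in self._log:
            if logged_id == sample_id:
                contents[reagent] += volume_ul
        return dict(contents)


snapshots = []
for model in (EagerContentsModel(), LazyContentsModel()):
    model.record("oneTube", "M9 media", 2000.0)
    model.record("oneTube", "inducer", 5.0)
    snapshots.append(model.snapshot("oneTube"))
```

Both strategies yield the same snapshot, so a recorded trace that contains only snapshots cannot distinguish between them.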

bbartley commented 1 year ago

It seems like these two changes work at cross-purposes:

1) ModelToMetadata makes the state of the model accessible at a given point in execution
2) The contents field is no longer used to track state

If we can use 1) to pull out a snapshot from the execution trace (and I am on board with that idea), then why should it matter if the contents attribute is dynamic? (I understand we have "invalidated the execution trace", but that shouldn't matter, because now the user can explicitly pull out a snapshot anytime they need it.)

jakebeal commented 1 year ago

This, indeed, does not have a representation for tracking state. Previously, however, the contents field wasn't actually tracking state either: we were kludging it for that use in some cases, but its semantics were self-contradictory.

Per the conversation with @danbryce yesterday, the representation of the state of containers and equipment is a potentially very deep representational question, because of the questions of time and ordering that get involved. These are not needed for a snapshot, though. We will likely want to address state representation at some point in the future, but if we can avoid standardizing it for now, then we can keep experimenting pragmatically in execution engines without having to commit to how general representations of evolving state are shared (only snapshots).

Yes, we will need a place other than the "contents" field to put state into. Lots of ways to do that, though, including simply having the execution environment keep a dictionary mapping SampleArray to state.

Bioprotocols / labop

Associating measurement data and sample descriptions #183