Open jakebeal opened 1 year ago
... and it looks like my proposal for a "labels" field is already there in the form of the "contents" field that was recently added to SampleArray. My thoughts:
Just to clarify how things currently work ...
We have SampleCollection (an abstract class), with subclasses SampleMask and SampleArray.
SampleMask has attributes: SampleCollection source, string mask
SampleArray has attributes: ContainerSpec container_type, string contents
The SampleData class has attributes: string values, SampleCollection from_samples
In test/test_samplemap.py, the SampleData.values are serialized xarray.Dataset objects that map aliquot ids (e.g., A1, B2, etc.) to a scalar. The SampleArray.contents are serialized xarray.DataArray objects that map aliquot ids (e.g., A1, B2, etc.) and reagents (SBOL component URIs) to volumes. Joining these on the aliquot id would provide a description of the contents and measurement for each aliquot.
--
I think the point @jakebeal is making is that it's not clear what SampleData.from_samples is describing with respect to SampleData.values.
Some more detail from test/test_samplemap.py:
The SampleArray.contents for the target is:
'{"dims": ["array", "aliquot", "contents"], "attrs": {"units": "uL"}, "data": [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]], "coords": {"array": {"dims": ["array"], "attrs": {}, "data": ["target"]}, "aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}, "contents": {"dims": ["contents"], "attrs": {}, "data": ["https://bbn.com/scratch/ddH2Oa", "https://bbn.com/scratch/ddH2Ob"]}}, "name": null}'
and the absorbance SampleData.values is:
'{"coords": {"aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}}, "attrs": {}, "dims": {"aliquot": 4}, "data_vars": {"absorbance": {"dims": ["aliquot"], "attrs": {}, "data": [null, null, null, null]}}}'
These are the values after the execution engine executes the protocol. It is somewhat confusing for the target array to refer to its initial contents, seeing how the absorbance is not measured over the initial target contents (rather, the dynamic contents).
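As a rough illustration of the join on aliquot id mentioned earlier, here is a minimal sketch that works directly on the two serialized strings quoted above, using only the standard library rather than xarray. The function name `join_on_aliquot` and the shape of the returned rows are assumptions made for this example, not part of LabOP, and the sketch assumes a single array along the first dimension.

```python
import json

# The two serialized strings quoted above from test/test_samplemap.py.
contents_json = (
    '{"dims": ["array", "aliquot", "contents"], "attrs": {"units": "uL"}, '
    '"data": [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]], '
    '"coords": {"array": {"dims": ["array"], "attrs": {}, "data": ["target"]}, '
    '"aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}, '
    '"contents": {"dims": ["contents"], "attrs": {}, "data": '
    '["https://bbn.com/scratch/ddH2Oa", "https://bbn.com/scratch/ddH2Ob"]}}, '
    '"name": null}'
)
values_json = (
    '{"coords": {"aliquot": {"dims": ["aliquot"], "attrs": {}, "data": [0, 1, 2, 3]}}, '
    '"attrs": {}, "dims": {"aliquot": 4}, '
    '"data_vars": {"absorbance": {"dims": ["aliquot"], "attrs": {}, '
    '"data": [null, null, null, null]}}}'
)


def join_on_aliquot(contents_json, values_json):
    """Hypothetical helper: one row per aliquot with contents and measurements."""
    contents = json.loads(contents_json)
    values = json.loads(values_json)

    aliquots = contents["coords"]["aliquot"]["data"]
    reagents = contents["coords"]["contents"]["data"]
    units = contents["attrs"].get("units")
    volumes = contents["data"][0]  # assumption: exactly one array ("target")
    value_aliquots = values["coords"]["aliquot"]["data"]

    rows = {}
    for i, aliquot in enumerate(aliquots):
        row = {"contents": dict(zip(reagents, volumes[i])), "units": units}
        j = value_aliquots.index(aliquot)  # match on the shared aliquot id
        for name, var in values["data_vars"].items():
            row[name] = var["data"][j]
        rows[aliquot] = row
    return rows


rows = join_on_aliquot(contents_json, values_json)
```

Each row pairs the recorded (initial) volumes with the measurement for the same aliquot, which makes the mismatch discussed here visible: the volumes are the initial contents, not what was actually measured.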
If we made the SampleArray.contents optional, then SampleData.values would not have a known dimensionality. (The execution engine creates the SampleData object in terms of the MeasureAbsorbance.samples, which is bound to the target SampleArray.)
Using an ExcelToLabels primitive would give us the dimensionality of the SampleData, assuming we make SampleArray.contents optional.
I can see this working like this:
EmptyContainer.samples --> MeasureAbsorbance.samples
MeasureAbsorbance.absorbance --> ExcelToLabels.data
If EmptyContainer.samples.contents is not specified, then MeasureAbsorbance.samples.values will be unspecified. However, ExcelToLabels.labels will impose the shape of MeasureAbsorbance.samples.values.
I think that I am still struggling to understand the current semantics of SampleArray.contents. Let me try to work this through with an example.
Let's say we have a protocol that executes with two sequential operations:
1) EmptyContainer produces a SampleArray OneTube containing a single empty 5mL tube.
2) Provision then puts 2 mL of M9 media into OneTube.
In a ProtocolExecution, these would be recorded with two instances of ActivityNodeExecution and one instance of ActivityEdgeFlow that connects them; there is no outgoing edge from Provision. The OneTube object is referred to by the ActivityEdgeFlow, so presumably OneTube.contents should be empty (no media).
Is that correct?
Per discussion on Zoom, we are recognizing that SampleArray.contents cannot be used to represent a potentially changing value, since one would not be able to record separate values for the contents property in a serialized execution trace. Currently, it is being used to mean the initial contents of a sample array, and we will thus rename it SampleArray.initial_contents. My initial labels field proposal is also unsuitable for recording a trace.
Instead, in order to associate metadata, we will change the direction of the pointer and have a SampleMetaData object that is analogous to SampleData, except that it has sampleDescriptions instead of sampleDataValues. A DataSet object will then associate a SampleData and a SampleMetaData (which must have equal fromSamples properties).
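To make the reversed pointer concrete, here is a minimal Python sketch of the three objects. The dataclasses below are hypothetical stand-ins written for this comment, not the LabOP classes themselves; the snake_case field names and the use of plain dictionaries for values and descriptions are assumptions. Object identity stands in for "equal fromSamples".

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SampleArray:
    """Hypothetical stand-in for a LabOP SampleArray."""
    name: str
    container_type: str
    initial_contents: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SampleData:
    """Measurements that point back at the samples they came from."""
    from_samples: SampleArray
    sample_data_values: Dict[str, Any]


@dataclass
class SampleMetaData:
    """Descriptions that point back at the samples they describe."""
    from_samples: SampleArray
    sample_descriptions: Dict[str, Any]


@dataclass
class DataSet:
    """Associates data and metadata about the same samples."""
    data: SampleData
    metadata: SampleMetaData

    def __post_init__(self):
        # Assumption: "equal fromSamples" is modeled as the same object.
        if self.data.from_samples is not self.metadata.from_samples:
            raise ValueError("data and metadata must refer to the same samples")


one_tube = SampleArray("oneTube", "5 mL tube")
one_datum = SampleData(one_tube, {"OD600": 0.7})
one_metadatum = SampleMetaData(one_tube, {"M9 media": "2 mL"})
one_dataset = DataSet(one_datum, one_metadatum)
```

Because both SampleData and SampleMetaData point at the SampleArray, the SampleArray itself never has to change, which is what keeps a serialized trace consistent.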
To follow the example above, measuring the OD of the tube would be a five-operation protocol:
1) EmptyContainer produces a SampleArray oneTube containing a single empty 5mL tube.
2) Provision then puts 2 mL of M9 media into oneTube.
3) ModelToMetadata then takes in oneTube and produces oneMetadatum that says the tube contains 2 mL of M9 media.
4) MeasureAbsorbance then takes in oneTube and produces oneDatum that says the tube had OD600 = 0.7.
5) LabelData takes in oneMetadatum and oneDatum and produces oneDataset, a ready-for-analysis dataset.
I've set up a pull request containing the initial model (#184). I have not changed any of the contents references in code to initial_contents, so this probably won't yet work.
I've now done a search-and-replace on the code, which should take care of the errors, but there are some semantic issues exposed by the change that need to be addressed, notably in markdown_specialization.py.
Pull request is ready for review and merging.
Clarifying question: If I understand the example above correctly, the ModelToMetadata action is taking an execution snapshot of the state of a SampleArray at a given point. So, the point is just to infer what the contents of the container are at a given step in the protocol?
With ModelToMetadata, I am assuming that the execution engine has been inferring the contents of the SampleArray on its own in some internal way. The ModelToMetadata operation grabs that model and copies it into a SampleMetadata for access.
Is it also true, according to this proposal, that the contents of that SampleArray are not dynamically updated to track the state of the contents? So, the execution engine could infer the contents model as it executes, but, practically speaking, that may not actually be necessary until the call to ModelToMetadata executes?
In this proposal, contents is changed to initial_contents and is NOT updated, since updating it would invalidate trace recordings of the initial state (in the future we might even remove it, but it's being kept for backward compatibility with the code at this time).
Thus, practically speaking, the LabOP representation per se has no way to record when the execution makes its inferences. The execution engine is thus free to calculate as it goes, at the call for ModelToMetadata, or any other combination that works for it.
It seems like these two changes work at cross-purposes:
1) ModelToMetadata makes the state of the model accessible at a given point in execution
2) The contents field is no longer used to track state
If we can use 1) to pull out a snapshot from the execution trace (and I am on board with that idea), then why should it matter if the contents attribute is dynamic? (I understand we have "invalidated the execution trace", but that shouldn't matter, because now the user can explicitly pull out a snapshot anytime they need it.)
This, indeed, does not have a representation for tracking state. Previously, however, the contents field wasn't actually tracking state either: we were kludging it for that use in some cases, but its semantics were self-contradictory.
Per the conversation with @danbryce yesterday, the representation of the state of containers and equipment is a potentially very deep representational question, because of the questions of time and ordering that get involved. These are not needed for a snapshot, though. We will likely want to address state representation at some point in the future, but if we can avoid standardizing it for now, then we can keep experimenting pragmatically in execution engines without having to commit to how general representations of evolving state are shared (only snapshots).
Yes, we will need a place other than the "contents" field to put state into. Lots of ways to do that, though, including simply having the execution environment keep a dictionary mapping SampleArray to state.
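As one possible reading of the dictionary approach, here is a minimal sketch of an execution engine that keeps evolving state outside the SampleArray and hands out snapshots on request. The `ExecutionEngine` class, its method names, and the `{reagent: volume}` state format are all assumptions made for illustration; they are not the actual LabOP execution engine API.

```python
import copy


class SampleArray:
    """Hypothetical stand-in; initial_contents is never updated after creation."""

    def __init__(self, name, initial_contents=None):
        self.name = name
        self.initial_contents = dict(initial_contents or {})


class ExecutionEngine:
    """Hypothetical engine that tracks state in a side dictionary."""

    def __init__(self):
        # Maps each SampleArray to its current {reagent: volume in mL}.
        self._state = {}

    def empty_container(self, name):
        samples = SampleArray(name)
        self._state[samples] = {}
        return samples

    def provision(self, samples, reagent, volume_ml):
        # Only the engine's private state changes; the SampleArray does not.
        contents = self._state[samples]
        contents[reagent] = contents.get(reagent, 0.0) + volume_ml

    def model_to_metadata(self, samples):
        # Return a copy so later operations cannot alter the snapshot.
        return copy.deepcopy(self._state[samples])


engine = ExecutionEngine()
one_tube = engine.empty_container("oneTube")
engine.provision(one_tube, "M9 media", 2.0)
snapshot = engine.model_to_metadata(one_tube)
engine.provision(one_tube, "M9 media", 1.0)
```

The snapshot taken after the first Provision stays fixed even though the engine's internal state keeps changing, and `one_tube.initial_contents` remains empty throughout. Whether an engine updates state eagerly like this or computes it lazily at the ModelToMetadata call is left open by the proposal.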
In LabOP, the SampleData object is used to represent the relationship between a set of containers and a set of measurements taken from those containers. This is not yet enough to support analysis, however. Typically, in an analysis, we will also need to relate those measurements to the independent variables of an experiment. There are several ways that we might approach this question, and a number of possible implementations.
Specifications to associate with the samples could come at various levels of detail, listed here from most complete to most minimal:
This information could be supplied in several ways, from most manual to most automated:
Different combinations are likely to be appropriate for different models, so I propose that we address the issue by making the labeling of sample data a first class object.
Specifically:
This combination will allow us to use manual specification initially and for situations where we can’t make a model, transitioning to greater levels of automation over time. We will also not have to modify any measurement operators, because the labels will be carried implicitly by the SampleCollection they are told to measure.