Representation of array data

jakebeal commented 2 years ago

We have three places in PAML where we currently need to represent N-dimensional array-structured information:

The contents property of SampleArray is an array of URIs to SBOL Component objects
The mask property of SampleMask is an array of booleans
The sampleDataValues property of SampleData is an array of data (e.g., absorbance values) or links to external data objects (e.g., FCS files).

We need a way to represent and serialize these objects. The following possibilities have been discussed so far:

Represent arrays using RDF objects. Semantically clean, but bloated, must be converted to other forms for effective manipulation.
Represent arrays using JSON (or NumPy or xarray or Pandas...)
Represent arrays using a linked "sidecar file"
Represent data using https://www.animl.org (https://en.wikipedia.org/wiki/AnIML)

photocyte commented 2 years ago

My recollection from the PAML weekly meeting is at least the bullet points contents (1) and mask (2) above, colloquially were the "platemaps" question? If so, I think it comes down to how those platemaps are desired to be edited. E.g. via the PAML editor? Or, an external tool? If an external tool, while it is a bit icky, I don't think you can beat an .xlsx template for different container types (i.e. 24-well, 96, 384) to maintain widespread interoperability while having a bit more control over column/row metadata vs a CSV. Such .xlsx templates could also be converted to language specific dataframe formats (R?, Python), with only a few lines of code.

My personal preference is to do things this sort of masking of data in analyzing plates using Pandas dataframes. I have seen some libraries in the Python space for doing platemap type things. None I've used enough to recommend.

These Javascript libraries seem relevant to the plate map question if going for a web based editor: https://github.com/nebiolabs/plate-map https://github.com/abolis-biotechnologies/plate-maker https://github.com/vindelorme/PlateEditor

For bullet point sampleDataVales (3), this is a tricky thing. Regarding linked sidecar files, for the flow cytometry space, gratefully there is a vendor standard in FCS. To my knowledge, that is the only such example, all other instrument/measurement types have vendor specific data formats, some binary, some XML, some .xlsx. In cases where it can be converted to an open data format, i.e. LC-MS ".raw" -> ".mzML", some data is lost, not represented, or "out of scope" for the file format/standard. AniML, I think aspires to try to address that conundrum of having a broader container that can hold more types of measurement types.

It's always possible to just base64 the raw arrays of values into the PAML file, but that strikes me as a sort of minimum viable solution, vs, having more metadata be captured.

jakebeal commented 2 years ago

Plate-maps will indeed be a typical case, and I agree with you about Excel as an I/O interface. There will be cases that it's not quite that simple, e.g., expressing a factor space such as "three replicates of these fix strains against these three induction levels in these four media". Those N-dimensional expressions can still often be done cleanly in a spreadsheet-like table, however.

How well does Pandas generalize to more than 2 dimensions? Is it clean, or is it more focused on 2D? Also, is Pandas python-specific, or is it a shared format that has support in other languages as well? JSON is clearly an available lowest common denominator...

With respect to linking to sidecar files, formats is fortunately a problem that we can pass of to a pre-existing solution: the EDAM ontology (https://edamontology.org/) has been systematizing data formats, and is already used in the SBOL Attachment class for indicating the type of a linked file. We just need to decide when we ingest data into PAML and when we leave it as an attachment.

photocyte commented 2 years ago

I believe a Pandas dataframe is explicitly limited to 2D. This StackOverflow seems to support that: https://stackoverflow.com/questions/24290495/constructing-3d-pandas-dataframe

But, Numpy data structures are N-dimensional & can also hold objects. I don't know if there is a well accepted / widely supported serialization format for Numpy data structures. Clearly if it the data contained is universally string rather than objects, there is probably a Json serializer. A quick test would be whichever is the most accepted way for converting Numpy stuff to the R equivalent (of which I am not familiar) - it may be CSV or .xlsx conversion... Clearly for N-dimensional representation, a 3D representation, could be a list of .xlsxs, for 4D could have a list of list of .xlsxs.

Thanks for the link to EDAM, was not aware of it, & interested to read up on it a bit more. I agree, it comes down to what to embed vs what to link to, and leave it up to the PAML user to be able to read the linked file even if it is a weird file format. For embedded data, PAML should take the initiative to have high interoperability & metadata inclusion for said data. Given the confusion of having embedded vs linked files for data, seems worthwhile to have only linked files.

P.S. speaking of side-car files, could XMP files (https://en.wikipedia.org/wiki/Extensible_Metadata_Platform ) describing the "vendor format" sidecar files use the EDAM ontology? Then executed PAML protocols might have a format for interoperabily per-file metadata as well that is agnostic of whether the data file format can embed metadata.

jakebeal commented 2 years ago

My point about embedded data is that there are certain types of data that a PAML execution engine needs to be able to actively manipulate. SampleMask is the primary example of such a data type, since protocols often need to be able to select a subset of samples at runtime and to detect error conditions. For example, one may wish to determine when a culture has grown enough to proceed to the next step of a protocols, or to check which of a set of assembly reactions have succeeded.

We should be very conservative about what we allow, however, since we don't want to open up general computation as part of the responsibilities of an execution engine.

noahsprent commented 2 years ago

Don't really have any opinions on any of this, but just noticed the Pandas question above and wanted to chime in that I use DataFrames for multi-dimensional data all the time, using multi-indices:

https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#hierarchical-indexing-multiindex

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

I would normally serialise the object as a simple .csv for simplicity, which could obviously also be parsed by other languages, but no information is retained about which rows/columns are headers/indices vs data. This doesn't sound like multi-dimensionality in the way that you need it, and on first impressions Pandas doesn't seem like the right tool, but just wanted to throw it in in case it's useful!

bbartley commented 2 years ago

We should perhaps consider that the contents property of SampleArray is an array of URIs to SBOL Implementation objects rather than Components

jakebeal commented 2 years ago

I'm not certain whether that's always true... if we're planning how a protocol should run, we don't have the samples yet, and we might get multiple sets of samples for multiple runs. If we've actually run the protocols, however, then in the trace it would indeed be Implementation objects.

rpgoldman commented 2 years ago

The xarray program, which is a sort of pandas for n-dimensional arrays, uses netCDF as its preferred file format. See this documentation page.

I don't have any deep knowledge of netCDF, personally. Here's its web site

The xarray folks and netCDF folks seem to be mostly earth science people, because they have so much need for large, multi-dimensional data sets. My experience with them comes from working on the arviz and pymc projects, which use xarray data structures to store and process the results of Monte Carlo sampling.

danbryce commented 2 years ago

I'm chiming in now that I touched up PR #76. Its our current version of handling data, so I'll summarize what it does:

SampleArray represents the data with a PrimitiveArray. Each element of a PrimitiveArray is a OrderedPropertyValue (to support array indices), which holds a value of type uml:Literal. It is currently used to represent a list of aliquot ids as StringLiterals (not URIs to SBOL components)
SampleMask is a string (e.g., "A1:B2"), not an array of Booleans. I'm not sure an array of boolean is right either. It would depend on the SampleArray representation so that you can identify which element each Boolean references. I don't particularly like using array indices to align elements of the SampleArray and the SampleMask. I think the current string supports what we need right now, but ultimately, the SampleMask is a filter/constraint on the SampleArray. I don't know what sorts of other constraints would be useful. I suppose the editor would hide the details, and using an index aligned boolean array could work.
SampleData.sampleDataValues is currently a string (holding a CSV), not an array.

In response to the suggestions above, I would comment:

Arrays as RDF objects: I looked into some standards for this, and they seemed to be particularly complicated. I don't want to make my own representation.
JSON, Pandas, Numpy, etc: I think these are better for actual data, and are passable at holding some metadata. I would want to limit the metadata stuffed into these formats, because this is where PAML should shine. Adopting this would require a metadata attachment of some sort, that describes each datapoint in the data. To enforce that we don't push metadata into the data format, we would probably want Numpy.
Sidecar files: We need these for data that has existing formats that are widely adopted, such as FASTA or FCS.
ANIML: This could work, but may be too complex for our needs. I don't like depending no something that hasn't been maintained recently.

The pointers to the plate editors seem like viable options for the PAML Editor. I found some CSV editors that would work too, but these existing options are attractive. If these are the front end format, then we should ensure that it can be represented in the RDF. I'm a little concerned about how to support metadata authoring in these editors. For example, how would a user extend the metadata by adding a new attribute to each sample? PAML should model generic metadata attributes in this case.

A possible next consideration is how to "enhance" metadata by pushing the description of each sample through the protocol. For example, by adding a column to the SampleData that includes the samples' temperature or volume (as manipulated by previous activities).

jakebeal commented 2 years ago

@danbryce The current implementation of SampleArray is a "roll our own" RDF, so it will be nice to get away from it when we can. It sounds like your preference is for numpy, and I would support that.

I also want to argue in favor of the Boolean array model for SampleMask because that's the only way we'll be able to support dynamic masking based on intermediate results of protocols, such as checking for samples that have grown well or samples that are showing an expected fluorescence. These are common operations in protocols. For example, they are needed for most assembly protocols, including GoldenGate.

With regards to adding sample descriptions: my suggestion there was that we simply have an array of URIs, which then map to SBOL Component objects. We know from the prior work with Intent Parser that these should suffice for describing samples.

danbryce commented 2 years ago

SampleArray as numpy would require a string version of the numpy array, I think. Metadata would be a parallel array of references to SBOL Components. Its a little weird to use implicit indices in the stringified numpy array and explicit indices for the metadata as an RDFified array. However, I’m willing to live with that. The RDF needs to point into whatever opaque format we use, such as stringified numpy.

Its also a little weird to use a different representation of the SampleArray (stringified numpy) and SampleMask (PrimitiveArray).

On Feb 28, 2022, at 10:42 AM, Jacob Beal @.***> wrote:

@danbryce The current implementation of SampleArray is a "roll our own" RDF, so it will be nice to get away from it when we can. It sounds like your preference is for numpy, and I would support that.

I also want to argue in favor of the Boolean array model for SampleMask because that's the only way we'll be able to support dynamic masking based on intermediate results of protocols, such as checking for samples that have grown well or samples that are showing an expected fluorescence. These are common operations in protocols. For example, they are needed for most assembly protocols, including GoldenGate.

With regards to adding sample descriptions: my suggestion there was that we simply have an array of URIs, which then map to SBOL Component objects. We know from the prior work with Intent Parser that these should suffice for describing samples.

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you were mentioned.

jakebeal commented 2 years ago

Can't we use numpy arrays for all of them?

rpgoldman commented 2 years ago

Do numpy arrays have a serialization that can be used by, e.g., R, Julia?

danbryce commented 2 years ago

Yes, that would be fine too. They have to/from string methods that should allow us to serialize.

On Feb 28, 2022, at 11:29 AM, Jacob Beal @.***> wrote:

Can't we use numpy arrays for all of them?

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you were mentioned.

jakebeal commented 2 years ago

@rpgoldman Yes - the strings they produce are pretty simple to parse, and at least some languages (like R) already have library support.

rpgoldman commented 2 years ago

@rpgoldman Yes - the strings they produce are pretty simple to parse, and at least some languages (like R) already have library support.

OK, that was my only worry.

danbryce commented 2 years ago

I looked into numpy a little more, and it is both good and bad. It has a binary format npy that you can save/load. It is supposed to support everything that you can do with a numpy array, but seems to have spotty support outside of python. The supposed benefit of using this binary format is to optimize for data i/o speed. While that could be a concern for us, I think a clear format would be better (more portable). For that, we can use str(). This would give us something like JSON arrays. Any strong feelings about binary vs clear?

jakebeal commented 2 years ago

Definitely clear, because the density advantage isn't that much and only a few languages support .npy. Anything really large should be a sidecar file anyway.

Bioprotocols / labop

Representation of array data #113