HUPO-PSI / psi-ms-CV

HUPO-PSI mass spectrometry CV

PSI-MS example file #82

Closed davidshumway closed 3 years ago

davidshumway commented 3 years ago

Describe the question or discussion

Hi,

Is there a small example file using the PSI-MS ontology? I'm interested in how the binary data array class is populated, specifically for m/z array and intensity array. For example, the m/z array has the relationship has_units->m/z. So there is the m/z class. But I don't see a class specifying the definition of "array". Would something along the lines of rdf:Seq (ordered) suffice?

Cheers David

mobiusklein commented 3 years ago

For lack of a better way to describe parallel arrays semantically, yes rdf:Seq conveys the "order matters" concept.

For an example of how those terms from the PSI-MS ontology are used formally, see https://github.com/HUPO-PSI/mzML/blob/master/examples/1min.mzML#L178. Or, for a more recent example, https://github.com/mobiusklein/ms_deisotope/blob/master/ms_deisotope/test/test_data/only_ms2_mzml.mzML#L102. Both use the mzML format, which stores the raw numerical data, optionally compressed with zlib, as a base64-encoded string in an XML file, so some decoding is required. The same goes for the intensity array element adjacent to the m/z array element.
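As a rough sketch of that decoding step (not taken from either linked file), the following assumes the `<binary>` text holds little-endian floats and that the precision and compression are read from the cvParams on the enclosing binaryDataArray. The function names `decode_binary_array` and `encode_binary_array` are made up for illustration.

```python
import base64
import struct
import zlib


def decode_binary_array(text, compressed=True, precision=64):
    """Decode the text of an mzML <binary> element into a list of floats.

    Assumes little-endian IEEE floats; `precision` (32 or 64) and
    `compressed` would come from the cvParams next to the <binary> element.
    """
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    code = "d" if precision == 64 else "f"
    count = len(raw) // (precision // 8)
    return list(struct.unpack("<%d%s" % (count, code), raw))


def encode_binary_array(values, compressed=True, precision=64):
    """Inverse of decode_binary_array, used here only to build test input."""
    code = "d" if precision == 64 else "f"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


# Two parallel arrays, as they would appear in adjacent binaryDataArray elements.
mz_text = encode_binary_array([100.5, 204.25, 304.75])
intensity_text = encode_binary_array([10.0, 250.0, 42.0])
mz_array = decode_binary_array(mz_text)
intensity_array = decode_binary_array(intensity_text)
```

The i-th entry of `mz_array` pairs with the i-th entry of `intensity_array`; that positional pairing is the only link between the two.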

Usually, the m/z array is ordered by the m/z coordinate of a mass spectrum, and if it is present, it implies that all other arrays are ordered w.r.t. the m/z array. However, this is not required by either the PSI-MS controlled vocabulary or the mzML format specification. Additionally, not all spectra are mass spectra, so the coordinate array may be a different one, e.g. wavelength array for photosensor arrays, or time array for chromatograms. There is usually another term in the parent XML element that indicates the type of the entity being described.

Additionally, nothing precludes a chromatogram from having both an m/z array and a time array. In such a case, the ordering would be expected to be w.r.t. the time array, but this is entirely up to the reader, as the writer has no formal method of conveying this expectation.
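Since the ordering is a convention rather than a guarantee, a cautious reader could re-sort the parallel arrays by whichever coordinate array it decides on. A minimal sketch, with the hypothetical helper `sort_by_coordinate` operating on plain Python lists:

```python
def sort_by_coordinate(coordinate, *others):
    """Return copies of all arrays reordered by ascending `coordinate`."""
    for other in others:
        if len(other) != len(coordinate):
            raise ValueError("parallel arrays must have equal length")
    # Compute the permutation once, then apply it to every array.
    order = sorted(range(len(coordinate)), key=lambda i: coordinate[i])
    return tuple([arr[i] for i in order] for arr in (coordinate, *others))


mz_sorted, intensity_sorted = sort_by_coordinate(
    [304.75, 100.5, 204.25],
    [42.0, 10.0, 250.0],
)
```

This would not apply to arrays that are not coupled to the coordinate array in the first place.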

There are also arrays that explicitly are not coupled to that default coordinate array, like sampled m/z noise array.

Does this answer your question, or miss it completely?

davidshumway commented 3 years ago

Yes. Thanks for pointing out example files!

davidshumway commented 3 years ago

Out of curiosity and perhaps in another direction, does the PSI-MS ontology allow for representation of a spectrum in plain text rather than binary? Two types of spectra come to mind: either 1) a preprocessed spectrum with, let's say, 1300 m/z + intensity values, or 2) just the important peaks of a spectrum, let's say 50 m/z + intensity values. To go a little further, I'm curious whether individual peaks can be stored in a sequential format such that it would be possible to query the RDF for a single peak of interest. Perhaps this is not the usual or preferred method of accessing most spectra, which is why I mention my curiosity on the subject.

mobiusklein commented 3 years ago

The ontology itself makes no specification about how data are rendered, that's up to a file format specification. In the case of mzML, it was a matter of efficiency to use binary encoding instead of text for whole arrays. There are JSON formats where instead of representing them in binary, each array is a JSON Array of Numbers, but these are not expected to be used to serialize entire LC-MS/MS experiments at once. There is still an m/z array and an intensity array instead of a single array of objects with an m/z and an intensity attribute though.
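To make the parallel-array layout concrete, here is a small sketch of such a text rendering. The key names are simply borrowed from the CV term names for illustration and are not the schema of any particular JSON format.

```python
import json

# One spectrum with two parallel arrays rather than a list of peak objects.
spectrum = {
    "id": "scan=1",
    "m/z array": [100.5, 204.25, 304.75],
    "intensity array": [10.0, 250.0, 42.0],
}

text = json.dumps(spectrum)
loaded = json.loads(text)

# A reader that wants per-peak pairs has to zip the arrays itself.
peaks = list(zip(loaded["m/z array"], loaded["intensity array"]))
```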

There aren't terms meant for describing individual peaks/ions in isolation without them being special in some way (e.g. a precursor or "selected ion") or having already been identified. You could use children of ion selection attribute to denote the m/z and intensity of individual peaks, but that would not be their intended use.

You could store individual peaks sequentially in a database, but you end up paying storage overhead to connect them to the semantically related objects. There are several independent data engineering trade-offs to make then, but if they make sense for your application, you're free to do so. There are difficulties associated with modeling peaks, stemming from their being floating point numbers, that make a generic semantic relationship modeling system like RDF less helpful.

davidshumway commented 3 years ago

It seems perhaps a little surprising that there are no use cases so far about including semantic access to individual peaks in raw and/or preprocessed spectra.

chambm commented 3 years ago

MzIdentML has some product ion info that can be optionally embedded, but it makes the files ginormous so it's rare to see it. MzSpecLib will certainly have annotated product ions. But maybe I'm not really understanding what you mean by "semantic access to individual peaks".

davidshumway commented 3 years ago

Actually, I think I may be misunderstanding PSI-MS. As an ontology, I assumed data (i.e. instances) could be added, creating, e.g., a "knowledge graph".

Suppose the goal is to query for a range of peaks. This can be achieved in relational databases. For example, in postgres, by using the between operators: select * from x where value >= 0 and value <= 1.
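A minimal sketch of that kind of range query, using Python's built-in sqlite3 module in place of postgres so it runs without a server. The `peak` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE peak (scan_id TEXT, mz REAL, intensity REAL)")
# An index on mz lets the range query avoid a full table scan.
conn.execute("CREATE INDEX peak_mz_idx ON peak (mz)")

rows = [
    ("scan=1", 100.5, 10.0),
    ("scan=1", 304.27, 42.0),
    ("scan=2", 304.275, 7.0),
    ("scan=2", 500.0, 3.0),
]
conn.executemany("INSERT INTO peak VALUES (?, ?, ?)", rows)

# Every stored peak whose m/z falls inside the requested window.
hits = conn.execute(
    "SELECT scan_id, mz, intensity FROM peak "
    "WHERE mz BETWEEN ? AND ? ORDER BY mz",
    (304.26, 304.28),
).fetchall()
```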

I suppose my misunderstanding is that PSI-MS can be used as an ontology to the extent instance data can be added and retrieved.

mobiusklein commented 3 years ago

My impression was @davidshumway was looking for a way to search for an experimental peak (not an identification), and to say it in SPARQL:

PREFIX MS: <https://www.psidev.info/psi-ms/4.1.58#>

SELECT ?peak ?scan ?dataset
WHERE {
  ?peak MS:peak_of ?scan .
  ?scan MS:scan_of ?dataset .
  ?mz MS:mz_of ?peak .
  FILTER( abs(?mz - 304.27) < 0.01 )
}

However, PSI-MS CV lacks terms for those relationships. For the most part, they are relationships implicit in how the file formats are organized. For that matter, mzML doesn't concern itself with "peaks", it stores "peak data", which are not individually addressable.
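In practice, then, a reader answers that kind of question by decoding each spectrum's arrays and searching them. A rough sketch of the same lookup over decoded peak data, assuming each m/z array is sorted ascending (usual, but as noted earlier not required) and using a hypothetical in-memory mapping from scan id to a pair of parallel lists:

```python
from bisect import bisect_left, bisect_right


def find_peaks(spectra, target_mz, tolerance):
    """Yield (scan_id, mz, intensity) for peaks within tolerance of target_mz.

    `spectra` maps a scan id to an (mz_array, intensity_array) pair of
    parallel lists; each mz_array is assumed to be sorted ascending.
    """
    for scan_id, (mz_array, intensity_array) in spectra.items():
        lo = bisect_left(mz_array, target_mz - tolerance)
        hi = bisect_right(mz_array, target_mz + tolerance)
        for i in range(lo, hi):
            yield scan_id, mz_array[i], intensity_array[i]


spectra = {
    "scan=1": ([100.5, 304.27, 500.0], [10.0, 42.0, 3.0]),
    "scan=2": ([150.0, 304.275, 304.5], [5.0, 7.0, 9.0]),
}
matches = list(find_peaks(spectra, 304.27, 0.01))
```

The peak is identified only by its scan and its position in the arrays, which is the sense in which it is not individually addressable.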

Why peaks aren't addressable is probably a use-case thing, but implementation-wise it becomes tricky to do it uniformly across all data types (too much is dependent on context). Furthermore, as @chambm brought up at the PSI conference a year or two ago, the spectrum we store is not necessarily the spectrum we analyze, e.g. spectrum summing/averaging, peak filtering, or even just signal processing up-front (peak picking, deisotoping).

davidshumway commented 3 years ago

That SPARQL query pretty much captures the functionality I'm considering. Thanks, @mobiusklein!

mobiusklein commented 3 years ago

This isn't to say we can't add those terms, it's just that they don't have existing users, and any attempt to use them at scale is going to need to be clever about it, and make it clear that the design goal is to capture peak-picked data.

I'm sure the graph database community can prove me wrong about there not being an efficient way to store the heterogeneous predicates, but when I tried to do this with SQLite five years ago, it was very slow to write, though reasonably good for reading, at scale. The slow writes are largely due to all the indices that need to be maintained to make searches fast.

mobiusklein commented 3 years ago

@davidshumway is this something you want to discuss further, or is it safe to close this issue?

edeutsch commented 3 years ago

Closing for now. Reopen if necessary.