LachlanStuart closed this 4 years ago
@intsco Please review the latest 3 commits. I checked every imzML file in our main upload bucket and found there were 27 imzML files that had mismatched accession/name values for the datatype of the m/z and intensity arrays. This caused the code in this PR to read corrupt data and usually crash due to reading past the end of the .ibd file. Specifically, these were due to an early version of ImzMLWriter from this library, and an early version of Xcalibur.
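To illustrate the failure mode (this is an illustration, not code from the PR): when the declared datatype disagrees with the datatype that was actually written, the byte counts no longer line up, so values decode as garbage and offsets/lengths computed with the wrong item size can point past the end of the .ibd file.

```python
import numpy as np

# Illustration only (not code from this PR): why a wrong datatype
# accession leads to corrupt reads. 100 intensities written as 64-bit
# floats occupy 800 bytes.
data = np.arange(100, dtype=np.float64)
raw = data.tobytes()

# Reading back with the correct dtype round-trips cleanly.
assert np.array_equal(np.frombuffer(raw, dtype=np.float64), data)

# Reading the same bytes as 32-bit floats (what a mismatched accession
# implies) yields twice as many elements, all nonsense -- and any
# offset/length computed with the wrong item size can point past the
# end of the .ibd file, which is the crash described above.
wrong = np.frombuffer(raw, dtype=np.float32)
assert len(wrong) == 2 * len(data)
```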
I suspect many datasets will generate warnings. It'll be spammy, but I'd prefer warnings over hard-to-debug issues... I tested affected datasets from ImzMLWriter and Xcalibur, and got this output:
```
> p = ImzMLParser('/home/lachlan/Documents/datasets/Untreated_3_434.imzML')
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:81: UserWarning: Accession MS:1000523 found with incorrect name "32-bit float" (expected "64-bit float"). This is a known issue with some imzML conversion software - updating accession to MS:1000521.
  'to %s.' % (accession, raw_name, name, fixed_accession)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:73: UserWarning: Unrecognized accession in <cvParam>: MS:xxx (name: "pyimzml").
  warn('Unrecognized accession in <cvParam>: %s (name: "%s").' % (accession, raw_name))

> p = ImzMLParser('/home/lachlan/data/old_xcalibur_dataset.imzML', parse_lib='ElementTree')
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000563 found with incorrect name "Thermo RAW file". Updating name to "Thermo RAW format".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000590 found with incorrect name "contact organization". Updating name to "contact affiliation".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:81: UserWarning: Accession MS:1000521 found with incorrect name "64-bit float" (expected "32-bit float"). This is a known issue with some imzML conversion software - updating accession to MS:1000523.
  'to %s.' % (accession, raw_name, name, fixed_accession)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000042 found with incorrect name "max count of pixel x". Updating name to "max count of pixels x".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000043 found with incorrect name "max count of pixel y". Updating name to "max count of pixels y".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000046 found with incorrect name "pixel size x". Updating name to "pixel size (x)".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000838 found with incorrect name "sprayed". Updating name to "sprayed MALDI matrix preparation".
  % (accession, raw_name, name)
```
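The warn-and-fix behaviour above can be sketched as follows. This is a simplified illustration, not the PR's actual code: the real logic lives in `pyimzml/ontology/ontology.py` and uses the full generated ontology tables; `TERMS`, `NAME_TO_ACCESSION` and `fix_cv_param` are hypothetical names.

```python
import warnings

# Hypothetical, tiny accession -> canonical-name table; the real tables
# are generated from the .obo ontology dumps.
TERMS = {
    'MS:1000521': '32-bit float',
    'MS:1000523': '64-bit float',
}
# For the known float-width mixup, the *name* is trusted and the
# accession is corrected to whichever term the name actually belongs to.
NAME_TO_ACCESSION = {name: acc for acc, name in TERMS.items()}

def fix_cv_param(accession, raw_name):
    """Return a cleaned (accession, name) pair, warning about mismatches."""
    expected_name = TERMS.get(accession)
    if expected_name is None:
        warnings.warn('Unrecognized accession in <cvParam>: %s (name: "%s").'
                      % (accession, raw_name))
        return accession, raw_name
    if raw_name == expected_name:
        return accession, raw_name
    if raw_name in NAME_TO_ACCESSION:
        # Known datatype mixup: keep the name, fix the accession.
        fixed_accession = NAME_TO_ACCESSION[raw_name]
        warnings.warn('Accession %s found with incorrect name "%s" '
                      '(expected "%s"). Updating accession to %s.'
                      % (accession, raw_name, expected_name, fixed_accession))
        return fixed_accession, raw_name
    # Otherwise: keep the accession, fix the name.
    warnings.warn('Accession %s found with incorrect name "%s". '
                  'Updating name to "%s".' % (accession, raw_name, expected_name))
    return accession, expected_name

# A 64-bit accession whose name says 32-bit: the accession gets corrected.
print(fix_cv_param('MS:1000523', '32-bit float'))  # ('MS:1000521', '32-bit float')
```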
I also found that the ontology dumps excluded obsolete terms. This generated spurious warnings as some terms were present in the above files (such as MS:1000843 wavelength), so I re-dumped the ontologies with these obsolete terms included.
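For context, reading accession/name pairs out of an `.obo` file looks roughly like this. `parse_obo_terms` is a hypothetical sketch, not the script used to regenerate the dumps; the key point is that `is_obsolete: true` stanzas are kept rather than skipped.

```python
def parse_obo_terms(text):
    """Parse accession -> name from .obo [Term] stanzas.

    Obsolete terms (is_obsolete: true) are deliberately retained, since
    real-world imzML files still reference them.
    """
    terms = {}
    term_id = name = None
    for line in text.splitlines() + ['[Term]']:  # sentinel flushes last stanza
        line = line.strip()
        if line == '[Term]':
            if term_id:
                terms[term_id] = name
            term_id = name = None
        elif line.startswith('id: '):
            term_id = line[4:]
        elif line.startswith('name: '):
            name = line[6:]
    return terms

obo = """
[Term]
id: MS:1000843
name: wavelength
is_obsolete: true
"""
print(parse_obo_terms(obo))  # {'MS:1000843': 'wavelength'}
```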
This PR adds support for extracting metadata at two levels:

- File-level metadata, held in the `Metadata` class and exposed via `ImzMLParser.metadata`.
- Spectrum-level metadata from the `<spectrum>` elements. Due to performance, there are several options for parsing this data:
  1. (Default) Don't parse per-spectrum metadata.
  2. Parse specific accessions into per-accession lists, exposed via `ImzMLParser.spectrum_metadata_fields`. E.g. for TIC values (accession MS:1000285) it would hold `{'MS:1000285': [123.456, 123.456, ...]}`. This takes ~5% extra time per accession ID.
  3. Parse everything into a `SpectrumData` class per spectrum, exposed via the list `ImzMLParser.spectrum_full_metadata`. This takes ~2.5x as much time as the default mode.

The 2nd option is what I intend to use for METASPACE to extract TIC and injection time. I added the 3rd option for completeness, as this library exists for more than just METASPACE.

In order to provide a good interface to the data:
- `Metadata.__init__` and `SpectrumData.__init__` destructure the XML element hierarchy defined by the mzML spec into Python objects, dicts and lists. There was an extremely common pattern of holding a collection of Controlled Vocabulary and user-defined values, which I implemented as `ParamGroup`.
- The `ontology/ms.py`, etc. dumps of the `.obo` ontology files provide a mapping of accession IDs to spec-defined names and data types. This ensures that the dicts of Controlled Vocabulary values can be accessed consistently, even if implementations have fields with typos, etc.
- The raw parsed parameters are also kept on `ParamGroup`, which is a lot less "lossy". I feel this was required to handle edge cases (e.g. multiple definitions with the same accession ID) and other use cases (e.g. retrieving units).

There are several missing features of this implementation:
- Enum-like parameters are awkward to query. E.g. negative mode is represented as an `MS:1000129 negative scan` parameter with no value. Ideally this would be retrievable as a value of the `MS:1000465 scan polarity` parameter (which is never explicitly used). I.e. at the moment you have to ask "is this positive mode? is this negative mode?" etc. for every possible enum value, when it would be preferable to have one field so that you could ask "what mode is it?". Frankly, it was too time-consuming to get a useful list of these relationships out of the ontology data due to inconsistencies in how the relationships were defined.
- XML attributes aren't exposed as Python attributes. It would be nice to write `source_file.location` instead of `source_file.attrs.get('location')`, but it seemed too low value / high cost to implement. Also, unlike subelements, XML attributes aren't converted from camelCase to snake_case.
- The `units` that can be added to parameters aren't exposed in any useful high-level manner. This is a rabbit hole I'd prefer not to explore until I have a solid need for it.

See `tests/test_basic.py` for examples of usage with the new API. There is also a `Metadata.pretty()` function which dumps a human-readable JSON-ifiable dict of the data, with output like this: https://gist.github.com/LachlanStuart/343f2b42815a3c64e15308a200ab91c9

In addition to the above changes, I updated the project metadata a bit, including dropping support for Python 2.7, because the code was already using language features not available in 2.7 and nobody has complained.
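As a rough illustration of the `ParamGroup` idea described above (a toy sketch with assumed names, not the actual class): parameters are kept both as a convenient name-keyed dict, resolved through the ontology tables so typo'd names still land on the spec-defined key, and as the raw parsed list, so duplicate accessions and units aren't lost.

```python
# Toy sketch of the ParamGroup pattern (hypothetical simplification of
# the real pyimzml class).
class ParamGroup:
    def __init__(self, cv_params, ontology):
        # Raw, lossless storage: every (accession, name, value) as parsed.
        # Duplicated accession IDs survive here even though they would
        # collide in the dict below.
        self.cv_params = list(cv_params)
        # Convenient lookup keyed by the spec-defined name from the
        # ontology; ontology.get falls back to the file's own name for
        # unrecognized accessions.
        self.param_by_name = {
            ontology.get(acc, name): value
            for acc, name, value in cv_params
        }

ontology = {'MS:1000285': 'total ion current'}  # accession -> canonical name
# The file spells the name "TIC", but lookup works via the canonical name:
group = ParamGroup([('MS:1000285', 'TIC', 123.456)], ontology)
print(group.param_by_name['total ion current'])  # 123.456
```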
For testing, I ran this against ~30 imzML files from various sources as a stability check, to ensure the metadata parsing introduced no crashes. I'm not 100% confident that all the mappings were done correctly - the checks in `tests/test_basic.py` were limited by which fields/sections were available in the test datasets, and I only did spot checks on a couple of other fields from other datasets.
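For anyone wanting to run the same kind of stability check, a minimal sketch (hypothetical helper, not the script used here) that collects failures instead of stopping at the first crash:

```python
import traceback

def stability_check(paths, parse):
    """Try to parse every file, collecting failures instead of crashing.

    `parse` is whatever callable opens one file, e.g.
    `lambda p: ImzMLParser(p)` -- hypothetical wiring, not from the PR.
    """
    failures = {}
    for path in paths:
        try:
            parse(path)
        except Exception as exc:
            failures[path] = ''.join(
                traceback.format_exception_only(type(exc), exc)
            ).strip()
    return failures

# Demo with a stand-in parser that rejects one "file":
def fake_parse(path):
    if path.endswith('bad.imzML'):
        raise ValueError('mismatched accession')

print(stability_check(['a.imzML', 'bad.imzML'], fake_parse))
# {'bad.imzML': 'ValueError: mismatched accession'}
```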