alexandrovteam / pyimzML

A parser to read .imzML files with python
Apache License 2.0

Add support for parsing all imzML metadata fields #21

Closed LachlanStuart closed 4 years ago

LachlanStuart commented 4 years ago

This PR adds support for extracting metadata at two levels:

In order to provide a good interface to the data:

There are several missing features of this implementation:

See tests/test_basic.py for examples of usage with the new API. There is also a Metadata.pretty() function which dumps a human-readable JSON-ifiable dict of the data, with output like this: https://gist.github.com/LachlanStuart/343f2b42815a3c64e15308a200ab91c9

In addition to the above changes, I updated the project metadata a bit, including dropping support for Python 2.7, because the code was already using language features not available in 2.7 and nobody has complained.

For testing, I ran this against ~30 imzML files from various sources as a stability check to ensure the metadata parsing introduced no crashes. I'm not 100% confident that all the mappings were done correctly: the checks in tests/test_basic.py were limited by which fields/sections were available in the test datasets, and I only did spot checks on a couple of other fields from other datasets.

LachlanStuart commented 4 years ago

@intsco Please review the latest 3 commits. I checked every imzML file in our main upload bucket and found 27 imzML files that had mismatched accession/name values for the datatype of the m/z and intensity arrays. This caused the code in this PR to read corrupt data and usually crash due to reading past the end of the .ibd file. Specifically, the affected files were written by an early version of ImzMLWriter from this library and by an early version of Xcalibur.
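To illustrate why a mislabelled datatype leads to reads past the end of the .ibd file, here is a minimal sketch. The function name `array_fits`, the header size and the array length are hypothetical and not taken from pyimzML. If an array was written as 32-bit floats but the metadata claims 64-bit floats, the parser asks for twice as many bytes as were stored:

```python
import struct

# struct format codes for the two float datatypes discussed above
ITEM_FORMATS = {"32-bit float": "f", "64-bit float": "d"}


def array_fits(offset, length, dtype_name, ibd_size):
    """Return True if `length` items of `dtype_name` starting at `offset` lie inside the file."""
    item_size = struct.calcsize("<" + ITEM_FORMATS[dtype_name])
    return offset + length * item_size <= ibd_size


# Hypothetical file: a 16-byte header followed by 1000 values stored as 32-bit floats.
ibd_size = 16 + 1000 * 4

print(array_fits(16, 1000, "32-bit float", ibd_size))  # datatype as actually written
print(array_fits(16, 1000, "64-bit float", ibd_size))  # datatype as mislabelled
```

A check like this only catches the case where the claimed datatype is wider than the stored one. In the opposite direction the read stays in bounds but returns garbage values, which is why fixing the metadata up front is the safer option.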

I suspect many datasets will generate warnings. It'll be spammy, but I'd prefer warnings over hard-to-debug issues... I tested affected datasets from ImzMLWriter and Xcalibur, and got this output:

> p = ImzMLParser('/home/lachlan/Documents/datasets/Untreated_3_434.imzML')
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:81: UserWarning: Accession MS:1000523 found with incorrect name "32-bit float" (expected "64-bit float"). This is a known issue with some imzML conversion software - updating accession to MS:1000521.
  'to %s.' % (accession, raw_name, name, fixed_accession)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:73: UserWarning: Unrecognized accession in <cvParam>: MS:xxx (name: "pyimzml").
  warn('Unrecognized accession in <cvParam>: %s (name: "%s").' % (accession, raw_name))

> p = ImzMLParser('/home/lachlan/data/old_xcalibur_dataset.imzML', parse_lib='ElementTree')
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000563 found with incorrect name "Thermo RAW file". Updating name to "Thermo RAW format".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000590 found with incorrect name "contact organization". Updating name to "contact affiliation".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:81: UserWarning: Accession MS:1000521 found with incorrect name "64-bit float" (expected "32-bit float"). This is a known issue with some imzML conversion software - updating accession to MS:1000523.
  'to %s.' % (accession, raw_name, name, fixed_accession)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000042 found with incorrect name "max count of pixel x". Updating name to "max count of pixels x".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000043 found with incorrect name "max count of pixel y". Updating name to "max count of pixels y".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession IMS:1000046 found with incorrect name "pixel size x". Updating name to "pixel size (x)".
  % (accession, raw_name, name)
/home/lachlan/dev/pyimzML/pyimzml/ontology/ontology.py:88: UserWarning: Accession MS:1000838 found with incorrect name "sprayed". Updating name to "sprayed MALDI matrix preparation".
  % (accession, raw_name, name)
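The reconciliation behind these warnings can be sketched roughly as follows. This is a simplified illustration, not the actual code in pyimzml/ontology/ontology.py: the names `TERMS`, `DTYPE_ACCESSIONS` and `reconcile` are hypothetical, and the table holds only three terms. It assumes that for datatype terms the written name is trusted over the accession, while for all other terms the accession is trusted and the name is corrected:

```python
import warnings

# Tiny excerpt of an accession -> name table; a real ontology dump is much larger.
TERMS = {
    "MS:1000521": "32-bit float",
    "MS:1000523": "64-bit float",
    "MS:1000563": "Thermo RAW format",
}
# Datatype terms, where the name is assumed to be more reliable than the accession.
DTYPE_ACCESSIONS = {"MS:1000521", "MS:1000523"}
NAME_TO_DTYPE_ACCESSION = {TERMS[acc]: acc for acc in DTYPE_ACCESSIONS}


def reconcile(accession, raw_name):
    """Return a consistent (accession, name) pair, warning whenever a fix is applied."""
    if accession not in TERMS:
        warnings.warn('Unrecognized accession in <cvParam>: %s (name: "%s").'
                      % (accession, raw_name))
        return accession, raw_name

    name = TERMS[accession]
    if raw_name == name:
        return accession, name

    if accession in DTYPE_ACCESSIONS and raw_name in NAME_TO_DTYPE_ACCESSION:
        # Known writer bug: keep the name and switch to the accession that matches it.
        fixed_accession = NAME_TO_DTYPE_ACCESSION[raw_name]
        warnings.warn('Accession %s found with incorrect name "%s" (expected "%s"). '
                      'Updating accession to %s.'
                      % (accession, raw_name, name, fixed_accession))
        return fixed_accession, raw_name

    # Any other mismatch: keep the accession and use the ontology's name.
    warnings.warn('Accession %s found with incorrect name "%s". Updating name to "%s".'
                  % (accession, raw_name, name))
    return accession, name


print(reconcile("MS:1000523", "32-bit float"))
print(reconcile("MS:1000563", "Thermo RAW file"))
```

Callers who find the output too noisy can silence it with the standard library's `warnings.filterwarnings("ignore", category=UserWarning)`, at the cost of hiding the fact that a file's metadata was corrected.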

I also found that the ontology dumps excluded obsolete terms. This generated spurious "unrecognized accession" warnings, because some obsolete terms (such as MS:1000843 wavelength) are present in the above files, so I re-dumped the ontologies with obsolete terms included.
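As a rough illustration of what including obsolete terms means when dumping an ontology, here is a minimal sketch that reads `[Term]` stanzas from OBO-formatted text. The function `parse_obo_terms` and the two-term `SAMPLE` are hypothetical and are not the script used to generate the dumps in this PR. It assumes stanzas are separated by blank lines and that obsolete terms carry an `is_obsolete: true` tag:

```python
def parse_obo_terms(text, include_obsolete=True):
    """Map term id -> name from OBO text, optionally keeping obsolete terms."""
    terms = {}
    for stanza in text.split("\n\n"):
        lines = [ln.strip() for ln in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip the header and non-term stanzas
        fields = {}
        for ln in lines[1:]:
            key, _, value = ln.partition(": ")
            fields.setdefault(key, value)  # keep the first value of repeated tags
        obsolete = fields.get("is_obsolete") == "true"
        if "id" in fields and (include_obsolete or not obsolete):
            terms[fields["id"]] = fields.get("name", "")
    return terms


# Illustrative sample only; not copied from the real ontology file.
SAMPLE = """[Term]
id: MS:1000521
name: 32-bit float

[Term]
id: MS:1000843
name: wavelength
is_obsolete: true
"""

print(parse_obo_terms(SAMPLE))
print(parse_obo_terms(SAMPLE, include_obsolete=False))
```

With `include_obsolete=False`, MS:1000843 drops out of the table, so a file that still uses it would trigger an "unrecognized accession" warning even though the term was once valid.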