alexandrovteam / pyimzML

A parser to read .imzML files with python
Apache License 2.0

Orbitrap data crash #13

Closed Kawue closed 5 years ago

Kawue commented 5 years ago

I tried to read the imzML from an instrument called "MALDI/ESI Injector (Spectroglyph) coupled to Thermo's Orbitrap Q Exactive Plus." Parsing crashes without any error message. To rule out memory problems, I also tried parsing the data on a cluster with 1 TB of RAM, but it still crashes. The error code I get from VSCode is roughly connected to a memory heap problem. Is there a known fix?

intsco commented 5 years ago

Hi Kawue, how big are the imzML and ibd files? Did you try to measure the memory usage (with any GUI tool or with htop on Linux) while reading those files?
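If it is easier to log memory from inside the script than to watch htop, something along these lines could work. This is only a sketch: `iter_with_memory_log` is a hypothetical helper, not part of pyimzML, and it uses the standard-library `tracemalloc`, which only sees allocations made through Python's allocator. Memory used inside a native extension will not show up here; for the process-level figure that htop reports you would need an OS tool or a third-party package such as psutil.

```python
import tracemalloc


def iter_with_memory_log(items, every=1000, report=print):
    """Yield items unchanged and report traced Python memory every `every` items.

    Hypothetical helper for illustration; not part of pyimzML.
    """
    # Only start (and later stop) tracing if nobody else is already tracing.
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        for index, item in enumerate(items, start=1):
            if index % every == 0:
                current, peak = tracemalloc.get_traced_memory()
                report(
                    f"{index} items: current={current / 1e6:.1f} MB, "
                    f"peak={peak / 1e6:.1f} MB"
                )
            yield item
    finally:
        if started_here:
            tracemalloc.stop()
```

Wrapping the loop that crashes in `iter_with_memory_log(...)` shows the last item count reached before the crash. If the numbers stay flat right up to that point, the problem is more likely in native code than in Python-level memory growth.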

Kawue commented 5 years ago

Hey Vitaly, the ibd I tested is around 700 MB at 32-bit precision and 1.4 GB at 64-bit; the respective imzML is around 38 MB, with ~27,000 spectrum entries (unbinned). This is the smallest one I currently have; I also have bigger ones, up to 3.5 GB. I printed the memory usage via psutil during the loop where everything crashes, which is the XML loop in __iter_read_spectrum_meta(), but it always stayed below 100 MB. I tried to debug, but the loop just crashes at some point (always the same point). If I reduce the imzML to ~740 spectra, simply by deleting entries, it works fine. I also tried deleting from the beginning, from the end, and around this 740 point to exclude data set problems. While the VSCode error is related to a heap problem or a heap overflow (local 64-bit Windows machine with 32 GB RAM), the cluster, a 64-bit Linux machine with a maximal RAM usage of 400 GB / 1 TB, responded with "corrupted size vs. prev_size".

This is everything I have tried so far and all the information I have. If you need more information or want me to run specific tests, just tell me.

intsco commented 5 years ago

The file sizes don't look huge. On metaspace2020.eu we have more than 3k datasets that were read by pyimzML. I suspect it has something to do with a particular file or with the imzML export software. Ideally, you would share with us the smallest dataset that still crashes pyimzML so we can reproduce this behaviour.
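For trimming a dataset down, a rough sketch like the following may save some manual editing. It assumes the usual imzML layout, where the `<spectrum>` elements sit inside a `<spectrumList>` in the mzML namespace; `keep_first_spectra` is a hypothetical helper, not part of pyimzML. The .ibd file is left untouched, since the remaining spectra still point at their original offsets in it.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzML/imzML documents.
MZML_NS = "http://psi.hupo.org/ms/mzml"


def keep_first_spectra(src, dst, n_keep):
    """Write a copy of the imzML at `src` to `dst` with only the first `n_keep` spectra.

    `src` and `dst` may be paths or binary file objects.
    Hypothetical helper for illustration; not part of pyimzML.
    """
    if n_keep < 0:
        raise ValueError("n_keep must be non-negative")
    # Keep the default namespace unprefixed in the output.
    ET.register_namespace("", MZML_NS)
    tree = ET.parse(src)
    spectrum_list = tree.getroot().find(f".//{{{MZML_NS}}}spectrumList")
    if spectrum_list is None:
        raise ValueError("no <spectrumList> element found")
    spectra = spectrum_list.findall(f"{{{MZML_NS}}}spectrum")
    for spectrum in spectra[n_keep:]:
        spectrum_list.remove(spectrum)
    kept = min(n_keep, len(spectra))
    # Keep the declared count consistent with the remaining entries.
    spectrum_list.set("count", str(kept))
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return kept
```

Note that this loads the whole XML into memory, which should be fine for a file of a few tens of MB. Calling it with different values of `n_keep` is one way to narrow down the smallest file that still triggers the crash.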

Kawue commented 5 years ago

Sure, any preference where to upload the data so you can easily access it?

intsco commented 5 years ago

Feel free to use any cloud storage like Dropbox or Amazon S3. Just send me a link via email.

intsco commented 5 years ago

There is something wrong with memory allocation inside the Python interpreter. I couldn't figure out whether it's a bug in pyimzML or in the imzML files. So far, I can only offer a workaround: installing pyimzML with Python>=3.6 solved the problem, in my case at least.

Kawue commented 5 years ago

Now that you mention memory allocation, I remember getting that error somewhere too. My earlier tests were all in Python 3.7. I have now tested Python 3.6 and 3.5 and it works as intended, so I guess you meant Python<=3.6. As long as this problem still exists, you may want to include a compatibility note in your readme.

Thanks a lot for your help!

Kawue commented 5 years ago

The same error appears again. This time I tried Python 3.5, 3.6 and 3.7. I tested two Orbitrap data sets and it appears for both. I am still not sure whether it is a data set problem or a pyimzML problem.

intsco commented 5 years ago

Hi @Kawue, just to double-check: you are able to import these imzML files into some other software and browse the data, right?

Kawue commented 5 years ago

I only tried the vendor software and it works there. Every other software I currently use relies on your package. My colleagues from the biological department tried rMSI, imzMLValidator and msiQuant without success. They will try Cardinal today as well.

Kawue commented 5 years ago

The newest version of Cardinal works.

intsco commented 5 years ago

@Kawue To reproduce the issue we'll need some data. Would you mind sharing the smallest dataset you have with us? It would also be useful to know whether you can read the files with https://msireader.wordpress.ncsu.edu/

Kawue commented 5 years ago

Although it tells me there is a UUID problem, MSiReader is able to read the imzML. I will send you the smallest test data set via mail.

intsco commented 5 years ago

@Kawue It turned out there is an issue with the XML-parsing library we use by default. Try installing a dev version of the package from GitHub:

pip install -e git+https://github.com/alexandrovteam/pyimzML@feature/iterparse-choice#egg=pyimzML

Then check whether specifying a different library solves the issue for your datasets:

from pyimzml.ImzMLParser import ImzMLParser
parser = ImzMLParser(filename, parse_lib='ElementTree')
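
To check independently of pyimzML whether a given XML library can walk the file at all, a small streaming pass over the `<spectrum>` elements can help. The sketch below is an illustration only: `pick_iterparse` and `count_spectra` are hypothetical helpers, and the selection logic is an assumption about how such a switch could look, not a copy of what the package does internally.

```python
def pick_iterparse(parse_lib=None):
    """Return an iterparse function for the requested library.

    Hypothetical helper for illustration; not part of pyimzML.
    """
    if parse_lib == "ElementTree":
        from xml.etree.ElementTree import iterparse
    elif parse_lib == "lxml":
        from lxml.etree import iterparse
    elif parse_lib is None:
        # No preference: try lxml, fall back to the standard library.
        try:
            from lxml.etree import iterparse
        except ImportError:
            from xml.etree.ElementTree import iterparse
    else:
        raise ValueError(f"unknown parse_lib: {parse_lib!r}")
    return iterparse


def count_spectra(source, parse_lib="ElementTree"):
    """Stream through an imzML file (path or binary file object) and count spectra."""
    iterparse = pick_iterparse(parse_lib)
    spectrum_tag = "{http://psi.hupo.org/ms/mzml}spectrum"
    count = 0
    for _event, elem in iterparse(source, events=("end",)):
        if elem.tag == spectrum_tag:
            count += 1
            # Drop the element's children and text once it has been counted.
            elem.clear()
    return count
```

If `count_spectra(path, parse_lib="ElementTree")` finishes while the same call with `parse_lib="lxml"` crashes, that points at the XML library rather than at the data set.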

Kawue commented 5 years ago

I did a few tests and it seems to work. Thanks for your help!