Kawue / imzML-to-HDF5

Small parser to convert an imzML file into an HDF5 file.

memory error #1

Open kailaw138 opened 9 months ago

kailaw138 commented 9 months ago

The computer is already equipped with 64 GB of RAM, but it still does not work:

[Image: HD5]

Is there any fix, please?

Kawue commented 9 months ago

I have not worked in this area for quite some time now. However, I will try to fix your problems.

How large is your data set? 64 GB might not be enough, since there is no smart batch procedure. I always worked on larger compute clusters.

kailaw138 commented 9 months ago

In this case, the imzML file is 22.5 MB and the ibd file is 53.9 GB (so apparently the total is > 64 GB, since Windows also consumes 12 GB).

(The spatial resolution was only 100 um, and I would like to increase that to 50 um, so the files will get even larger as I progress in my study.)

Thank you for fixing that issue and looking forward to that.

Kawue commented 9 months ago

If you have some smaller files, it would be interesting to see if the program works flawlessly with these. However, for such big files you have to use a compute cluster. The way HDF5 is used in this method is non-sparse, meaning that the resulting data is likely to be larger in size than your imzML.

Sorry but there is no easy way to fix this. Using files of that size on a single consumer computer would require a completely new approach of handling the data. However, if you can get access to a compute cluster in your institution that problem should be solved.

Edit: just for completeness, when I talk about the imzML I always refer to the combination of imzml + ibd

kailaw138 commented 9 months ago

Unnnn..... I have tested small files, and it still does not work:

[Image: Capture]

the files are small:

[Image: Capture2]

PS. The computer I am using is not a consumer PC; it is a server computer/workstation with 20 physical cores (40 with hyperthreading), 64 GB RAM and 32 TB storage. If it does not work on this computer, it is not going to work on Windows at all, perhaps including Windows Server.

Any alternative to convert the files to h5?

Kawue commented 9 months ago

It's hard to investigate this error remotely without data. If you are allowed to send me this test file, I can try to investigate the error. There is no alternative to HDF5 for my methods. However, the code is fairly old. I am quite confident that there are better methods out there by now. If not, and you really need this code, I will still try to help you resolve the issues. But without test data this will be nearly impossible.

kailaw138 commented 9 months ago

The data above is a demo data set from massPix: an R package for annotation and interpretation of mass spectrometry imaging data for lipidomics. Metabolomics (2017) 13:128, available at:

https://www.ebi.ac.uk/metabolights/editor/MTBLS487/files
http://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS487/

A Google search only leads to two methods to convert imzML to h5: one is your imzML-to-HDF5, and the other is pyImagingMSpec (which is older than yours). Code, docs and conda package are as follows:

https://github.com/alexandrovteam/pyImagingMSpec
https://pyimagingmspec.readthedocs.io/en/latest/pyImagingMSpec.convert.html#pyimagingmspec-convert-h5-module
https://anaconda.org/bioconda/pyimagingmspec

Kawue commented 9 months ago

I played a bit with the test data. There is actually no problem with the code. It's still a size problem of the HDF5. That's a major problem with high-resolution data. So you have 10471869 m/z values based on the given resolution and 1701 pixel coordinates. Each number is encoded as int64, which requires 8 bytes.

10471869 × 1701 × 8 = 142501193352 bytes
((142501193352 / 1000) / 1000) / 1000 = ~142.5 GB

We could reduce it to ~71.25 GB by using int32 or even further to ~35.63 GB by using int16. My local system here has 32 GB of RAM, so I cannot test this for you. If the number range of int16 is enough for your data, I can guide you through adjusting this part of the code.
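The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming a fully dense matrix with one value per (pixel, m/z) pair and decimal gigabytes; the function and constant names are made up for illustration and are not part of the parser.

```python
# Bytes needed per stored value for the integer types discussed above.
BYTES_PER_VALUE = {"int64": 8, "int32": 4, "int16": 2}


def dense_size_gb(n_mz, n_pixels, dtype):
    """Size in decimal gigabytes of a dense n_pixels x n_mz matrix."""
    return n_mz * n_pixels * BYTES_PER_VALUE[dtype] / 1000**3


N_MZ = 10471869   # number of m/z values, taken from the comment above
N_PIXELS = 1701   # number of pixel coordinates, taken from the comment above

for dtype in ("int64", "int32", "int16"):
    print(f"{dtype}: ~{dense_size_gb(N_MZ, N_PIXELS, dtype):.2f} GB")
```

Note that a signed int16 only covers the range -32768 to 32767, so it is only an option if no intensity in the data exceeds that range.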

However, for data with that resolution it is advisable to use much smarter storage systems than in-memory HDF5. On the other hand, if you just throw enough memory at it, the code should still run.

kailaw138 commented 9 months ago

Thank you again for your support. I would be glad to have your instructions for adjusting the code.

On the other hand, I have managed to use pyimzML and h5py to achieve the conversion without memory error:

[Image: Web capture_14-12-2023_154337_localhost]

although I am not sure that it has converted the file correctly (the structure is different from the demo I got):

[Image: HDFViewer]

(the first one is a demo h5, the lower one is my h5 using the above study)

Kawue commented 9 months ago

That conversion should be totally fine I guess.

However, this format of yours should be quite complicated to work with if you want to map one m/z value to an intensity map. On the other hand, that format should require a lot less space, since you have no need to fill a matrix up with zeros. So your conversion is much more compact than mine.
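To make the trade-off concrete, here is a minimal sketch of how one m/z value could be mapped to an intensity image when the data is stored per spectrum rather than as a dense matrix. The layout is an assumption: `spectra` is a hypothetical in-memory dictionary from pixel coordinate to a pair of lists (sorted m/z values, intensities), standing in for whatever the converted h5 file actually contains.

```python
from bisect import bisect_left, bisect_right


def intensity_map(spectra, target_mz, tol):
    """Sum the intensities within target_mz +/- tol for every pixel.

    `spectra` maps (x, y) -> (mzs, intensities), with mzs sorted ascending.
    This is a hypothetical stand-in for a per-spectrum storage layout.
    """
    image = {}
    for (x, y), (mzs, intensities) in spectra.items():
        # Binary search for the window of peaks that fall inside the tolerance.
        lo = bisect_left(mzs, target_mz - tol)
        hi = bisect_right(mzs, target_mz + tol)
        image[(x, y)] = sum(intensities[lo:hi])
    return image


# Tiny made-up example with three pixels.
spectra = {
    (1, 1): ([100.0, 150.5, 200.0], [10.0, 5.0, 1.0]),
    (2, 1): ([100.1, 180.0], [7.0, 3.0]),
    (1, 2): ([150.4, 150.6], [2.0, 4.0]),
}
img = intensity_map(spectra, target_mz=150.5, tol=0.2)
```

Every pixel has to be visited and searched separately, which is the extra work the dense matrix avoids, but no zeros are stored for m/z values that a pixel does not contain.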

Kawue commented 9 months ago

Now that I thought a bit about it, I think using sqlite instead of hdf5 could have been a better choice.

Although this is not a big help for your current problem.
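As a rough illustration of the sqlite idea, the sketch below stores only the peaks that actually exist as (x, y, mz, intensity) rows and lets an index on `mz` answer the "one m/z value to intensity map" query. The table and column names, the sample rows and the helper function are all assumptions made for this example; nothing here is taken from the parser.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path would keep the
# data on disk instead of in RAM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE peaks (x INTEGER, y INTEGER, mz REAL, intensity REAL)"
)
# The index lets sqlite find an m/z window without scanning every row.
conn.execute("CREATE INDEX idx_peaks_mz ON peaks (mz)")

# Made-up peaks: only non-zero values are stored.
rows = [
    (1, 1, 100.0, 10.0),
    (1, 1, 150.5, 5.0),
    (2, 1, 100.1, 7.0),
    (1, 2, 150.4, 2.0),
    (1, 2, 150.6, 4.0),
]
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", rows)
conn.commit()


def intensity_map_sql(conn, target_mz, tol):
    """Return {(x, y): summed intensity} for peaks within target_mz +/- tol."""
    cur = conn.execute(
        "SELECT x, y, SUM(intensity) FROM peaks "
        "WHERE mz BETWEEN ? AND ? GROUP BY x, y",
        (target_mz - tol, target_mz + tol),
    )
    return {(x, y): total for x, y, total in cur}


result = intensity_map_sql(conn, target_mz=150.5, tol=0.2)
```

Pixels without a peak in the window simply do not appear in the result, so the caller would fill those positions with zero when building the image.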