fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Memory usage for pandas.read_xml #362

Closed tcompa closed 1 year ago

tcompa commented 1 year ago

We have examples where create-ome-zarr goes out of memory when the limit is set to 1G or 2G, e.g. for an XML file of 160k lines (see https://github.com/fractal-analytics-platform/fractal-server/issues/599#issuecomment-1503259444).

Maybe it's worth checking that we are using pandas.read_xml correctly. We can quickly profile the memory usage of this function, and possibly look around for known issues (https://github.com/pandas-dev/pandas/issues/45442 may be related).
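A quick way to profile this is to wrap the pandas.read_xml call in tracemalloc. A minimal sketch, using a small synthetic XML (the record structure here is made up and does not match the real microscope metadata layout):

```python
import io
import tracemalloc

import pandas as pd

# Build a small synthetic XML document in memory
# (hypothetical structure, for profiling only).
rows = "".join(
    f"<record><well>A{i % 9 + 1:02d}</well><field>{i}</field></record>"
    for i in range(1000)
)
xml = f"<data>{rows}</data>"

# Measure peak Python-level allocations during parsing.
tracemalloc.start()
df = pd.read_xml(io.StringIO(xml), parser="etree")
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"rows={len(df)}, peak={peak / 1e6:.1f} MB")
```

Scaling the synthetic document up towards the real file size would show how peak memory grows with the number of lines.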

If all looks reasonable on the XML-parsing side, should we set a more generous default memory in the manifest? It's a non-parallel task, and it should be simple for SLURM to schedule it even if it requires 4G.
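For reference, the kind of manifest change being discussed might look like the following sketch (field names and units are assumptions, not copied from the actual __FRACTAL_MANIFEST__.json):

```json
{
  "name": "Create OME-Zarr structure",
  "executable": "tasks/create_ome_zarr.py",
  "meta": {
    "cpus_per_task": 1,
    "mem": 4000
  }
}
```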

jluethi commented 1 year ago

I wouldn't invest too much time into this. 4G for parsing the metadata of a ~1 million image microscope acquisition is not unreasonable. See:

The XML file for the full 23-well example is way bigger than the tiny examples, so it's not unreasonable that parsing it is more memory-hungry. And if that's the case, the 23-well example is probably close to an upper bound of XML sizes we'd normally hit: it's not many wells, but imaging for ~14h produces something on the order of a million images (=> a million lines in the XML file). Thus, we may want to adjust the default memory to something like 4G for the Create OME-Zarr task as well, after all.
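A back-of-envelope estimate supports this: even at a modest bytes-per-line figure, a million-line XML file plus the usual inflation from parsing into a DataFrame lands well above the 1G-2G limits that failed. The per-line size and inflation factor below are assumptions, not measurements:

```python
# Rough estimate of parsing memory for a ~1M-line XML file.
lines = 1_000_000       # ~one line per image in the acquisition
bytes_per_line = 150    # assumed average XML line length
inflation = 5           # assumed parse/DataFrame overhead factor

raw_mb = lines * bytes_per_line / 1e6
est_peak_mb = raw_mb * inflation
print(f"raw file ≈ {raw_mb:.0f} MB, estimated peak ≈ {est_peak_mb:.0f} MB")
```

Under these assumptions the estimated peak is under 1 GB, but a safety margin against larger acquisitions and denser metadata makes a 4G default a comfortable choice.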

=> let's increase this default to 4G

It likely could be optimized further, but the potential gain is not worth the time investment for the time being.

tcompa commented 1 year ago

Now updated in fractal-tasks-core 0.9.2.