fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Memory usage for pandas.read_xml #362

Closed tcompa closed 1 year ago

tcompa commented 1 year ago

We have examples where create-ome-zarr goes out of memory when the limit is set to 1G or 2G, e.g. for an XML file of 160k lines (see https://github.com/fractal-analytics-platform/fractal-server/issues/599#issuecomment-1503259444).

Maybe it's worth checking that we are using pandas.read_xml correctly. We can quickly profile the memory usage of this function, and possibly look around for known issues (https://github.com/pandas-dev/pandas/issues/45442 may be related).
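A quick way to profile this is to wrap the pandas.read_xml call in tracemalloc. A minimal sketch, using a small synthetic XML (the record structure here is made up and does not match the real microscope metadata layout):

```python
import io
import tracemalloc

import pandas as pd

# Build a small synthetic XML document in memory
# (hypothetical structure, for profiling only).
rows = "".join(
    f"<record><well>A{i % 9 + 1:02d}</well><field>{i}</field></record>"
    for i in range(1000)
)
xml = f"<data>{rows}</data>"

# Measure peak Python-level allocations during parsing.
tracemalloc.start()
df = pd.read_xml(io.StringIO(xml), parser="etree")
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"rows={len(df)}, peak={peak / 1e6:.1f} MB")
```

Scaling the synthetic document up towards the real file size would show how peak memory grows with the number of lines.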

If all looks reasonable on the XML-parsing side, should we set a more generous default memory in the manifest? It's a non-parallel task, and it should be simple for SLURM to schedule it even if it requires 4G.
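For reference, the kind of manifest change being discussed might look like the following sketch (field names and units are assumptions, not copied from the actual __FRACTAL_MANIFEST__.json):

```json
{
  "name": "Create OME-Zarr structure",
  "executable": "tasks/create_ome_zarr.py",
  "meta": {
    "cpus_per_task": 1,
    "mem": 4000
  }
}
```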

jluethi commented 1 year ago

I wouldn't invest too much time into this. 4G for parsing the metadata of a ~1 million image microscope acquisition is not unreasonable. See:

The XML file for the full 23-well example is way bigger than the tiny examples, so it's not unreasonable that parsing it is more memory-hungry. And if that's the case, the 23-well example is probably close to an upper bound of XML sizes we'd normally hit: it's not many wells, but imaging for ~14h produces something on the order of a million images (=> a million lines in the XML file). Thus, we may want to adjust the default memory to something like 4G for the Create OME-Zarr task as well, after all.
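A back-of-envelope estimate supports this: even at a modest bytes-per-line figure, a million-line XML file plus the usual inflation from parsing into a DataFrame lands well above the 1G-2G limits that failed. The per-line size and inflation factor below are assumptions, not measurements:

```python
# Rough estimate of parsing memory for a ~1M-line XML file.
lines = 1_000_000       # ~one line per image in the acquisition
bytes_per_line = 150    # assumed average XML line length
inflation = 5           # assumed parse/DataFrame overhead factor

raw_mb = lines * bytes_per_line / 1e6
est_peak_mb = raw_mb * inflation
print(f"raw file ≈ {raw_mb:.0f} MB, estimated peak ≈ {est_peak_mb:.0f} MB")
```

Under these assumptions the estimated peak is under 1 GB, but a safety margin against larger acquisitions and denser metadata makes a 4G default a comfortable choice.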

=> let's increase this default to 4G

It likely could be optimized further, but the potential gain is not worth the time investment for the time being.

tcompa commented 1 year ago

Now updated in fractal-tasks-core 0.9.2.