TGSAI / mdio-python

Cloud native, scalable storage engine for various types of energy data.
https://mdio.dev/
Apache License 2.0
37 stars 13 forks source link

ENH: Reduce number of processes spawned by `ProcessPoolExecutor` in `blocked_io.py` #426

Open amitpendharkar opened 3 months ago

amitpendharkar commented 3 months ago

Issue: ProcessPoolExecutor spawns the processes based on max_workers argument. Total number of logical cpus are currently used as default value for max_workers https://github.com/TGSAI/mdio-python/blob/14757936e914589283233dedd9f55f87b1a95a6f/src/mdio/segy/blocked_io.py#L124

This spawns too many processes. Based of the size of the segy file, resource contention and context switching bogs the machine down and eventually the run get's terminated.

Suggested solution: Make default value of max_workers as either number of cores or limit it to 80% of available logical cpus.(discuss)

tasansal commented 3 months ago

Hi, thanks for looking into it. The max cpus is a good approximate for normal sized machines. If we are using a large machine with 48+ cores I can see where this could start trashing the machine.

MDIO takes the following environment variable to control number of CPUs used in ingestion: MDIO__IMPORT__CPU_COUNT. I propose, instead of changing the default, you should set this value based on your machine.

amitpendharkar commented 3 months ago

We noticed this issue only after we saw some random ingestion failures. Out of two similar files, one fails and other other succeeds. The failed one would sometimes succeed if re-submitted multiple times. This causes loss of time in production.

I believe setting the value to 80% of available logical cpus would take care of the issue. It would also leave some room to other processes (like opentelemetry) to run.

amitpendharkar commented 3 months ago

I believe, setting MDIO_IMPORT_CPUT_COUNT when pod get's launched makes things really unclear. One would not know why it's only using few resources. Coming up with right value for MDIO_IMPORT_CPUT_COUNT is also difficult.

Also, setting MDIO_IMPORT_CPUT_COUNT should not be the norm. It should be a one off thing.

We can make default_cpus = cpu_count(logical=False) below and other such places. https://github.com/TGSAI/mdio-python/blob/14757936e914589283233dedd9f55f87b1a95a6f/src/mdio/segy/blocked_io.py#L40

FYI: We routinely use 48+ core machines in production

I will update here with some benchmarking results to validate the change.