Open amitpendharkar opened 3 months ago
Hi, thanks for looking into it. The max cpus is a good approximate for normal sized machines. If we are using a large machine with 48+ cores I can see where this could start trashing the machine.
MDIO takes the following environment variable to control number of CPUs used in ingestion: MDIO__IMPORT__CPU_COUNT
.
I propose, instead of changing the default, you should set this value based on your machine.
We noticed this issue only after we saw some random ingestion failures. Out of two similar files, one fails and other other succeeds. The failed one would sometimes succeed if re-submitted multiple times. This causes loss of time in production.
I believe setting the value to 80% of available logical cpus would take care of the issue. It would also leave some room to other processes (like opentelemetry) to run.
I believe, setting MDIO_IMPORT_CPUT_COUNT
when pod get's launched makes things really unclear. One would not know why it's only using few resources. Coming up with right value for MDIO_IMPORT_CPUT_COUNT
is also difficult.
Also, setting MDIO_IMPORT_CPUT_COUNT
should not be the norm. It should be a one off thing.
We can make
default_cpus = cpu_count(logical=False)
below and other such places.
https://github.com/TGSAI/mdio-python/blob/14757936e914589283233dedd9f55f87b1a95a6f/src/mdio/segy/blocked_io.py#L40
FYI: We routinely use 48+ core machines in production
I will update here with some benchmarking results to validate the change.
Issue:
ProcessPoolExecutor
spawns the processes based on max_workers argument. Total number of logical cpus are currently used as default value for max_workers https://github.com/TGSAI/mdio-python/blob/14757936e914589283233dedd9f55f87b1a95a6f/src/mdio/segy/blocked_io.py#L124This spawns too many processes. Based of the size of the segy file, resource contention and context switching bogs the machine down and eventually the run get's terminated.
Suggested solution: Make default value of max_workers as either number of cores or limit it to 80% of available logical cpus.(discuss)