TGSAI / mdio-python

Cloud native, scalable storage engine for various types of energy data.
https://mdio.dev/
Apache License 2.0

ENH: Value in `MDIO__IMPORT__CPU_COUNT` does not get used for ingestion in a particular scenario #427

Open amitpendharkar opened 3 weeks ago

amitpendharkar commented 3 weeks ago

Issue

The value of the environment variable MDIO__IMPORT__CPU_COUNT should limit the number of processes spawned by ProcessPoolExecutor. However, the value is ignored in one particular situation.

Of the following two scenarios, the value is used in scenario 1 but not in scenario 2.

Scenario 1: Set the environment variable MDIO__IMPORT__CPU_COUNT and then run the script that invokes the segy_to_mdio function: works. E.g. launch a pod with the environment variable MDIO__IMPORT__CPU_COUNT already set and then run the script.

Scenario 2: Run the script, set the environment variable MDIO__IMPORT__CPU_COUNT inside that script, and then invoke the segy_to_mdio function: does not work. E.g. launch a pod, use an argument passed to the script to set the environment variable MDIO__IMPORT__CPU_COUNT in the script, and then invoke the segy_to_mdio function.

This happens because the value of NUM_CPUS is set once, when the code is loaded into memory and before execution starts, so the environment variable must already be set before the script runs.
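To illustrate the behaviour, here is a minimal standalone sketch that does not use mdio itself. NUM_CPUS stands in for the module-level constant, and the default of 8 and the worker_count helper are made up for the example.

```python
import os

ENV_VAR = "MDIO__IMPORT__CPU_COUNT"

# Start from a clean state so the example is deterministic.
os.environ.pop(ENV_VAR, None)

# Stand-in for a module-level constant: it is evaluated exactly once, when
# this line runs. For a real module, that moment is import time.
NUM_CPUS = int(os.getenv(ENV_VAR, "8"))  # 8 is an arbitrary default for the sketch


def worker_count() -> int:
    """Hypothetical helper that relies on the already-evaluated constant."""
    return NUM_CPUS


# Scenario 2: the variable is set only after the constant was evaluated.
os.environ[ENV_VAR] = "2"

# The constant still holds the value captured earlier, not the new "2".
late_value = worker_count()
```

In this sketch late_value stays at the default because the environment is consulted only once, which mirrors what scenario 2 runs into.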

Suggested solution

Re-read the value of MDIO__IMPORT__CPU_COUNT just before the following line and save it in NUM_CPUS. This would ensure that Scenario 2 also works. https://github.com/TGSAI/mdio-python/blob/14757936e914589283233dedd9f55f87b1a95a6f/src/mdio/segy/blocked_io.py#L122
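A rough sketch of the idea, not the actual mdio code: resolve_num_workers, square and run_pool are hypothetical names, and falling back to os.cpu_count() when the variable is unset is an assumption made for the example.

```python
import os
from concurrent.futures import ProcessPoolExecutor

ENV_VAR = "MDIO__IMPORT__CPU_COUNT"


def resolve_num_workers() -> int:
    """Read the environment variable at call time instead of import time."""
    raw = os.getenv(ENV_VAR)
    if raw is None:
        # Assumed fallback for this sketch: use every available CPU.
        return os.cpu_count() or 1
    # Never hand the executor a worker count below 1.
    return max(1, int(raw))


def square(x: int) -> int:
    """Trivial picklable task used only to exercise the pool."""
    return x * x


def run_pool(values: list[int]) -> list[int]:
    # The lookup happens here, right before the pool is created, so a value
    # set earlier in the same script is picked up.
    num_workers = resolve_num_workers()
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(square, values))
```

With this shape, setting os.environ["MDIO__IMPORT__CPU_COUNT"] anywhere before the call to run_pool takes effect, which covers both scenarios.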

tasansal commented 3 weeks ago

Hi Amit, thanks for sharing this. I'll try to find a cleaner way that satisfies both.

In the meantime, if you set it with os.environ before you import the mdio functions, it should still work for scenario 2. Can you please try it and let me know?
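Something along these lines is what I mean. The helper names are made up for illustration, and the sketch assumes segy_to_mdio is importable from the top-level mdio package and that nothing else in the process has imported mdio earlier.

```python
import os

ENV_VAR = "MDIO__IMPORT__CPU_COUNT"


def set_import_cpu_count(count: int) -> None:
    """Hypothetical helper: export the CPU limit for this process."""
    if count < 1:
        raise ValueError("CPU count must be at least 1")
    os.environ[ENV_VAR] = str(count)


def load_segy_to_mdio(count: int):
    """Set the variable first, then import, and return the ingestion function."""
    set_import_cpu_count(count)
    # Deferred import: the module-level constant is evaluated during this
    # import, after the variable has been set. This has no effect if mdio
    # was already imported elsewhere in the process.
    from mdio import segy_to_mdio

    return segy_to_mdio
```

In the script, call load_segy_to_mdio with the value taken from the script argument and then invoke the returned function with your usual ingestion arguments.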