equinor / everest-models

GNU General Public License v3.0

Limiting the number of ResInsight instances to avoid Port number error #46

Open ffel10 opened 1 month ago

ffel10 commented 1 month ago

1) Problem statement: Currently, approximately 25% of my Everest well trajectory realizations fail with the following error: 'ConnectionError: Failed to launch ResInsight.' The ResInsight error message states: 'Launching as console app. Port number file retry count: 60. Unable to read port number. Launch failed.'

2) I have reproduced the error using a separate script. If the script launches a large number of ResInsight instances simultaneously, ResInsight tends to fail to launch (it exceeds the port number file retry count of 60). I can resolve this by limiting the number of active ResInsight instances, similar to the code below.

3) Possible solution: I suggest, for instance, modifying _everest_models/jobs/fm_welltrajectory.py as shown below.

import signal
import sys
import threading
from pathlib import Path

import rips


class ResInsight:
    # Semaphore with, for instance, 20 permits to limit simultaneous instances.
    # It only limits instances launched from within the same Python process.
    _instance_semaphore = threading.Semaphore(20)
    _max_attempts = 3  # One initial attempt plus up to 2 retries

    def __init__(self, executable: Path) -> None:
        self._executable = executable
        self._instance = None
        signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
        signal.signal(signal.SIGINT, lambda *_: sys.exit(0))

    def __enter__(self) -> rips.Instance:
        # Acquire a semaphore permit before launching the instance
        self._instance_semaphore.acquire()
        last_error = None
        for _ in range(self._max_attempts):
            try:
                instance = rips.Instance.launch(str(self._executable), console=True)
            except Exception as error:
                last_error = error  # Remember the failure and try again
                continue
            if instance is not None:
                self._instance = instance
                return instance
        # Every attempt failed: release the permit before raising
        self._instance_semaphore.release()
        raise ConnectionError(
            f"Failed to launch ResInsight after {self._max_attempts} attempts."
        ) from last_error

    def __exit__(self, *_) -> None:
        # Assumed counterpart, not shown in the original proposal: without it
        # the permit is never returned after a successful launch.
        try:
            self._instance.exit()
        finally:
            self._instance_semaphore.release()

ffel10 commented 1 month ago

As requested, here is the Python code to "reproduce" the ResInsight launching error. With Max_sim_running = 20, I obtain zero failures. With Max_sim_running = 200, on the other hand, I obtained 178 launching errors.

import logging
from pathlib import Path
import rips
import threading
from threading import Semaphore, Timer

Total_attempts=200
Max_sim_running=20

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()
        #print("ResInsight instance closed automatically after 5 seconds.")

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # Schedule the instance to close after 5 seconds
                timer = threading.Timer(5, close_resinsight_instance, [instance])
                timer.start()
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []

    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path)
    print(f"Total failed attempts: {failed_attempts}")
ffel10 commented 1 month ago

By the way, a comment from the ResInsight developers. "ResInsight use GRPC as the communication protocol, and I found this link indicating a default of 100 concurrent sessions."

https://learn.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-8.0

sondreso commented 4 weeks ago

Hi!

Limiting the number of simultaneously running instances by using a semaphore will unfortunately not work when the jobs are distributed onto a compute cluster (as they are not in the same process or even on the same machine). However, the number of concurrently running simulations can be limited by setting the cores option under the simulator section (See: https://fmu-docs.equinor.com/docs/everest/config_reference.html#simulator-optional). This should help with the failure rate with the current limitations.

But we should also work with ResInsight to see if we can lift these restrictions!

magnesj commented 3 weeks ago

You can also use 0 as the port number; GRPC will then pick a port number to use for the communication. The ResInsight executable writes the actual port number to a text file, which is then read by Python. I am not sure how robust this method is.

https://github.com/OPM/ResInsight/blob/dev/GrpcInterface/Python/rips/instance.py#L97-L118
https://github.com/OPM/ResInsight/blob/dev/GrpcInterface/Python/rips/instance.py#L170
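
To make the port-file handshake described above concrete, here is a minimal sketch of the polling side. It is not the rips implementation: the function name read_port_from_file, its parameters, and the one-second delay are assumptions for illustration, and only the retry count of 60 is taken from the error message in this issue. The idea is that the reader polls the file a fixed number of times and gives up if no valid port number appears.

```python
import time
from pathlib import Path
from typing import Optional


def read_port_from_file(
    port_file: Path, retries: int = 60, delay: float = 1.0
) -> Optional[int]:
    """Poll port_file until it holds a valid port number or the retries run out.

    Hypothetical helper: the name, the signature and the default delay are
    assumptions, not the actual rips code.
    """
    for _ in range(retries):
        try:
            text = port_file.read_text().strip()
        except FileNotFoundError:
            text = ""  # The launched process has not written the file yet
        if text.isdigit() and 0 < int(text) < 65536:
            return int(text)
        time.sleep(delay)
    return None  # Timed out without finding a valid port number
```

With a scheme like this, a process that starts slowly because many instances compete for the same machine can exhaust the retries even though nothing is wrong with the port itself.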

larsevj commented 1 week ago

Using port number 0 seems to have helped a bit. However, starting many ResInsight instances at the same time still causes some issues, and it seems to be the communication of the port number via the text file that fails. Running the following slightly modified version of the script above:

import rips
import threading

Total_attempts=50
Max_sim_running=50

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True, launch_port=0)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # Schedule the instance to close after 5 seconds
                timer = threading.Timer(5, close_resinsight_instance, [instance])
                timer.start()
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []

    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    #executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path="")
    print(f"Total failed attempts: {failed_attempts}")

produces the following errors: output_rips_test.txt

kriben commented 1 week ago

@larsevj I have tried to reproduce this on my machine, and I have to set both Total_attempts and Max_sim_running to 500 before it starts failing. It will depend on the available hardware.

The log message Portnumber file retry count : 60 means the process timed out without producing a valid file. I will improve this error message. If I increase this timeout from 60 to 120, I can run with Total_attempts=10000 and Max_sim_running=1000 without failures on modern hardware. I will increase the timeout and make it configurable for the next release of ResInsight.

There is also a small bug in your repro code: you should probably use time.sleep(5) instead of threading.Timer() when simulating a five-second ResInsight session. threading.Timer runs in a separate thread and does not block, so the semaphore is released before five seconds have passed and before the process is complete. Therefore it fails to limit the number of concurrent instances. You can see this with ps -ef | grep ResInsight: it should list at most Max_sim_running non-defunct ResInsight processes.
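
The difference can be seen without ResInsight at all. The following is a minimal sketch in which a "session" is just a counter; the names peak_concurrency, open_session and close_session, as well as the worker count, limit and hold time, are made up for illustration. It measures how many simulated sessions are alive at once when the semaphore is held with time.sleep compared to when the close is handed off to threading.Timer.

```python
import threading
import time


def peak_concurrency(
    blocking: bool, workers: int = 8, limit: int = 2, hold: float = 0.2
) -> int:
    """Return the highest number of simulated sessions alive at the same time."""
    semaphore = threading.Semaphore(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}
    timers = []

    def open_session() -> None:
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])

    def close_session() -> None:
        with lock:
            state["active"] -= 1

    def worker() -> None:
        with semaphore:
            open_session()
            if blocking:
                time.sleep(hold)  # The permit is held for the whole session
                close_session()
            else:
                timer = threading.Timer(hold, close_session)
                timer.start()  # Returns at once, so the permit is released early
                with lock:
                    timers.append(timer)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    for timer in timers:
        timer.join()  # Wait for the delayed closes before reporting
    return state["peak"]


if __name__ == "__main__":
    print("peak with time.sleep:", peak_concurrency(blocking=True))
    print("peak with threading.Timer:", peak_concurrency(blocking=False))
```

With time.sleep the peak cannot exceed the limit, because each worker keeps its permit until its session is closed. With threading.Timer the workers leave the with block almost immediately, so the peak typically reaches the total number of workers.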

I have attached the script I tested with:

import threading
import rips
import time

Total_attempts=1000
Max_sim_running=200

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True, launch_port=0)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # close after 5 seconds
                time.sleep(5)
                close_resinsight_instance(instance)

        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []

    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    #executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path="")
    print(f"Total failed attempts: {failed_attempts}")
kriben commented 1 week ago

By the way, a comment from the ResInsight developers. "ResInsight use GRPC as the communication protocol, and I found this link indicating a default of 100 concurrent sessions."

https://learn.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-8.0

This does not apply to this problem: the Python script launches many ResInsight instances, which have only one connection each.

kriben commented 1 week ago

Added increased timeout here: https://github.com/OPM/ResInsight/pull/11666

Will be part of ResInsight 2024.09 release.