Open ffel10 opened 1 month ago
As requested, here is the Python code to reproduce the ResInsight launching error. With Max_sim_running = 20, I get zero failures. With Max_sim_running = 200, on the other hand, I got 178 launching errors.
import logging
from pathlib import Path
import rips
import threading
from threading import Semaphore, Timer

Total_attempts=200
Max_sim_running=20

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()
        #print("ResInsight instance closed automatically after 5 seconds.")

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # Schedule the instance to close after 5 seconds
                timer = threading.Timer(5, close_resinsight_instance, [instance])
                timer.start()
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []
    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path)
    print(f"Total failed attempts: {failed_attempts}")
By the way, a comment from the ResInsight developers. "ResInsight use GRPC as the communication protocol, and I found this link indicating a default of 100 concurrent sessions."
https://learn.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-8.0
Hi!
Limiting the number of simultaneously running instances by using a semaphore will unfortunately not work when the jobs are distributed onto a compute cluster (as they are not in the same process or even on the same machine). However, the number of concurrently running simulations can be limited by setting the cores option under the simulator section (see: https://fmu-docs.equinor.com/docs/everest/config_reference.html#simulator-optional). This should help with the failure rate under the current limitations.
But we should also work with ResInsight to see if we can lift these restrictions!
You can also use 0 as the port number; GRPC will then pick a port number to use for the communication. The actual port number is communicated from the ResInsight executable through a text file that is read by Python.
I am not sure how robust this method is.
https://github.com/OPM/ResInsight/blob/dev/GrpcInterface/Python/rips/instance.py#L97-L118 https://github.com/OPM/ResInsight/blob/dev/GrpcInterface/Python/rips/instance.py#L170
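To illustrate what port 0 means at the socket level, here is a minimal sketch using plain Python sockets. It is not the rips or GRPC code, and pick_free_port is a hypothetical helper name. Binding to port 0 asks the operating system for any free port, and the chosen number is only known after the bind, which is why it has to be reported back to the client in some way (in ResInsight's case, via the text file).

```python
import socket


def pick_free_port() -> int:
    """Bind to port 0 so the operating system assigns a free port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        # The port actually chosen is only known after the bind.
        return sock.getsockname()[1]


port = pick_free_port()
print(f"OS-assigned port: {port}")
```

Note that the socket is closed again before the function returns, so this only shows how a port gets assigned; a real server would keep the socket it bound.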
Using port number 0 seems to have helped a bit. However, starting many ResInsight instances at the same time still causes some issues, and it seems to be the communication of the port number via the text file that fails. Running the following slightly modified version of the script above:
import rips
import threading

Total_attempts=50
Max_sim_running=50

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True, launch_port=0)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # Schedule the instance to close after 5 seconds
                timer = threading.Timer(5, close_resinsight_instance, [instance])
                timer.start()
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []
    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    #executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path="")
    print(f"Total failed attempts: {failed_attempts}")
produces the following errors: output_rips_test.txt
@larsevj I have tried to reproduce this on my machine, and I have to set both Total_attempts and Max_sim_running to 500 before it starts failing. It will depend on the available hardware.
The log message "Portnumber file retry count : 60" means the process timed out without producing a valid file. I will improve this error message. If I increase this timeout from 60 to 120, I can run with Total_attempts=10000 and Max_sim_running=1000 without failures on modern hardware. I will increase the timeout and make it configurable for the next release of ResInsight.
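As a rough illustration of the mechanism behind that message, the sketch below polls for a port-number file until a retry count is exhausted. It is not the actual rips implementation: the function name, the delay parameter, and the -1 return value are assumptions made for this example. Only the idea of a retry count (60 in the log message) comes from the discussion above.

```python
import os
import tempfile
import time


def read_port_number(file_path: str, max_retries: int = 60, delay: float = 1.0) -> int:
    """Poll for a port-number file; return the port, or -1 on timeout."""
    retries = 0
    while not os.path.isfile(file_path) and retries < max_retries:
        time.sleep(delay)
        retries += 1
    if not os.path.isfile(file_path):
        return -1  # timed out: the launched process never wrote the file
    with open(file_path) as f:
        text = f.readline().strip()
    # An empty or non-numeric first line is also treated as a failure.
    return int(text) if text.isdigit() else -1


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "portnumber.txt")
    missing = read_port_number(path, max_retries=3, delay=0.01)
    with open(path, "w") as f:
        f.write("50051\n")
    found = read_port_number(path, max_retries=3, delay=0.01)

print(missing, found)  # prints: -1 50051
```

With a fixed retry count, the total wait is max_retries times delay, so a heavily loaded machine that starts many processes at once can exceed it even though every process would eventually have written its file.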
There is also a small bug in your repro code: you should probably use time.sleep(5) instead of threading.Timer() when simulating a five-second ResInsight session. threading.Timer runs in a separate thread and does not block, so the semaphore is released before five seconds have passed and before the process is complete. Therefore it fails to limit the number of concurrent instances. You can see this with ps -ef | grep ResInsight: it should list at most Max_sim_running non-defunct ResInsight processes.
Here is the script I tested with:
import threading
import rips
import time

Total_attempts=1000
Max_sim_running=200

# Semaphore to limit to concurrent instances
instance_semaphore = threading.Semaphore(Max_sim_running)

def close_resinsight_instance(instance):
    """Function to close the ResInsight instance."""
    if instance:
        instance.exit()

def launch_resinsight_instance(executable_path: str, attempt: int, results: dict):
    print(f"Attempt {attempt + 1} to launch ResInsight...")
    with instance_semaphore:
        try:
            instance = rips.Instance.launch(executable_path, console=True, launch_port=0)
            if instance is None:
                print(f"Failed to launch ResInsight on attempt {attempt + 1}")
                results[attempt] = False
            else:
                results[attempt] = True
                # close after 5 seconds
                time.sleep(5)
                close_resinsight_instance(instance)
        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            results[attempt] = False

def test_resinsight_launches(executable_path: str, attempts: int = Total_attempts) -> int:
    failed_attempts = 0
    results = {}
    threads = []
    for i in range(attempts):
        thread = threading.Thread(target=launch_resinsight_instance, args=(executable_path, i, results))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    failed_attempts = sum(1 for success in results.values() if not success)
    return failed_attempts

if __name__ == "__main__":
    #executable_path = "/prog/ResInsight/current/ResInsight"
    failed_attempts = test_resinsight_launches(executable_path="")
    print(f"Total failed attempts: {failed_attempts}")
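The difference between threading.Timer and time.sleep inside the semaphore can also be checked without ResInsight. The following is a minimal sketch in which a counter stands in for running instances; LIMIT, JOBS, SESSION and peak_concurrency are hypothetical names chosen for this illustration, and the session length is shortened so the script finishes quickly.

```python
import threading
import time

LIMIT = 3      # stand-in for Max_sim_running
JOBS = 12      # stand-in for Total_attempts
SESSION = 0.3  # stand-in for the five-second session


def peak_concurrency(use_timer: bool) -> int:
    """Run JOBS fake sessions and return the highest number alive at once."""
    semaphore = threading.Semaphore(LIMIT)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}
    timers = []

    def start_session():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])

    def end_session():
        with lock:
            state["running"] -= 1

    def worker():
        with semaphore:
            start_session()
            if use_timer:
                # Timer.start() returns immediately, so the semaphore is
                # released while the fake session is still alive.
                timer = threading.Timer(SESSION, end_session)
                timer.start()
                with lock:
                    timers.append(timer)
            else:
                # Blocks inside the semaphore for the whole session.
                time.sleep(SESSION)
                end_session()

    threads = [threading.Thread(target=worker) for _ in range(JOBS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for timer in timers:
        timer.join()
    return state["peak"]


timer_peak = peak_concurrency(use_timer=True)
sleep_peak = peak_concurrency(use_timer=False)
print(f"limit={LIMIT} timer_peak={timer_peak} sleep_peak={sleep_peak}")
```

With the timer variant the peak typically reaches JOBS, because every worker leaves the semaphore almost immediately. With the sleep variant the peak cannot exceed LIMIT.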
By the way, a comment from the ResInsight developers. "ResInsight use GRPC as the communication protocol, and I found this link indicating a default of 100 concurrent sessions."
https://learn.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-8.0
This does not apply to this problem: the Python script launches many ResInsight instances, each of which has only one connection.
Added increased timeout here: https://github.com/OPM/ResInsight/pull/11666
Will be part of ResInsight 2024.09 release.
1) Problem statement: Currently, approximately 25% of my Everest well trajectory realizations fail due to the following error: 'ConnectionError: Failed to launch ResInsight.' The ResInsight error message states: 'Launching as console app. Port number file retry count: 60. Unable to read port number. Launch failed.'
2) I have reproduced the error using a separate script. If the script launches a large number of ResInsight instances simultaneously, ResInsight tends to fail launching (exceeds the Port number file retry count: 60). I can resolve this by limiting the number of active ResInsight instances, similar to the code below.
3) Possible solution: I suggest, for instance, a modification to _everest_models/jobs/fm_welltrajectory.py as suggested below.