Open LuiggiTenorioK opened 7 months ago
I set up a background process to track the number of open files and see whether they accumulate. If they do, I will have to review the code thoroughly and fix every place where a connection is opened without being closed.
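For reference, a tracker like this can be sketched roughly as follows. This is not the actual process running in production, just a minimal illustration that assumes a Linux host (it reads `/proc/<pid>/fd`); the function names `count_open_files` and `sample_open_files` are made up for the example.

```python
import os
import time


def count_open_files(pid="self"):
    """Count the file descriptors currently open by a process (Linux only)."""
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_open_files(pid="self", samples=3, interval=0.1):
    """Collect (timestamp, fd_count) pairs at a fixed interval."""
    history = []
    for _ in range(samples):
        history.append((time.time(), count_open_files(pid)))
        time.sleep(interval)
    return history
```

Averaging the samples per hour gives a series that can be plotted to see whether the count keeps growing.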
CC @mcastril @kinow
In GitLab by @kinow on Apr 3, 2024, 11:16
mentioned in commit 9a94496c19faf870e79cd5cd7cb06bbd74276a07
I have already plotted the evolution of the average number of open files per hour. In the first hour it opened a lot of files, then the count stayed flat during the day, but it is now increasing again at the end of working hours.
It will be interesting to see if this leak is caused by some external process that is run once a day.
mentioned in commit 9c242dc5146a5b31fe8d48fbc12e23654f150283
In GitLab by @manuel-g-castro on Apr 4, 2024, 16:46
@dbeltrankyl suggested adding the following debug logs:
#Log.debug(f"FD submit: {fd_show.fd_table_status_str()}")
#Log.debug(f"FD endsubmit: {fd_show.fd_table_status_str()}")
# Log.debug("FD recovery: {0}".format(log.fd_show.fd_table_status_str()))
Updated plot (you may want to open it in another tab to zoom in):
Some conclusions:
In GitLab by @kinow on Apr 8, 2024, 10:06
The leak increase does not seem to be caused by a recurrent process. The increase might be triggered by the usage of a tool (maybe the testing suite) that makes multiple requests in a short time (https://earth.bsc.es/gitlab/es/testing_suite/-/issues/50).
Ah, hadn't thought about the testing suite. Sounds plausible!
After the deployment of the latest version of the API, this was the behavior on the open files:
After the patch, the count now seems to grow faster and continuously. I also found other open database (DDBB) files that are related to newer changes, which suggests that this problem might be related to the migration to SQLAlchemy.
I made a test:
engine = common.create_autosubmit_db_engine()  # Creates a SQLAlchemy Engine
with engine.connect() as conn:
    print_pid_lsof(current_pid)  # Returns 1 open file as expected
print_pid_lsof(current_pid)  # The file is still open after the with block's __exit__ closes the connection
engine.dispose()  # Explicit dispose must be used to release the file
print_pid_lsof(current_pid)  # Returns 0 files
According to the documentation, this is related to the connection pool created by the Engine.
In the case of SQLite, the pool keeps the file open even when no connection is in use.
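To make the pool behavior easier to reproduce outside the API, here is a small standalone sketch. It assumes Linux (it inspects `/proc/self/fd`) and an installed SQLAlchemy; `open_handles_to` and `handles_after_use` are hypothetical helpers written for this example, not part of the API code. It compares an engine created with `poolclass=NullPool`, which does not keep connections around, against the default pool followed by `engine.dispose()`.

```python
import os
import tempfile

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool


def open_handles_to(path):
    """Count this process's file descriptors that point at `path` (Linux only)."""
    target = os.path.realpath(path)
    count = 0
    for fd in os.listdir("/proc/self/fd"):
        try:
            if os.readlink(f"/proc/self/fd/{fd}") == target:
                count += 1
        except OSError:
            # The descriptor was closed between listing and reading it.
            pass
    return count


def handles_after_use(db_path, **engine_kwargs):
    """Return the handle counts after the with block and after dispose()."""
    engine = create_engine(f"sqlite:///{db_path}", **engine_kwargs)
    with engine.begin() as conn:
        conn.execute(text("SELECT 1"))
    after_close = open_handles_to(db_path)
    engine.dispose()  # Closes any connections still held by the pool
    after_dispose = open_handles_to(db_path)
    return after_close, after_dispose


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    print("default pool:", handles_after_use(db_path))
    print("NullPool:    ", handles_after_use(db_path, poolclass=NullPool))
```

With `NullPool` the file should be released as soon as the `with` block exits, at the cost of opening a new connection for every request. Which option fits the API better is a design decision this sketch does not settle.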
Since the release of v4.0.0b4, this issue has been resolved for most of the DDBB files (autosubmit.db, as_times.db, etc.).
Still, some distributed DDBBs stay open (job_data_xxxx.db and graph_data_xxxx.db), which happens less frequently but still occurs. This is due to old modules that don't correctly close the connection. As these DDBB managers are being refactored in the Postgres support branch, it is expected that this issue will be closed once the Postgres support is done.
In the testing suite issue https://earth.bsc.es/gitlab/es/testing_suite/-/issues/50, they reported that the API is randomly returning errors related to the number of files it has open.
I've reviewed the code, and it seems possible that this is happening because in many parts of the source code a database connection is opened but never closed, especially in really old modules.
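One pattern that could help when fixing those modules, assuming they use the standard `sqlite3` module directly: `with sqlite3.connect(...) as conn` only commits or rolls back the transaction and does not close the connection, so the file stays open. Wrapping the connection in `contextlib.closing` closes it explicitly. The helper `count_rows` and the `job_data` table below are hypothetical and only serve to illustrate the idea.

```python
import sqlite3
from contextlib import closing


def count_rows(db_path):
    """Count rows in a demo table, always closing the connection afterwards."""
    # closing() calls conn.close() on exit, even if a query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        # The inner `with conn` only commits or rolls back; it does not close.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS job_data (id INTEGER PRIMARY KEY)"
            )
        (count,) = conn.execute("SELECT COUNT(*) FROM job_data").fetchone()
    return count
```

Calling this repeatedly should not increase the number of open files, because each call releases its own handle before returning.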
I have already checked how many open files are used by the admin user in production at ES, and the number is quite high (~1200 open files).