BSC-ES / autosubmit-api

Autosubmit API is a package that consumes the information generated by Autosubmit and serves it as an API.
GNU General Public License v3.0

Too many open files #73

Open LuiggiTenorioK opened 7 months ago

LuiggiTenorioK commented 7 months ago

In the testing suite issue https://earth.bsc.es/gitlab/es/testing_suite/-/issues/50, they reported that the API randomly returns errors related to the number of files it has open.

I've reviewed the code, and this may be happening because in many parts of the source code a database connection is opened but never closed, especially in the really old modules.

I already checked how many open files the admin user has in production at ES, and the number is quite high (~1200 open files).
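To illustrate the kind of fix this implies, here is a minimal sketch of a query helper that always closes its connection. It assumes the old modules use the standard `sqlite3` module directly, which may not match every module; `fetch_one` is a hypothetical name, not a function from the code base.

```python
import sqlite3
from contextlib import closing


def fetch_one(db_path, query, params=()):
    """Run a query and always close the connection afterwards (hypothetical helper)."""
    # closing() calls conn.close() on exit, even if the query raises.
    # Note: `with sqlite3.connect(...) as conn:` on its own only commits or
    # rolls back the transaction; it does not close the connection.
    with closing(sqlite3.connect(db_path)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query, params)
            return cur.fetchone()
```

The point of the sketch is that the connection's lifetime is tied to the `with` block, so the file descriptor is released as soon as the query finishes.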

LuiggiTenorioK commented 7 months ago

I set up a background process to track the number of open files and see whether they accumulate. If they do, I will have to review the code in depth and fix every place where a connection is opened without being closed.
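A rough sketch of how such a tracker could count open database files for a process. This is not the actual script; it assumes a Linux host where `/proc/<pid>/fd` is readable, and the function names and the `.db` suffix filter are illustrative.

```python
import os


def open_file_targets(pid="self"):
    """Return the paths behind every open file descriptor of a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for fd in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
    return targets


def count_open_db_files(pid="self", suffix=".db"):
    """Count open descriptors whose target path ends with the given suffix."""
    return sum(1 for path in open_file_targets(pid) if path.endswith(suffix))
```

Calling `count_open_db_files(pid)` at a fixed interval and logging the result with a timestamp would be enough to see whether the count keeps growing.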

LuiggiTenorioK commented 7 months ago

CC @mcastril @kinow

LuiggiTenorioK commented 7 months ago

In GitLab by @kinow on Apr 3, 2024, 11:16

mentioned in commit 9a94496c19faf870e79cd5cd7cb06bbd74276a07

LuiggiTenorioK commented 7 months ago

I've plotted the evolution of the average number of open files per hour. In the first hour, the API opened a lot of files; the count then stayed flat during the day, but it is now increasing again at the end of working hours.

It will be interesting to see whether this leak is caused by some external process that runs once a day.

[Plot: average number of open files per hour]

LuiggiTenorioK commented 7 months ago

mentioned in commit 9c242dc5146a5b31fe8d48fbc12e23654f150283

LuiggiTenorioK commented 7 months ago

In GitLab by @manuel-g-castro on Apr 4, 2024, 16:46

@dbeltrankyl suggested adding:

#Log.debug(f"FD submit: {fd_show.fd_table_status_str()}")
#Log.debug(f"FD endsubmit: {fd_show.fd_table_status_str()}")
# Log.debug("FD recovery: {0}".format(log.fd_show.fd_table_status_str()))

LuiggiTenorioK commented 7 months ago

Updated plot (you may want to open it in another tab to zoom in):

[Plot: open_file_plot_20240408]

Some conclusions:

LuiggiTenorioK commented 7 months ago

In GitLab by @kinow on Apr 8, 2024, 10:06

Leak increase seems not to be caused by a recurrent process.
The increase might be triggered by the usage of a tool (maybe the testing suite) that makes multiple requests in a short time (https://earth.bsc.es/gitlab/es/testing_suite/-/issues/50)

Ah, hadn't thought about the testing suite. Sounds plausible!

LuiggiTenorioK commented 6 months ago

After the deployment of the latest version of the API, this was the behavior of the open files:

[Plot: track_of]

After the patch, the number of open files now seems to grow faster and continuously. I also found other open database files that are related to newer changes, which suggests that this problem might be related to the migration to SQLAlchemy.

I made a test:

engine = common.create_autosubmit_db_engine() # Creates a SQLAlchemy Engine

with engine.connect() as conn:
    print_pid_lsof(current_pid) # Shows 1 open file, as expected

print_pid_lsof(current_pid) # The file is still open after the with block's __exit__ closes the connection

engine.dispose() # An explicit dispose is needed to release the file
print_pid_lsof(current_pid) # Shows 0 open files

According to the SQLAlchemy documentation, this is related to the connection pool created by the Engine.

In the case of SQLite, the pool keeps the file open even when no connection is checked out.
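One possible way to avoid the pooled file handle, besides calling `engine.dispose()`, is to create the Engine with `NullPool` so that every connection is really closed when it is released. The sketch below is only an illustration under that assumption, not the fix applied in the API; the temporary database path and the `open_paths` helper (Linux only, reads `/proc`) are hypothetical.

```python
import os
import tempfile

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool


def open_paths(pid="self"):
    """Return the paths behind the open file descriptors of a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            paths.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # descriptor closed while iterating
    return paths


# Hypothetical throwaway database used only for this demonstration.
db_path = os.path.realpath(os.path.join(tempfile.mkdtemp(), "example.db"))

# NullPool disables pooling: closing a connection closes the underlying
# SQLite connection instead of returning it to a pool.
engine = create_engine(f"sqlite:///{db_path}", poolclass=NullPool)

with engine.connect() as conn:
    conn.execute(text("SELECT 1")).scalar()
    open_inside = db_path in open_paths()  # file is open while connected

open_after = db_path in open_paths()  # file is released after the with block
```

The trade-off is that each request pays the cost of opening a new connection, which is usually small for SQLite files but should be measured before adopting it.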

LuiggiTenorioK commented 5 months ago

Since the release of v4.0.0b4, this issue has been resolved for most of the database files (autosubmit.db, as_times.db, etc).

Still, some distributed databases stay open (job_data_xxxx.db and graph_data_xxxx.db); this happens less often but still occurs. It is due to old modules that don't correctly close the connection. As these database managers are being refactored in the Postgres support branch, it is expected that this issue will be closed once the Postgres support is done.