BSC-ES / autosubmit-api

Autosubmit API is a package that consumes the information generated by Autosubmit and serves it as an API.
GNU General Public License v3.0

autosubmit gui bar graph not progressing #31

Open LuiggiTenorioK opened 1 year ago

LuiggiTenorioK commented 1 year ago

In GitLab by @ebergas on Sep 6, 2023, 09:50

Hi, I have several experiments I had to resubmit after this weekend. While the experiments themselves are running smoothly, the progress bar fails to reflect the true status. For instance, it currently displays "142/440 jobs," but more than 200 jobs have actually completed. However, when I check the tree graph, it shows an accurate representation of what is going on.

Screenshot_from_2023-09-06_09-39-08

What could be going wrong? I find it very useful to just check the progress with the preview, and now I cannot do that.

LuiggiTenorioK commented 1 year ago

In GitLab by @manuel-g-castro on Sep 6, 2023, 12:41

First and foremost, I am really sorry, but we don't have anyone in charge of either the API or the GUI since Julián left. The new developer, Luiggi, should arrive on October 1st, according to @mcastril's latest news from HR.

BUT that didn't stop me from trying to see what the issue was (even though, DISCLAIMER, I know very little about web development).

This issue is indeed really, really weird! I saw that more experiments are reporting incorrectly in the progress bar: a6dj, a6e7, a6e8, a6dk, and a6di. But there is one that does not show this behavior: a6e9, which is the only one that has been running since August 25th. All the other experiments seem to have failed somewhere around September 3rd, right? And you reran them in the afternoon of that same day.

@dbeltrankyl noticed that there are empty databases matching the expids of the problematic experiments. Interestingly, a6e9 is not among those troublesome databases. The issue might be solved by manually deleting these files and rerunning the experiments, if you think this is worth it.

In the meantime, I am transferring this issue to the API, since I believe that the GUI is reading the values properly.

mgimenez@bsces107930 ~ % cd /esarchive/autosubmit/as_metadata/data

mgimenez@bsces107930 ~/Documents/esarchive/autosubmit/as_metadata/data % ls -l | grep -e a6dj -e a6e7 -e a6e8 -e a6dk -e a6di -e a6e9
-rwxrwxrw- 1 2401 565      7168 ago 14 19:31 job_data_a6di.db
-rw-rw-rw- 1 2401 565      1659 sep  6 10:46 job_data_a6di.sql
-rwxrwxrw- 1 2401 565      7168 ago 17 16:26 job_data_a6dj.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:15 job_data_a6dj.sql
-rwxrwxrw- 1 2401 565      7168 ago 17 16:39 job_data_a6dk.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:22 job_data_a6dk.sql
-rwxrwxrw- 1 2401 565      7168 ago 22 11:31 job_data_a6e7.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:26 job_data_a6e7.sql
-rwxrwxrw- 1 2401 565      7168 ago 22 11:49 job_data_a6e8.db
-rw-rw-rw- 1 2401 565      1659 sep  6 12:26 job_data_a6e8.sql
-rwxrwxrw- 1 2401 565    444416 sep  6 10:58 job_data_a6e9.db
-rw-rw-rw- 1 2401 565    360468 sep  6 10:58 job_data_a6e9.sql

LuiggiTenorioK commented 1 year ago

In GitLab by @manuel-g-castro on Sep 6, 2023, 12:41

moved from autosubmitreact#85

LuiggiTenorioK commented 1 year ago

I noticed that the progress bar takes its completed-jobs count from job_data_{expid}.db, while the tree view gets it from the .pkl files. This might be an internal problem in the worker process that populates the .db files. We will need to reproduce the bug and debug that worker.
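
One quick way to confirm that a job_data_{expid}.db file is effectively empty is to count the rows in each of its tables. The following is a minimal sketch using only Python's standard sqlite3 module. It does not assume any particular Autosubmit schema, and the function name table_row_counts is hypothetical.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    # Open read-only so that inspecting a production file cannot modify it.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        counts = {}
        for name in tables:
            # Names come from sqlite_master; quote them as SQL identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

A result in which every count is 0 would suggest that the file holds only the schema and was never populated by the worker.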

LuiggiTenorioK commented 1 year ago

In GitLab by @mcastril on Oct 9, 2023, 16:46

This is very interesting information, Luiggi.

LuiggiTenorioK commented 1 year ago

It seems that the populate_queue_run_times.py worker wasn't updating the number of completed jobs because it was trying to insert new rows into a table whose primary key already contained those values. The fix was to use an INSERT OR REPLACE statement, so that the existing row is updated when the insert would otherwise violate the PRIMARY KEY constraint.
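
To illustrate this failure mode, here is a minimal sketch with Python's sqlite3 module. The table job_times, its columns, and the expid a000 are hypothetical stand-ins, not the actual schema used by populate_queue_run_times.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per experiment, keyed by expid.
conn.execute(
    "CREATE TABLE job_times ("
    "expid TEXT PRIMARY KEY, completed INTEGER, total INTEGER)"
)
conn.execute("INSERT INTO job_times VALUES ('a000', 142, 440)")

# A plain INSERT for an existing key violates the PRIMARY KEY constraint,
# so the stored count is never refreshed.
try:
    conn.execute("INSERT INTO job_times VALUES ('a000', 200, 440)")
    plain_insert_failed = False
except sqlite3.IntegrityError:
    plain_insert_failed = True

# INSERT OR REPLACE removes the conflicting row and writes the new one.
conn.execute("INSERT OR REPLACE INTO job_times VALUES ('a000', 200, 440)")
completed = conn.execute(
    "SELECT completed FROM job_times WHERE expid = 'a000'"
).fetchone()[0]
conn.close()
```

Note that INSERT OR REPLACE deletes the old row before inserting the new one, so any column not supplied in the statement falls back to its default rather than keeping its previous value.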

This was already patched in commit 4e41a40b, which will be available in pre-release v4.0.0b2.

Even so, we will need to watch this closely in production to check whether the patch fixes the problem, since we have no details on how to reproduce the error.

LuiggiTenorioK commented 8 months ago

@mcastril @dbeltrankyl The issue we saw today with experiments a6zk and a70a is related to this. In this case, the experiment pkl file and the job_data_{expid}.db are not synchronized, and the database file is also empty.

Initially, I thought that the experiment wasn't running, but I confirmed it was active by checking that AS_LOGS/20240320_160854_run.log was being continuously updated.
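
That check can be scripted by looking at the log file's modification time. This is a minimal sketch, assuming that a run log modified within the last few minutes means the experiment is active. The function name and the 300-second threshold are arbitrary choices for illustration.

```python
import os
import time


def recently_modified(path, max_age_seconds=300, now=None):
    """Return True if `path` was modified within the last `max_age_seconds`."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) <= max_age_seconds
```

Calling it on the run log a couple of times, a few minutes apart, gives a rough indication of whether the experiment is still writing output.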

LuiggiTenorioK commented 8 months ago

mentioned in issue autosubmit#1262

LuiggiTenorioK commented 8 months ago

I just tested another buggy experiment we saw yesterday (a6yi), which doesn't show the total or completed jobs. The issue was related to data types that weren't controlled in the removed as_times.db tables.

With version v4.0.0b5 it works as intended because it uses the distributed databases :tada: