Performance metrics in Autosubmit/GUI #27

Closed LuiggiTenorioK closed 1 day ago

In GitLab by @mcastril on Apr 27, 2020, 12:47

Following the conversation started in https://earth.bsc.es/gitlab/es/autosubmit/issues/522#note_77443

Yes, sometimes I have been looking in the GUI and missed this information, have real timestamps. Another twist, but very reliable with the information that we have, would be to filter all the SIM jobs (if there's any), get the total simulated time (sum of SIM running times avoiding job repetitions) and divide by the simulated time chunkchunk_sizenchunks - startdate. That would give you an estimated number of SYPD. If you count queuing times you would have ASYPD. Well, we can define this a little bit more but this is something that definitely we will have one day.

We will include in this issue the metrics that we can implement in Autosubmit/GUI and dicuss about their development.

In GitLab by @mcastril on Jul 14, 2020, 19:01

Thanks @gmontanepinto . In fact also Auto-EC-Earth has the capability to run CMOR jobs in Nord3, although we usually run them in the same platform as the SIM, to avoid problems in PRACE/RES workflows.

I agree that it is difficult to do it without adding metadata. This should not affect the users because they usually copy experiments from the testing suite.

In GitLab by @pablohe on Jul 15, 2020, 07:56

Thanks @gmontanepinto . In fact also Auto-EC-Earth has the capability to run CMOR jobs in Nord3, although we usually run them in the same platform as the SIM, to avoid problems in PRACE/RES workflows.

CMORIZATION doesn't work in Nord3, @aamaral could you please confirm this?

The current implementation seems to rely in the epilog from the SIM job. This is written at the end of the job logs in Nord3 (I guess it uses MaxRSS), but not in MN4. I guess you only get this information in Nord3 @pablohe right?

It is not referenced in the code about MaxRSS

In GitLab by @aamaral on Jul 15, 2020, 11:18

Hi @pablohe

From what I remember it could work. I saw that I modified the configure.sh file on platforms/nord3, but to be honest I can't remember if it was tested. I usually ran the cmorisation itself on MN4.

In GitLab by @wuruchi on Jul 15, 2020, 13:10

closed via merge request autosubmit!183

In GitLab by @wuruchi on Jul 15, 2020, 13:10

mentioned in commit autosubmit@d40a0f0dc44702d2a9acd28376a04705a633cf65

In GitLab by @wuruchi on Jul 15, 2020, 13:19

reopened

In GitLab by @wuruchi on Jul 15, 2020, 13:29

Hello team,

The necessary changes have been applied to autosubmit!183 in order to store the information we will need to provide these performance metrics.

From the moment these changes are deployed:

A database for the experiment will be created in /esarchive/autosubmit/as_metadata/data/.
The database will store incremental historical job data information in a single table. Easy to query.
This table contains key job information as described in autosubmit#553.
There is one important column: extra_data. This column stores the data we will gather from SLURM as a JSON object. This effectively allows us to add more values in the future without having to change the structure of the job.
The database implements versioning. Thus, in case we needed to add columns, it will be possible.

With this information, we can now focus on implementing the API calls to retrieve the performance metrics and the visualization in the GUI.

Also, once these changes are deployed and users start using it, we will slowly transition from the old retrieval tool (the workers that read from the file system) to consume the data stored in the experiment database.

You can query /esarchive/autosubmit/as_metadata/data/job_data_a2ze.db for an example of how the database stores the changes and SLURM data.

Executing: select job_name, counter, last, submit, start, finish, status,platform,job_id, energy,extra_data from job_data;

This is one row: a2ze_20120101_000_2_REDUCE|2|1|1594810140|1594810202|1594810217|COMPLETED|marenostrum4|10904382|2230|{"10904382.extern": {"MaxRSS": "538K", "AveRSS": "538K", "energy": "2.23K"}, "10904382.batch": {"MaxRSS": "692K", "AveRSS": "692K", "energy": "1.99K"}, "10904382": {"MaxRSS": "NA", "AveRSS": "NA", "energy": "2.23K"}}

Finally, we also try to get the submit, start, and finish times directly from SLURM when possible.

In GitLab by @mcastril on Jul 23, 2020, 12:59

It is not referenced in the code about MaxRSS

I think it is MaxRSS, what Nord3 reports as "Max Memory"

Thanks @aamaral , it was only an example that we cannot rely on a fixed rule and is better to establish some metadata to distinguish between SIM , POST (in a general sense) and TRANSFER jobs.

In GitLab by @mcastril on Jul 23, 2020, 13:03

Thanks @wuruchi . Could we have an idea of the overhead of extra_data? For the current volume of experiments as an example. Just to be sure about the tradeoff for storing this information.

Also, once these changes are deployed and users start using it, we will slowly transition from the old retrieval tool (the workers that read from the file system) to consume the data stored in the experiment database.

How we will deal with older experiments? Could the available information be added with an exporter or they will not be supported?

In GitLab by @wuruchi on Jul 23, 2020, 13:45

Hello @mcastril

Could we have an idea of the overhead of extra_data? For the current volume of experiments as an example. Just to be sure about the tradeoff for storing this information.

Saving the information into the database should not add noticeable overhead. However, we will see how it behaves once we start reading the data. The complexity is linear as a function of the input.

How we will deal with older experiments? Could the available information be added with an exporter or they will not be supported?

I have adapted the API to read information from the old data source (as_times.db) when the historical database does not exist. This should be enough for the viewers (Tree and Graph). However, the performance metrics that involve energy consumption and other information that was not previously available, will be computed only for those jobs present in the historical database.

In GitLab by @mcastril on Jul 23, 2020, 16:51

Saving the information into the database should not add noticeable overhead. However, we will see how it behaves once we start reading the data. The complexity is linear as a function of the input.

At this point I was referring more to the storage (sorry for not being specific). The extra_data is a big field, and in experiments with thousands of jobs may make the database... three times bigger?. It seems sqlite3 writes binary data but does not support compression by default.

I understand this takes less space that the log files for example, but in any case we have to take into consideration that the /esarchive/autosubmit directory is already too big, and we should be careful when adding new data.

I have adapted the API to read information from the old data source (as_times.db) when the historical database does not exist. This should be enough for the viewers (Tree and Graph). However, the performance metrics that involve energy consumption and other information that was not previously available, will be computed only for those jobs present in the historical database.

That's perfect, thanks.

In GitLab by @wuruchi on Jul 24, 2020, 12:25

Hello @mcastril

At this point I was referring more to the storage (sorry for not being specific). The extra_data is a big field, and in experiments with thousands of jobs may make the database... three times bigger?. It seems sqlite3 writes binary data but does not support compression by default.

The extra_data field is only filled for SIM jobs. For most experiments, SIM jobs only represent a fraction of the total number of jobs. Considering that, we should not expect much space complexity in that regard.

I will take a look at the exact size of the different types of columns in sqlite3 to give a more accurate estimation.

In GitLab by @mcastril on Jul 28, 2020, 11:16

Thanks @wuruchi . It's true that usually each chunk has its post-processing and transfer jobs, so at the end SIMs may be 20% or 25% of the jobs. This raises the interesting question of including the energy consumption of the post-processing or not. This are usually serial jobs (using one full node) but usually they take some time, and in SR experiments where we run SIMs with 10-15 nodes maybe it is not so negligible.

In GitLab by @wuruchi on Jul 29, 2020, 13:21

Hello @mcastril @pablohe @macosta @gmontanepinto

I have added JPSY to the current implementation of the Autosubmit API and the GUI. The Autosubmit API Performance Metrics API call has been updated to reflect these changes.

Also, the GUI has been updated to show this metric and the energy value from the job historical database: https://earth.bsc.es/autosubmitapp/experiment/t023

You might need to refresh the web page to see the changes.

In GitLab by @wuruchi on Aug 14, 2020, 18:00

Important: The feature is currently showing inaccurate data. There was an undetected issue with the data collection function that is now being addressed.

In GitLab by @wuruchi on Oct 1, 2020, 14:00

Hello @mcastril

JPSY should be working for experiments using the latest version of Autosubmit, database version 12.

In GitLab by @mcastril on Jan 19, 2021, 11:48

@wuruchi you can review the conversation that we had about ASYPD / RSYPD:

ASYPD should count SIM + POSTPROCESSING running & queueing time for a given chunk. The metric is assigned to the SIM of the corresponding chunk.
RSYPD should count the wallclock time (including workflow interruptions) between the initialization of the SIM and the finalization of the chunk (so including CLEAN & TRANSFER jobs). The metric is also assigned to that SIM.

So we are aggregating running and queuing times in ASYPD, while in RSYPD we are just subtracting timestamps.

In order to distinguish SIM and POST jobs we need some meta-information that we can get from the TASK_TYPE parameter:

For SIMs I think we can account any job called SIM or with TASK_TYPE = SIM*
For POST we can add any job called POST or having TASK_TYPE = POST or CMOR* (it's possible that we also need to use the TASK_TYPE argument to run different CMOR jobs using the same template).

What do you think @gmontanepinto @ jberlin . Could we add this TASK_TYPE for the SIM, POST, CMOR, REDUCE, CALC_STATS, etc jobs of the different testing suites?

PS: Notice that in the current implementation of Auto-EC-Earth metrics there's no distinction between ASYPD and RSYPD, we postponed this development.

@macosta

In GitLab by @gmontanepinto on Jan 19, 2021, 17:46

Fine for me!

In GitLab by @wuruchi on Jan 19, 2021, 21:01

Hello @mcastril

From Balaji et al. we have that:

The latest implementation of ASYPD considers $t_0$ as the submit time of the first job in the experiment, $t_N$ as the finish time of the latest SIM job in the experiment, and $N$ is the number of simulated years in total until the latest SIM job.

ASYPD should count SIM + POSTPROCESSING running & queueing time for a given chunk. The metric is assigned to the SIM of the corresponding chunk.

According to this description, I understand that we should not follow the t_0 definition for the metric in the paper and consider only the times for the SIM and POSTPROCESSING jobs.

RSYPD should count the wallclock time (including workflow interruptions) between the initialization of the SIM and the finalization of the chunk (so including CLEAN & TRANSFER jobs). The metric is also assigned to that SIM.

I understand that RSYPD should start counting time from the start time of the SIM job until the finish time of CLEAN or TRANSFER (whichever finishes latest) of the same chunk.

For SIMs I think we can account any job called SIM or with TASK_TYPE = SIM*

It seems that this TASK_TYPE variable is not defined in the jobs_[expid].conf. Where can we find it?

So far, we can only identify SIM type jobs by testing for the inclusion of the string SIM in the job's name.

In GitLab by @mcastril on Jan 20, 2021, 09:38

According to this description, I understand that we should not follow the t_0 definition for the metric in the paper and consider only the times for the SIM and POSTPROCESSING jobs.

Hi @wuruchi. What we did (correct me @macosta if I am wrong) is to create a new ASYPD metric, accounting only queuing + running times for SIM + POST jobs, and use RSYPD to nominate the paper's ASYPD.

So the tN-t0 can be valid for RSYPD.

I understand that RSYPD should start counting time from the start time of the SIM job until the finish time of CLEAN or TRANSFER (whichever finishes latest) of the same chunk.

That's what I had in mind. Instead of taking the total time and dividing by N chunks, we could calculate RSYPD per chunk and average the values.

However, this method for both ASYPD and RSYPD has an important inconvenient, that is not counting the parallelization. In most of our workflows, SIM only depends on the previous SIM (not counting DA and ENKF), so the experiment's total ASYPD should be equal to the sum of the queuing + running time for all the SIM's + the postprocessing of the last chunk (the critical path).

This is what @macosta was referring with:

We are using Autosubmit logs to take an accumulative total time of Run jobs and the last Post, assuming that the others are running in parallel.

And RSYPD is equal to the metric defined in the paper for ASYPD (wallclock time between first SIM and last transfer).

So, to define for once the metric, would it be ok to have:

ASYPD: Sum(Sum(Queuing(SIMi), Running(SIMi)) for i=1,N , Sum(Queuing(POSTi) + (Running(POSTi) i=N)
RSYPD: Substract(EndTime(TRANSFERi) i=N , StartTime(SIMi) i=1)

What do you think @macosta ? Do we need to discuss it?

It seems that this TASK_TYPE variable is not defined in the jobs_[expid].conf. Where can we find it?

It's TASKTYPE actually: https://autosubmit.readthedocs.io/en/latest/variables.html

In GitLab by @macosta on Jan 20, 2021, 10:04

Hello everyone,

It is as @mcastril is commenting. Sorry for the confusion, for some metrics we decided to go beyond the paper. So the idea is to save (at least quantify) every SIMi, POSTi... With this approach we calculate ASYPD as Miguel was commenting but we can also print the average and accumulative time of SIM or POST jobs, or even create new metics in the future if it is needed.

In GitLab by @gmontanepinto on Jan 20, 2021, 10:09

It's TASKTYPE actually: https://autosubmit.readthedocs.io/en/latest/variables.html

This TASKTYPE variable just returns the name of the job (we use it in Auto-MONARCH). I understood that a new variable was going to be created to be able to specify the actual type of the job.

In GitLab by @mcastril on Jan 20, 2021, 10:27

Thanks @macosta . We save all the data and metrics are calculated on the fly, so this gives us the possibility to change things. By now I think @wuruchi can implement the calculus as in my last message and then we will have a review.

You are right @gmontanepinto . I saw Etienne was using it with the value SPINUP_FORCING and I didn't realize that was the actual name of the job.

What do you think we should do? We have two options I think:

Either creating a new variable, which would force to have the correct values filled for the metrics to work.
Or either hardcoding some usual names in the GUI TASKTYPE=(CMOR, POST, REDUCE*...)

I think none of both approaches is ideal. But it maybe is less problematic to have them hardcoded as it is changing one line vs defining one parameter in lots of configurations.

In GitLab by @ jberlin on Jan 20, 2021, 11:39

Hi @mcastril , fine for me ! as mentioned by @gmontanepinto , the TASKTYPE is already there. I used it in the refactor of the CMOR templates, can´t we use the same one ?

In GitLab by @mcastril on Jan 20, 2021, 12:30

Yes, it is possible, but my intention was to use it to create meta-groups (SIM & POST only). Now we will have to work with the fine-grain names.

In GitLab by @wuruchi on Jan 20, 2021, 13:23

Hello @mcastril

So, to define for once the metric, would it be ok to have:

ASYPD: Sum(Sum(Queuing(SIMi), Running(SIMi)) for i=1,N , Sum(Queuing(POSTi) + (Running(POSTi) i=N)
RSYPD: Substract(EndTime(TRANSFERi) i=N , StartTime(SIMi) i=1)

For the record, we define:

$Status() =$ Function that returns the status of a job as a string.
$Run() =$ Function that returns the running time of a COMPLETED job in seconds.
$Queue() =$ Function that returns the queueing time of a COMPLETED job in seconds.
$Start() =$ Returns the start time of a job (not the submitted time) in seconds.
$Finish() =$ Returns the finish time of a job in seconds.
$Years() =$ Function that returns the number of simulated years of a SIM (or identified as a such) job. Calculated from the experiment's chunk size and chunk unit.
$N =$ Number of chunks $j$ such that $Status(SIM_j) = Status(POST_j) = COMPLETED$.
$M =$ Number of chunks $k$ such that $Status(SIM_k) = Status(TRANSFER_k) = COMPLETED$.

$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i) + Queue(POST_i) + Run(POST_i)}$

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{\sum_{i}^{M} Finish(TRANSFER_i) - Start(SIM_i)}$

Where the units are $\frac{years*seconds/day}{seconds} = years/day$

Corrected versions:

$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{(\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i))+ Queue(POST_n) + Run(POST_n)}$

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(TRANSFER_m) - Start(SIM_1)}$

In GitLab by @mcastril on Jan 20, 2021, 15:27

$POST_i$ should be $POST_n$ in ASYPD. If you implement it on a loop, I will be n at the end, but it is more clear specifying it.
And for RSYPD, it should be $TRANSFER_n$ - $SIM_1$. Last TRANSFER's end minus first SIM's start.

In human code:

ASYPD is equal to the sum of all SIM's queuing and running times (because this is the only serialized part, as SIM depends on SIM-1, but POSTs can run in parallel to subsequent SIMs) added to last SIM's POST. Because after the last SIM you also have to postprocess it, and you cannot parallelize that part. This is assuming that all previous POST have run first, which I think it's fair to assume.
RSYPD accounts for the time interval between the start of the first simulation to the end of the last transfer. Basically, the wallclock time of the iterative part of the experiment. Normalized by YPD.

In GitLab by @wuruchi on Jan 20, 2021, 19:15

Hello @mcastril

Thank you for the corrections. Here are the formulas that will be implemented:

$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{(\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i))+ Queue(POST_n) + Run(POST_n)}$

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(TRANSFER_m) - Start(SIM_1)}$

In GitLab by @mcastril on Jan 21, 2021, 10:12

Nice @wuruchi , here we go :ok_hand:

In GitLab by @wuruchi on Jan 21, 2021, 13:03

mentioned in commit autosubmit@9094128b7495df2889d6a39ed6bdc26b4b4bdeff

In GitLab by @wuruchi on Jan 21, 2021, 13:05

Hello @mcastril @macosta

The new definitions of ASYPD and RSYPD have been implemented in Autosubmit API and Autosubmit GUI.

Please check for inconsistencies in the results.

This is a complete example: https://earth.bsc.es/autosubmitapp/experiment/a33f

Some experiments do not have POST or TRANSFER jobs, this is shown as a warning.

In GitLab by @mcastril on Jan 22, 2021, 12:38

Thank you @wuruchi .

At first sight, ASYPD makes sense. In a33f SIM jobs are waiting for ~1 day average, so that's why ASYPD decrease to 0.9. And in my a3e9 it is very close to the SYPD because there's only 1 SIM with a long wait, that is more or less long as the running time of a SIM, so it decreases 10% wrt to SYPD for 10 chunks.

https://earth.bsc.es/autosubmitapp/experiment/a3e9

However, there must be something with the RSYPD. It never can be higher than ASYPD. Is it because a33f does not have a TRANSFER? But I don't see the warning.
Nevertheless, I forgot to add that currently most of the EC-Earth workflows do not have a TRANSFER job because we move the data directly on CLEAN. Could you use CLEAN instead of TRANSFER for RSYPD in case TRANSFER is missing?

I have another question. In experiments having multiple start dates or members, which POST are you taking into account for ASYPD? I think the safest approach would be the average of all last-chunk POSTs.

Anyway this is awesome, I think we will do an announcement through the list once we have these metrics ready because they can be quite useful for users to know their actual throughput including queues.

In GitLab by @wuruchi on Jan 22, 2021, 12:43

Hello @mcastril

However, there must be something with the RSYPD. It never can be higher than ASYPD. Is it because a33f does not have a TRANSFER? But I don't see the warning.

Warnings are shown in the warning section:

Now I notice that the warnings says "ASYPD cannot be computed." It should say RSYPD instead. I'm going to fix it.

Nevertheless, I forgot to add that currently most of the EC-Earth workflows do not have a TRANSFER job because we move the data directly on CLEAN. Could you use CLEAN instead of TRANSFER for RSYPD in case TRANSFER is missing?

Sure, I am going to add it.

I have another question. In experiments having multiple start dates or members, which POST are you taking into account for ASYPD? I think the safest approach would be the average of all last-chunk POSTs.

I am considering the latest POST, and then I only consider the SIM jobs with a finish time lesser than the finish time of that POST. An average would not be a problem.

In GitLab by @macosta on Jan 22, 2021, 12:45

I agree that an average of POST jobs make more sense

In GitLab by @mcastril on Jan 22, 2021, 12:51

Good, thanks!

Wrt to the Warning, I was talking about a33f does do not have any TRANSFER. It has TRANSFER_MEMBER, on the other side, that is not the same, but it could pass the filter maybe as we were talking about wildcards above (to make it more "compatible").

So I think we should better rather looking for "CLEAN" and "TRANSFER" to avoid problems with CLEAN_MEMBER and TRANSFER_MEMBER, that do not move data.

In GitLab by @wuruchi on Jan 25, 2021, 10:23

mentioned in commit autosubmit@8fc799adebc38a3c571b13c15a67d2cb4907c7dc

In GitLab by @wuruchi on Jan 25, 2021, 10:44

Hello @mcastril @macosta

The ASYPD and RSYPD computation has been improved with your suggestions. Here are the updated definitions:

$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{(\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i))+ \frac{1}{N}(\sum_{i}^{N} Queue(POST_i) + Run(POST_i))}$

If TRANSFER jobs exist:

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(TRANSFER_m) - Start(SIM_1)}$

If no TRANSFER jobs but CLEAN jobs exist:

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(CLEAN_m) - Start(SIM_1)}$

The calculation of ASYPD and RSYPD correctly ignores TRANSFER MEMBER and CLEAN MEMBER.

In GitLab by @mcastril on Jan 25, 2021, 14:52

Thanks for the correct specification @wuruchi.

I have one only amendment:

In ASYPD, the N for SIM is different than the N for POST. In SIM we iterate through chunks, and in POST we iterate through startdates / members.

In fact, this is not anything particular of the ASYPD and it should be common to all metrics (SYPD, RSYPD too). In any experiment having multiple members / startdates, all these metrics should be averages over those levels (member / startdate). So you maybe can leave just N in POST (without the sum) and we will just specify the metric for a given member. Then we remind that all are averages at the experiment level. Sorry if it was not so clear in my previous message.

And just another comment about the dynamic update of the metrics:

If a user wants to query the ASYPD or RSYPD of a running experiment (same as they can do with SYPD), I think we should provide some data, even if the experiment is not finished. So I think that you could simply provide the metric for the last completed chunk, i.e. chunk run until POST for ASYPD, or chunk run until CLEAN/TRANS for RSYPD.

In GitLab by @wuruchi on Jan 25, 2021, 15:23

Hello @mcastril

In ASYPD, the N for SIM is different than the N for POST. In SIM we iterate through chunks, and in POST we iterate through startdates / members.

In fact, this is not anything particular of the ASYPD and it should be common to all metrics (SYPD, RSYPD too). In any experiment having multiple members / startdates, all these metrics should be averages over those levels (member / startdate). So you maybe can leave just N in POST (without the sum) and we will just specify the metric for a given member. Then we remind that all are averages at the experiment level. Sorry if it was not so clear in my previous message.

Got it. I will update the definition in the API Wiki.

If a user wants to query the ASYPD or RSYPD of a running experiment (same as they can do with SYPD), I think we should provide some data, even if the experiment is not finished. So I think that you could simply provide the metric for the last completed chunk, i.e. chunk run until POST for ASYPD, or chunk run until CLEAN/TRANS for RSYPD.

I think this should be working like that unless there are no completed SIM jobs.

In GitLab by @mcastril on Feb 3, 2021, 16:10

mentioned in issue autosubmit#655

In GitLab by @mcastril on Mar 22, 2021, 15:53

mentioned in issue autosubmit#674

In GitLab by @wuruchi on May 31, 2021, 12:26

As I see it, the objectives of this issue have been reached. I will change it to documentation and close it. New developments on the topic of Performance Metrics would require a new issue that specifies those.

In GitLab by @mcastril on Apr 27, 2020, 12:47

moved from autosubmit#524

BSC-ES / autosubmit-gui

Performance metrics in Autosubmit/GUI #27

Corrected versions: