In GitLab by @mcastril on Jul 14, 2020, 19:01
Thanks @gmontanepinto . In fact, Auto-EC-Earth also has the capability to run CMOR jobs in Nord3, although we usually run them on the same platform as the SIM, to avoid problems in PRACE/RES workflows.
I agree that it is difficult to do it without adding metadata. This should not affect the users because they usually copy experiments from the testing suite.
In GitLab by @pablohe on Jul 15, 2020, 07:56
> Thanks @gmontanepinto . In fact, Auto-EC-Earth also has the capability to run CMOR jobs in Nord3, although we usually run them on the same platform as the SIM, to avoid problems in PRACE/RES workflows.
CMORIZATION doesn't work in Nord3, @aamaral could you please confirm this?
> The current implementation seems to rely on the epilog from the SIM job. This is written at the end of the job logs in Nord3 (I guess it uses MaxRSS), but not in MN4. I guess you only get this information in Nord3 @pablohe right?
MaxRSS is not referenced in the code.
In GitLab by @aamaral on Jul 15, 2020, 11:18
Hi @pablohe
From what I remember it could work. I saw that I modified the configure.sh file on platforms/nord3, but to be honest I can't remember if it was tested. I usually ran the cmorisation itself on MN4.
In GitLab by @wuruchi on Jul 15, 2020, 13:10
closed via merge request autosubmit!183
In GitLab by @wuruchi on Jul 15, 2020, 13:10
mentioned in commit autosubmit@d40a0f0dc44702d2a9acd28376a04705a633cf65
In GitLab by @wuruchi on Jul 15, 2020, 13:19
reopened
In GitLab by @wuruchi on Jul 15, 2020, 13:29
Hello team,
The necessary changes have been applied to autosubmit!183 in order to store the information we will need to provide these performance metrics.
From the moment these changes are deployed, the historical job data will be stored under /esarchive/autosubmit/as_metadata/data/, including a new `extra_data` column. This column stores the data we will gather from SLURM as a JSON object, which effectively allows us to add more values in the future without having to change the structure of the table. With this information, we can now focus on implementing the API calls to retrieve the performance metrics and the visualization in the GUI.
Also, once these changes are deployed and users start using it, we will slowly transition from the old retrieval tool (the workers that read from the file system) to consume the data stored in the experiment database.
You can query `/esarchive/autosubmit/as_metadata/data/job_data_a2ze.db` for an example of how the database stores the changes and SLURM data.
Executing: `select job_name, counter, last, submit, start, finish, status, platform, job_id, energy, extra_data from job_data;`
This is one row: `a2ze_20120101_000_2_REDUCE|2|1|1594810140|1594810202|1594810217|COMPLETED|marenostrum4|10904382|2230|{"10904382.extern": {"MaxRSS": "538K", "AveRSS": "538K", "energy": "2.23K"}, "10904382.batch": {"MaxRSS": "692K", "AveRSS": "692K", "energy": "1.99K"}, "10904382": {"MaxRSS": "NA", "AveRSS": "NA", "energy": "2.23K"}}`
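For anyone who wants to consume this directly, here is a minimal sketch (my own, not Autosubmit code; it only assumes the column names shown in the query above and the Python standard library) of reading the historical database and decoding the `extra_data` JSON per SLURM step:

```python
# Hedged sketch: read the historical job database and decode the extra_data JSON.
# Assumes only the columns shown in the SELECT above; not actual Autosubmit code.
import json
import sqlite3

DB_PATH = "/esarchive/autosubmit/as_metadata/data/job_data_a2ze.db"

def completed_jobs(db_path=DB_PATH):
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT job_name, energy, extra_data FROM job_data WHERE status = 'COMPLETED'"
        ).fetchall()
    finally:
        conn.close()
    for job_name, energy, extra_data in rows:
        # extra_data is a JSON object keyed by SLURM step id (e.g. "10904382.batch")
        steps = json.loads(extra_data) if extra_data else {}
        yield job_name, energy, steps

for name, energy, steps in completed_jobs():
    max_rss = {step: info.get("MaxRSS") for step, info in steps.items()}
    print(name, energy, max_rss)
```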
Finally, we also try to get the submit, start, and finish times directly from SLURM when possible.
In GitLab by @mcastril on Jul 23, 2020, 12:59
> MaxRSS is not referenced in the code.

I think it is MaxRSS, which is what Nord3 reports as "Max Memory".
Thanks @aamaral , it was only an example showing that we cannot rely on a fixed rule, and that it is better to establish some metadata to distinguish between SIM, POST (in a general sense), and TRANSFER jobs.
In GitLab by @mcastril on Jul 23, 2020, 13:03
Thanks @wuruchi . Could we have an idea of the overhead of extra_data? For the current volume of experiments as an example. Just to be sure about the tradeoff for storing this information.
> Also, once these changes are deployed and users start using it, we will slowly transition from the old retrieval tool (the workers that read from the file system) to consume the data stored in the experiment database.
How will we deal with older experiments? Could the available information be added with an exporter, or will they not be supported?
In GitLab by @wuruchi on Jul 23, 2020, 13:45
Hello @mcastril
> Could we have an idea of the overhead of extra_data? For the current volume of experiments as an example. Just to be sure about the tradeoff for storing this information.
Saving the information into the database should not add noticeable overhead. However, we will see how it behaves once we start reading the data. The complexity is linear as a function of the input.
> How will we deal with older experiments? Could the available information be added with an exporter, or will they not be supported?
I have adapted the API to read information from the old data source (`as_times.db`) when the historical database does not exist. This should be enough for the viewers (Tree and Graph). However, the performance metrics that involve energy consumption and other information that was not previously available will be computed only for those jobs present in the historical database.
In GitLab by @mcastril on Jul 23, 2020, 16:51
> Saving the information into the database should not add noticeable overhead. However, we will see how it behaves once we start reading the data. The complexity is linear as a function of the input.
At this point I was referring more to the storage (sorry for not being specific). The extra_data is a big field, and in experiments with thousands of jobs it may make the database... three times bigger? It seems sqlite3 writes binary data but does not support compression by default.
I understand this takes less space than the log files, for example, but in any case we have to take into consideration that the `/esarchive/autosubmit` directory is already too big, and we should be careful when adding new data.
> I have adapted the API to read information from the old data source (`as_times.db`) when the historical database does not exist. This should be enough for the viewers (Tree and Graph). However, the performance metrics that involve energy consumption and other information that was not previously available will be computed only for those jobs present in the historical database.
That's perfect, thanks.
In GitLab by @wuruchi on Jul 24, 2020, 12:25
Hello @mcastril
> At this point I was referring more to the storage (sorry for not being specific). The extra_data is a big field, and in experiments with thousands of jobs it may make the database... three times bigger? It seems sqlite3 writes binary data but does not support compression by default.
The `extra_data` field is only filled for `SIM` jobs. For most experiments, `SIM` jobs represent only a fraction of the total number of jobs, so we should not expect much storage overhead in that regard.
I will take a look at the exact size of the different types of columns in sqlite3 to give a more accurate estimation.
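For what it's worth, a rough way to get that estimate (just a sketch against the `job_data` table described above, using SQLite's LENGTH on the text column; not part of Autosubmit itself) could be:

```python
# Rough sketch: estimate how much of the database file the extra_data column takes.
# Assumes the job_data table and extra_data column described earlier in this thread.
import os
import sqlite3

db_path = "/esarchive/autosubmit/as_metadata/data/job_data_a2ze.db"
conn = sqlite3.connect(db_path)
extra_bytes = conn.execute(
    "SELECT COALESCE(SUM(LENGTH(extra_data)), 0) FROM job_data"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM job_data").fetchone()[0]
conn.close()

file_bytes = os.path.getsize(db_path)
print(f"{row_count} rows; extra_data ~{extra_bytes / 1024:.1f} KiB "
      f"out of a {file_bytes / 1024:.1f} KiB file "
      f"({100.0 * extra_bytes / max(file_bytes, 1):.1f}%)")
```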
In GitLab by @mcastril on Jul 28, 2020, 11:16
Thanks @wuruchi . It's true that usually each chunk has its post-processing and transfer jobs, so in the end SIMs may be 20% or 25% of the jobs. This raises the interesting question of whether or not to include the energy consumption of the post-processing. These are usually serial jobs (using one full node), but they usually take some time, and in SR experiments where we run SIMs with 10-15 nodes maybe it is not so negligible.
In GitLab by @wuruchi on Jul 29, 2020, 13:21
Hello @mcastril @pablohe @macosta @gmontanepinto
I have added `JPSY` to the current implementation of the Autosubmit API and the GUI. The Performance Metrics API call has been updated to reflect these changes.
Also, the GUI has been updated to show this metric and the energy value from the job historical database: https://earth.bsc.es/autosubmitapp/experiment/t023
You might need to refresh the web page to see the changes.
In GitLab by @wuruchi on Aug 14, 2020, 18:00
Important: The feature is currently showing inaccurate data. There was an undetected issue with the data collection function that is now being addressed.
In GitLab by @wuruchi on Oct 1, 2020, 14:00
Hello @mcastril
`JPSY` should be working for experiments using the latest version of Autosubmit (database version 12).
In GitLab by @mcastril on Jan 19, 2021, 11:48
@wuruchi you can review the conversation that we had about ASYPD / RSYPD:
ASYPD should count SIM + POSTPROCESSING running & queueing time for a given chunk. The metric is assigned to the SIM of the corresponding chunk.
RSYPD should count the wallclock time (including workflow interruptions) between the initialization of the SIM and the finalization of the chunk (so including CLEAN & TRANSFER jobs). The metric is also assigned to that SIM.
So we are aggregating running and queuing times in ASYPD, while in RSYPD we are just subtracting timestamps.
In order to distinguish SIM and POST jobs we need some meta-information that we can get from the TASK_TYPE parameter.
What do you think @gmontanepinto @jberlin ? Could we add this TASK_TYPE for the SIM, POST, CMOR, REDUCE, CALC_STATS, etc. jobs of the different testing suites?
PS: Notice that in the current implementation of Auto-EC-Earth metrics there's no distinction between ASYPD and RSYPD; we postponed this development.
@macosta
In GitLab by @gmontanepinto on Jan 19, 2021, 17:46
Fine for me!
In GitLab by @wuruchi on Jan 19, 2021, 21:01
Hello @mcastril
From Balaji et al. we have the reference definition. The latest implementation of `ASYPD` considers $t_0$ as the submit time of the first job in the experiment, $t_N$ as the finish time of the latest `SIM` job in the experiment, and $N$ as the total number of simulated years up to the latest `SIM` job.
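Under those definitions, and keeping the same seconds-per-day constant used in the formulas below, the paper's metric would read (my own rendering, for reference only):

$ASYPD_{paper} = \frac{N * 86400}{t_N - t_0}$

with $N$ in simulated years and $t_0$, $t_N$ in seconds, which gives years/day.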
> ASYPD should count SIM + POSTPROCESSING running & queueing time for a given chunk. The metric is assigned to the SIM of the corresponding chunk.
According to this description, I understand that we should not follow the $t_0$ definition for the metric in the paper and consider only the times for the `SIM` and `POSTPROCESSING` jobs.
> RSYPD should count the wallclock time (including workflow interruptions) between the initialization of the SIM and the finalization of the chunk (so including CLEAN & TRANSFER jobs). The metric is also assigned to that SIM.
I understand that `RSYPD` should start counting time from the start time of the `SIM` job until the finish time of `CLEAN` or `TRANSFER` (whichever finishes latest) of the same chunk.
> For SIMs I think we can account any job called SIM or with TASK_TYPE = SIM*
It seems that this `TASK_TYPE` variable is not defined in the `jobs_[expid].conf`. Where can we find it?
So far, we can only identify `SIM`-type jobs by testing for the inclusion of the string `SIM` in the job's name.
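In case it helps to see the heuristic spelled out, this is roughly what that check amounts to (a sketch, not the actual API code; the example names follow the expid_date_member_chunk_section pattern visible in the sample row earlier in this thread):

```python
# Sketch of the current heuristic: a job counts as SIM if "SIM" appears in its name.
def is_sim_job(job_name: str) -> bool:
    return "SIM" in job_name.upper()

# Hypothetical job names following the expid_date_member_chunk_section pattern:
print(is_sim_job("a2ze_20120101_000_2_SIM"))     # True
print(is_sim_job("a2ze_20120101_000_2_REDUCE"))  # False
print(is_sim_job("a2ze_20120101_000_2_POST"))    # False
```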
In GitLab by @mcastril on Jan 20, 2021, 09:38
> According to this description, I understand that we should not follow the $t_0$ definition for the metric in the paper and consider only the times for the `SIM` and `POSTPROCESSING` jobs.
Hi @wuruchi. What we did (correct me @macosta if I am wrong) was to create a new ASYPD metric, accounting only for the queuing + running times of SIM + POST jobs, and use RSYPD to designate the paper's ASYPD.
So the $t_N - t_0$ approach can be valid for RSYPD.
> I understand that `RSYPD` should start counting time from the start time of the `SIM` job until the finish time of `CLEAN` or `TRANSFER` (whichever finishes latest) of the same chunk.
That's what I had in mind. Instead of taking the total time and dividing by N chunks, we could calculate RSYPD per chunk and average the values.
However, this method for both ASYPD and RSYPD has an important drawback: it does not account for parallelization. In most of our workflows, SIM only depends on the previous SIM (not counting DA and ENKF), so the experiment's total ASYPD should be equal to the sum of the queuing + running times of all the SIMs plus the post-processing of the last chunk (the critical path).
This is what @macosta was referring to:
> We are using Autosubmit logs to take an accumulative total time of Run jobs and the last Post, assuming that the others are running in parallel.
And RSYPD is equal to the metric defined in the paper for ASYPD (wallclock time between first SIM and last transfer).
So, to define the metric once and for all, would it be ok to have:
ASYPD: Sum over i=1..N of (Queuing(SIM_i) + Running(SIM_i)), plus Queuing(POST_N) + Running(POST_N)
RSYPD: EndTime(TRANSFER_N) - StartTime(SIM_1)
What do you think @macosta ? Do we need to discuss it?
> It seems that this `TASK_TYPE` variable is not defined in the `jobs_[expid].conf`. Where can we find it?
It's TASKTYPE actually: https://autosubmit.readthedocs.io/en/latest/variables.html
In GitLab by @macosta on Jan 20, 2021, 10:04
Hello everyone,
It is as @mcastril says. Sorry for the confusion; for some metrics we decided to go beyond the paper. So the idea is to save (or at least quantify) every SIMi, POSTi... With this approach we calculate ASYPD as Miguel was describing, but we can also print the average and cumulative time of SIM or POST jobs, or even create new metrics in the future if needed.
In GitLab by @gmontanepinto on Jan 20, 2021, 10:09
> It's TASKTYPE actually: https://autosubmit.readthedocs.io/en/latest/variables.html
This TASKTYPE variable just returns the name of the job (we use it in Auto-MONARCH). I understood that a new variable was going to be created to be able to specify the actual type of the job.
In GitLab by @mcastril on Jan 20, 2021, 10:27
Thanks @macosta . We save all the data and the metrics are calculated on the fly, so this gives us the possibility to change things. For now, I think @wuruchi can implement the calculation as in my last message and then we will review it.
You are right @gmontanepinto . I saw Etienne was using it with the value `SPINUP_FORCING` and I didn't realize that was the actual name of the job.
What do you think we should do? I see two options: hardcode the job type names in the metrics code, or define a new parameter in the job configurations.
I think neither approach is ideal, but it is maybe less problematic to have them hardcoded, as that means changing one line versus defining one parameter in lots of configurations.
In GitLab by @jberlin on Jan 20, 2021, 11:39
Hi @mcastril , fine for me! As mentioned by @gmontanepinto , the TASKTYPE is already there. I used it in the refactor of the CMOR templates; can't we use the same one?
In GitLab by @mcastril on Jan 20, 2021, 12:30
Yes, it is possible, but my intention was to use it to create meta-groups (SIM & POST only). Now we will have to work with the fine-grained names.
In GitLab by @wuruchi on Jan 20, 2021, 13:23
Hello @mcastril
> So, to define the metric once and for all, would it be ok to have:
> ASYPD: Sum over i=1..N of (Queuing(SIM_i) + Running(SIM_i)), plus Queuing(POST_N) + Running(POST_N)
> RSYPD: EndTime(TRANSFER_N) - StartTime(SIM_1)
For the record, we define:
- Status() = function that returns the status of a job as a string.
- Run() = function that returns the running time of a COMPLETED job in seconds.
- Queue() = function that returns the queueing time of a COMPLETED job in seconds.
- Start() = returns the start time of a job (not the submit time) in seconds.
- Finish() = returns the finish time of a job in seconds.
- Years() = function that returns the number of simulated years of a SIM job (or a job identified as such), calculated from the experiment's chunk size and chunk unit.
- $N$ = number of chunks $j$ such that $Status(SIM_j) = Status(POST_j) = COMPLETED$.
- $M$ = number of chunks $k$ such that $Status(SIM_k) = Status(TRANSFER_k) = COMPLETED$.

$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i) + Queue(POST_i) + Run(POST_i)}$

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{\sum_{i}^{M} Finish(TRANSFER_i) - Start(SIM_i)}$

Where the units are $\frac{years*seconds/day}{seconds} = years/day$.
In GitLab by @mcastril on Jan 20, 2021, 15:27
$POST_i$ should be $POST_n$ in ASYPD. If you implement it in a loop, $i$ will be $n$ at the end, but it is clearer to specify it.
And for RSYPD, it should be $TRANSFER_n$ - $SIM_1$: the last TRANSFER's end minus the first SIM's start.
In human code:
ASYPD is equal to the sum of all SIMs' queuing and running times (because this is the only serialized part, as each SIM depends on the previous SIM, while POSTs can run in parallel to subsequent SIMs), plus the last SIM's POST; after the last SIM you still have to post-process it, and you cannot parallelize that part. This assumes that all previous POSTs have run first, which I think is fair to assume.
RSYPD accounts for the time interval between the start of the first simulation and the end of the last transfer. Basically, the wallclock time of the iterative part of the experiment, normalized to years per day.
In GitLab by @wuruchi on Jan 20, 2021, 19:15
Hello @mcastril
Thank you for the corrections. Here are the formulas that will be implemented:
$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{(\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i)) + Queue(POST_n) + Run(POST_n)}$

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(TRANSFER_m) - Start(SIM_1)}$
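A minimal sketch of how these could be computed (my own illustration, not the Autosubmit API code; it assumes each job record already carries queue/run durations and start/finish timestamps in seconds, plus the simulated years derived from chunk size and chunk unit):

```python
# Sketch of the agreed formulas; job records are plain dicts with assumed keys.
SECONDS_PER_DAY = 86400

def asypd(sims, last_post):
    """Sum of SIM queue + run times plus the last POST's queue + run, as years/day."""
    years = sum(job["years"] for job in sims)
    seconds = sum(job["queue"] + job["run"] for job in sims)
    seconds += last_post["queue"] + last_post["run"]
    return years * SECONDS_PER_DAY / seconds if seconds > 0 else 0.0

def rsypd(sims, last_transfer):
    """Simulated years over the wallclock span from first SIM start to last TRANSFER finish."""
    years = sum(job["years"] for job in sims)
    span = last_transfer["finish"] - sims[0]["start"]
    return years * SECONDS_PER_DAY / span if span > 0 else 0.0
```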
In GitLab by @mcastril on Jan 21, 2021, 10:12
Nice @wuruchi , here we go :ok_hand:
In GitLab by @wuruchi on Jan 21, 2021, 13:03
mentioned in commit autosubmit@9094128b7495df2889d6a39ed6bdc26b4b4bdeff
In GitLab by @wuruchi on Jan 21, 2021, 13:05
Hello @mcastril @macosta
The new definitions of `ASYPD` and `RSYPD` have been implemented in the Autosubmit API and the Autosubmit GUI.
Please check for inconsistencies in the results.
This is a complete example: https://earth.bsc.es/autosubmitapp/experiment/a33f
Some experiments do not have `POST` or `TRANSFER` jobs; this is shown as a warning.
In GitLab by @mcastril on Jan 22, 2021, 12:38
Thank you @wuruchi .
At first sight, ASYPD makes sense. In `a33f`, SIM jobs wait for ~1 day on average, so that's why ASYPD decreases to 0.9. And in my `a3e9` it is very close to the SYPD because there's only 1 SIM with a long wait, which is roughly as long as the running time of a SIM, so it decreases by 10% with respect to SYPD over 10 chunks.
https://earth.bsc.es/autosubmitapp/experiment/a3e9
However, there must be something off with the RSYPD: it can never be higher than ASYPD. Is it because `a33f` does not have a `TRANSFER`? But I don't see the warning.
Nevertheless, I forgot to add that currently most of the EC-Earth workflows do not have a `TRANSFER` job because we move the data directly in `CLEAN`. Could you use `CLEAN` instead of `TRANSFER` for `RSYPD` in case `TRANSFER` is missing?
I have another question: in experiments having multiple start dates or members, which POST are you taking into account for `ASYPD`? I think the safest approach would be the average of all last-chunk POSTs.
Anyway, this is awesome. I think we will make an announcement through the list once these metrics are ready, because they can be quite useful for users to know their actual throughput, including queues.
In GitLab by @wuruchi on Jan 22, 2021, 12:43
Hello @mcastril
> However, there must be something off with the RSYPD: it can never be higher than ASYPD. Is it because `a33f` does not have a `TRANSFER`? But I don't see the warning.
Warnings are shown in the warning section:
Now I notice that the warning says "`ASYPD` cannot be computed." It should say `RSYPD` instead. I'm going to fix it.
> Nevertheless, I forgot to add that currently most of the EC-Earth workflows do not have a `TRANSFER` job because we move the data directly in `CLEAN`. Could you use `CLEAN` instead of `TRANSFER` for `RSYPD` in case `TRANSFER` is missing?
Sure, I am going to add it.
> I have another question: in experiments having multiple start dates or members, which POST are you taking into account for `ASYPD`? I think the safest approach would be the average of all last-chunk POSTs.
I am considering the latest POST, and then I only consider the SIM jobs with a finish time earlier than the finish time of that POST. An average would not be a problem.
In GitLab by @macosta on Jan 22, 2021, 12:45
I agree that an average of POST jobs makes more sense.
In GitLab by @mcastril on Jan 22, 2021, 12:51
Good, thanks!
Regarding the warning, I was talking about `a33f` not having any `TRANSFER` job. It has `TRANSFER_MEMBER`, on the other hand, which is not the same, but maybe it could pass the filter, since we were talking about wildcards above (to make it more "compatible").
So I think we should rather look specifically for "CLEAN" and "TRANSFER" to avoid problems with CLEAN_MEMBER and TRANSFER_MEMBER, which do not move data.
In GitLab by @wuruchi on Jan 25, 2021, 10:23
mentioned in commit autosubmit@8fc799adebc38a3c571b13c15a67d2cb4907c7dc
In GitLab by @wuruchi on Jan 25, 2021, 10:44
Hello @mcastril @macosta
The `ASYPD` and `RSYPD` computation has been improved with your suggestions. Here are the updated definitions:
$ASYPD = \frac{(\sum_{i}^{N} Years(SIM_i))*86400}{(\sum_{i}^{N} Queue(SIM_i) + Run(SIM_i)) + \frac{1}{N}(\sum_{i}^{N} Queue(POST_i) + Run(POST_i))}$
If `TRANSFER` jobs exist:

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(TRANSFER_m) - Start(SIM_1)}$
If there are no `TRANSFER` jobs but `CLEAN` jobs exist:

$RSYPD = \frac{(\sum_{i}^{M} Years(SIM_i))*86400}{Finish(CLEAN_m) - Start(SIM_1)}$
The calculation of `ASYPD` and `RSYPD` correctly ignores `TRANSFER_MEMBER` and `CLEAN_MEMBER`.
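For reference, the filtering described above could look roughly like this (a sketch with hypothetical field names, not the actual API code): only jobs whose section is exactly TRANSFER or CLEAN are considered, which excludes TRANSFER_MEMBER and CLEAN_MEMBER, and TRANSFER is preferred when both exist.

```python
# Sketch: pick the job that closes the RSYPD interval, preferring TRANSFER over CLEAN
# and matching section names exactly so *_MEMBER jobs are excluded.
# The "section", "status", and "finish" field names are assumptions for illustration.
def pick_rsypd_target(jobs):
    transfers = [j for j in jobs if j["section"] == "TRANSFER" and j["status"] == "COMPLETED"]
    cleans = [j for j in jobs if j["section"] == "CLEAN" and j["status"] == "COMPLETED"]
    candidates = transfers or cleans
    return max(candidates, key=lambda j: j["finish"]) if candidates else None
```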
In GitLab by @mcastril on Jan 25, 2021, 14:52
Thanks for the correct specification @wuruchi.
I have only one amendment:
In `ASYPD`, the N for SIM is different from the N for POST: for SIM we iterate through chunks, while for POST we iterate through start dates / members.
In fact, this is not particular to `ASYPD`; it should be common to all metrics (SYPD and RSYPD too). In any experiment having multiple members / start dates, all these metrics should be averages over those levels (member / start date). So maybe you can leave just N in POST (without the sum) and we will simply specify the metric for a given member; then we note that all of them are averages at the experiment level. Sorry if it was not so clear in my previous message.
And just another comment about the dynamic update of the metrics:
If a user wants to query the `ASYPD` or `RSYPD` of a running experiment (just as they can do with SYPD), I think we should provide some data even if the experiment is not finished. So I think you could simply provide the metric for the last completed chunk, i.e. the chunk run up to POST for ASYPD, or up to CLEAN/TRANSFER for RSYPD.
In GitLab by @wuruchi on Jan 25, 2021, 15:23
Hello @mcastril
> In `ASYPD`, the N for SIM is different from the N for POST: for SIM we iterate through chunks, while for POST we iterate through start dates / members. In fact, this is not particular to `ASYPD`; it should be common to all metrics (SYPD and RSYPD too). In any experiment having multiple members / start dates, all these metrics should be averages over those levels (member / start date). So maybe you can leave just N in POST (without the sum) and we will simply specify the metric for a given member; then we note that all of them are averages at the experiment level. Sorry if it was not so clear in my previous message.
Got it. I will update the definition in the API Wiki.
> If a user wants to query the `ASYPD` or `RSYPD` of a running experiment (just as they can do with SYPD), I think we should provide some data even if the experiment is not finished. So I think you could simply provide the metric for the last completed chunk, i.e. the chunk run up to POST for ASYPD, or up to CLEAN/TRANSFER for RSYPD.

I think it should already work like that, unless there are no completed `SIM` jobs.
In GitLab by @mcastril on Feb 3, 2021, 16:10
mentioned in issue autosubmit#655
In GitLab by @mcastril on Mar 22, 2021, 15:53
mentioned in issue autosubmit#674
In GitLab by @wuruchi on May 31, 2021, 12:26
As I see it, the objectives of this issue have been reached. I will change it to documentation and close it. New developments on the topic of performance metrics will require a new issue that specifies them.
In GitLab by @mcastril on Apr 27, 2020, 12:47
moved from autosubmit#524
In GitLab by @mcastril on Apr 27, 2020, 12:47
Following the conversation started in https://earth.bsc.es/gitlab/es/autosubmit/issues/522#note_77443
We will include in this issue the metrics that we can implement in Autosubmit/GUI and discuss their development.