Performance metrics in Autosubmit/GUI #27

Closed kinow closed 2 months ago

In GitLab by @mcastril on Apr 27, 2020, 12:47

Following the conversation started in https://earth.bsc.es/gitlab/es/autosubmit/issues/522#note_77443

Yes, sometimes I have been looking in the GUI and missed this information, have real timestamps. Another twist, but very reliable with the information that we have, would be to filter all the SIM jobs (if there's any), get the total simulated time (sum of SIM running times avoiding job repetitions) and divide by the simulated time chunkchunk_sizenchunks - startdate. That would give you an estimated number of SYPD. If you count queuing times you would have ASYPD. Well, we can define this a little bit more but this is something that definitely we will have one day.

We will include in this issue the metrics that we can implement in Autosubmit/GUI and dicuss about their development.

In GitLab by @mcastril on Apr 27, 2020, 12:58

@wuruchi , at this point @gmontane , @ojorba and the people from AC are interested in following changes in simulation times for the different testing suite test iterations.

In the GUI we currently have some tools to follow that information:

The tree view provides Queuing/Running time per job. It is very visual and easy to see big changes in the simulation time for a given chunk.

Captura_de_pantalla_de_2020-04-27_12-40-57

The Statistics section provides aggregated times per job/chunk, in a bar plot view with different series: Queued Run Failed Jobs_Failed Queued_Fail Run. And a summary report at the bottom.

Captura_de_pantalla_de_2020-04-27_12-40-03

Finally, there's another feature in the experiment's search view, allowing to see average running/queuing times for the jobs in an experiment (this is very useful to see failed jobs at a glance):

Captura_de_pantalla_de_2020-04-27_12-52-51

So I think we have enough tools to work, as soon as they give reliable information. But I think there are two things with a lot of potential:

@wuruchi do you think it would be easy to add another line in the Experiment summary (in the experiment's view) with the SIM average running time and SIM average queuing time, in case the experiment has SIM jobs? Both Auto-EC-Earth and Auto-MONARCH use the SIM nomenclature for their simulations, so having this we have much of the share.
@wuruchi as a second step, do you think that this new data could be available in the Autosubmit API? Then @gmontane and @pechevar could consume this info and serve it in the testing suite software.

With these you would have information of the current testing suite in order to handle it as you wish. Afterwards we could discuss about comparing with previous experiments. For this we would need to access the information from previous runs.

In GitLab by @mcastril on Apr 27, 2020, 12:59

I add @macosta to the loop.

In GitLab by @macosta on Apr 27, 2020, 13:05

I am happy that this is on going!

As you know CPMIP are standard performance metrics and our goal should be to include all of them through Autosubmit, we could have a telco for discussion

In GitLab by @wuruchi on Apr 27, 2020, 13:20

Hello @mcastril

Thanks for the detailed description of your requirements.

It should be possible to implement this in a reasonable time. I will update this issue as soon as I have something to show.

In GitLab by @wuruchi on Apr 27, 2020, 17:40

Hello @mcastril

I have implemented the first version of this feature in: https://earth.bsc.es/autosubmitapp/. Click on Show Detailed Data or Summary of any experiment. Then you will see in the summary the line:

SIM (X): avg. queue Y min. (Z) | run A min. (B)

where:

X: Number of SIM jobs in the experiment.
Y: Average Queuing time for a number Z of jobs with status FAILED, SUBMITTED, QUEUING, RUNNING, COMPLETED.
A: Average Running time for a number B of jobs with status FAILED, RUNNING, COMPLETED.

The API request has been implemented, more information in: https://earth.bsc.es/gitlab/wuruchi/autosubmitreact/-/wikis/Experiment-Summary

In GitLab by @mcastril on Apr 27, 2020, 18:10

Hi @wuruchi , thanks for your promptness. I think it is a very nice feature. @gmontane , @pechevar , you have it ready in the API, so you can improve the testing suite to give this number in the report.

One thing that we have to think how to handle is the queuing times in experiments with wrappers. Jobs waiting in a running wrapper count as queuing, even if they are still not ready to run. This inflates the total and average queuing times for experiments with wrappers. We will look for a solution, this is only for you to take into account this fact when reading the numbers.

In GitLab by @wuruchi on Apr 28, 2020, 10:16

The next step is to include these metrics somewhere inside the Experiment Page. That will require a little bit more work since the data should be retrieved from the Tree View or Graph View instead of directly from the API, to reuse the already retrieved data instead of generating unnecessary traffic for the API.

I will start that work, but on the front-end, I am open to suggestions on where to place these metrics.

In GitLab by @mcastril on Apr 28, 2020, 10:58

I think you can create an extra tab "Performance". The Tree View loads automatically when entering in the Experiment Page, so the information should be there I guess. What you could do is an automatic Refresh of the Tree View info when loading the Performance tab.

In GitLab by @gmontane on Apr 29, 2020, 12:01

Hi @wuruchi, thanks for implementing this. A couple of questions:

Is the summary based only on the last run of the experiment? If yes, would it be possible to get also the data from previous runs? Not by default, but maybe with an extra parameter in the API request.
Also I found that in the case of our lightweight experiments that take less than a minute to run, the returned SIM avg run time is 0:

Captura_de_Pantalla_2020-04-29_a_les_11.46.38

It is possible to show this time with more precision? At least for the case where it is less than 1 minute.

In GitLab by @wuruchi on Apr 29, 2020, 14:25

Hello @gmontane

It is possible to show this time with more precision? At least for the case where it is less than 1 minute.

Sure, this can be fixed. I will comment here as soon as it is implemented.

Is the summary based only on the last run of the experiment? If yes, would it be possible to get also the data from previous runs? Not by default, but maybe with an extra parameter in the API request.

I think this is possible since Autosubmit stores in the logs (_TOTAL_STATS files) the stats for all the runs of the corresponding jobs; however, I am not sure how reliable this could be. Perhaps @mcastril has something to say about this.

In case we try to implement this feature, it will take some time; there are some fundamental things that need to be changed in the way we are currently mapping information from Autosubmit logs to the central database that the API queries.

In GitLab by @mcastril on Apr 29, 2020, 16:42

Yes, I was thinking this is no trivial.

With these you would have information of the current testing suite in order to handle it as you wish. Afterwards we could discuss about comparing with previous experiments. For this we would need to access the information from previous runs.

All this information is from the Autosubmit new job database right? So looking into TOTAL_STATS will easily produce some inconsistencies. We have a pending task to review Autostubmit stats.

On the other hand I think @gmontane means previous runs done from now on (not historical). In that case, it would be possible to access the information in the database for previous runs? Is it being kept?

In GitLab by @wuruchi on Apr 29, 2020, 16:48

Hello @mcastril

On the other hand I think @gmontane means previous runs done from now on (not historical). In that case, it would be possible to access the information in the database for previous runs? Is it being kept?

No, the current implementation only keeps the latest job information.

It could be implemented, but I think that information management of that scope will require that each experiment has its own database, because it will make database locked errors more frequent as the size and frequency of information insertions and updates will increase.

On the other hand, we can make the switch to elasticsearch and keep data centralized.

In GitLab by @mcastril on Apr 29, 2020, 16:54

Thanks @wuruchi . This is something that we have to think and discuss carefully.

In the meantime, as this is a new feature, @gmontane I think you can simply write the weekly suite output in an issue as it is done for Auto-EC-Earth, and you can post the SYPD there too, so there's a way to look back.

In GitLab by @gmontane on Apr 29, 2020, 17:04

OK, for now we can just make the testing suite give the numbers of the current run and do a manual comparison. Thanks!

In GitLab by @macosta on Apr 29, 2020, 17:11

Taking into account that some of the metrics that you are talking are closed to the CPMIP metrics and the performance job using TOTAL-STATS to calculate them. Does it make sense to think about the partial CPMIP implementation now? Could we do a telco next week maybe?

In GitLab by @mcastril on Apr 29, 2020, 17:14

Yes, we can start talking about it. This is something that we had in mind so it fits perfectly with our plans.

In GitLab by @mcastril on Apr 29, 2020, 17:18

I chose a date/time in which all of you are available in the calendar. Please tell me if there's any problem.

@wuruchi @dbeltran , this is the paper presenting the CPMIP metrics.

https://www.researchgate.net/publication/312034731_CPMIP_Measurements_of_real_computational_performance_of_Earth_system_models_in_CMIP6

I think @macosta has a work document with some more information (I don't remember if I have access to).

In GitLab by @wuruchi on Apr 29, 2020, 17:20

Scheduled time is Ok with me.

I am going to take a look at the paper.

In GitLab by @gmontane on Apr 29, 2020, 18:13

mentioned in issue testing_suite#13

In GitLab by @wuruchi on Apr 29, 2020, 18:49

Hello @gmontane @mcastril

I have updated the GUI with the requested changes.

Times are now shown in datetime format.
Times are shown in bold for better visualization.
API documentation has been updated: https://earth.bsc.es/gitlab/wuruchi/autosubmitreact/-/wikis/Experiment-Summary

I wonder if I should also apply this format to the times shown in the Tree View.

In GitLab by @dbeltran on Apr 30, 2020, 09:26

Okay, I'll take a look at the paper next week

In GitLab by @macosta on Apr 30, 2020, 09:37

As Miguel was commenting, here you can find a summary only with the metrics:https://docs.google.com/document/d/12yWDwXsohf4G4MPeP6e3Eil4ZL-YeIN71dBcoWRliEg/edit

Probably take a look to the performance job of auto-ecearth could be also interesting because at the end we want to port those metrics already coded from this shell script to your autosubmit solution. You do not need to start from scratch ;)

In GitLab by @mcastril on May 8, 2020, 13:44

I wonder if I should also apply this format to the times shown in the Tree View.

Maybe yes. For very long jobs the number of minutes is not so direct.

In GitLab by @mcastril on May 8, 2020, 19:03

Hi all,

As we decided in the meeting, Autosubmit should directly compute the following metrics:

Model
Platform
SYPD
ASYPD
RSYMD
CHSPY

As @macosta said, the calculations can be looked in the performance metrics template in Auto-EC-Earth (to understand how they work).

In the Performance Metrics view in Autosubmit and Autosubmit GUI, another column for the ?YPD metrics translation to ?DPD (days per day) should be provided.

For the rest of the metrics, we have to figure out a mechanism to communicate their values from the Auto-Models to Autosubmit.

Finally, we also have to figure out how to store different runs of the same experiment and distinguish between different instances (create an internal ID after a new create?)

In GitLab by @pechevar on May 13, 2020, 09:52

mentioned in issue auto-ecearth3#990

In GitLab by @mcastril on May 29, 2020, 15:16

@wuruchi has announced the availability of some metrics trough an API:

https://earth.bsc.es/gitlab/es/testing_suite/issues/13#note_83930

I checked one experiment that I launched recently:

http://192.168.11.91:8081/performance/a2vp

The SYPD is very similar to the one provided by the EC-Earth runscript: 4.212 (API) vs 4.24 (script). This may be because Autosubmit counts some more seconds for the job to be terminated. This was a very short job, so the difference should be higher that in real simulations, and in any case it was only 1%.
Same for CHSY: 7111.17 against 7303. In this case it's a 2% difference, but we should compare using a longer experiment.
No comparison for the ASYPD, but it is reasonable taking into account that it was 3 hour queuing to only run for 10 minutes.

In GitLab by @mcastril on May 29, 2020, 15:19

@wuruchi , I am looking at t0e0 but I can only see up to chunk 89 in the API, while we have logs until chunk 90.

In GitLab by @wuruchi on May 29, 2020, 15:25

For some reason, chunk 90 appears near the end of the answer, but it is there.

http://192.168.11.91:8081/performance/t0e0

I will try to improve the order.

In GitLab by @mcastril on Jun 4, 2020, 11:13

As @wuruchi announced in https://earth.bsc.es/gitlab/es/testing_suite/issues/13#note_84522

There is a first implementation of the Performance Metrics in the GUI:

https://earth.bsc.es/autosubmitapp/experiment/a2s5

It will be showed tomorrow in the CES meeting.

In GitLab by @wuruchi on Jun 4, 2020, 11:57

It definitely requires much more work. Suggestions are welcome.

In GitLab by @mcastril on Jun 4, 2020, 12:08

@wuruchi , regarding your message about the rest of the metrics:

The energy consumption is available through Slurm. This is related with autosubmit#484 and it could be done at the same time, providing also that info in the job logs.
For the rest of the metrics, I think you can have a look at the Auto-EC-Earth Perf. Metrics template:

https://earth.bsc.es/gitlab/es/auto-ecearth3/-/blob/trunk/templates/common/performance_metrics.tmpl.sh

Some metrics as the RSYPD can be ported to AS.

In GitLab by @wuruchi on Jun 4, 2020, 12:11

Taking a look.

In GitLab by @wuruchi on Jun 4, 2020, 13:13

@mcastril @pechevar

From what I am seeing, I can infer that not all %SIM%.out files will include the performance metrics (for example energy consumption, ASYPD, SYPD, etc.) in the same format or might include only some of them. Is that correct? I guess it depends on the model the user is running his experiment on.

In GitLab by @mcastril on Jun 4, 2020, 13:39

Regarding energy consumption, ASYPD, SYPD... we can include them in all SIM jobs because this is kind of an standard. Maybe for MONARCH we can change to a SDPD, this can be triggered when the CHUNKSIZE is day instead of month or year.

There are other metrics that are more model specific and that we will need to create an interface between the auto-model and autosubmit.

In GitLab by @wuruchi on Jun 4, 2020, 13:41

Yep. I was looking at the auto-ecearth code and comparing it to experiments results and it seems pretty hard to get the values from the .out files in a reliable manner.

I guess we can start by adding a slurm request in Autosubmit that gets the energy consumption of the job independent of the model used.

In GitLab by @macosta on Jun 4, 2020, 13:52

@wuruchi for the other metrics, please take a look but actually we should discuss about the implementation. I think we can talk any moment next week once you have clear these metrics, then we can discuss about the implementation.

In GitLab by @wuruchi on Jun 4, 2020, 13:56

Agree, @macosta. I think the focus is gravitating more towards getting more information for Autosubmit. The actual implementation will be hold until the previous point is defined.

In GitLab by @pechevar on Jun 4, 2020, 17:10

Hi, in this https://earth.bsc.es/autosubmitapp/experiment/t01q experiment that I run every week more or less, I see

Parallelization: 768
SYPD: 14400
ASYPD: 14400
CHSY: 0

It is ok that the GUI show those numbers: 14400

and

Parallelization is SIM number of cores

Is it any reason to change the named ? It is only a question

In GitLab by @wuruchi on Jun 5, 2020, 08:22

Hello @pechevar

The 14400 bug happens when there are no SIM jobs. I will fix it in the next iteration.

Parallelization is SIM number of cores

Yes, it is. We adopted that name because that is how it is named in the provided documentation.

In GitLab by @wuruchi on Jul 13, 2020, 13:16

mentioned in commit autosubmit@c138dbdf0368504184c34b3d4c9cd7f9adf76809

In GitLab by @wuruchi on Jul 13, 2020, 13:20

mentioned in merge request autosubmit!183

In GitLab by @mcastril on Jul 14, 2020, 15:44

We have been discussing briefly about the next steps. Summarizing:

Most of the metrics in https://earth.bsc.es/gitlab/es/autosubmit/issues/524#note_80848 and the Memory bloat can be directly calculated in Autosubmit.
The remaining ones, which are Coupling Cost, Data output cost, Data intensity, that are more model dependent and more difficult to measure, we can rely in the current template and provide a way to interface this with Autosubmit, as we discussed previously.

For the metrics in 1) we have 2 main doubts:

Which parameter we should look for in sacct to get the memory bloat?
- Looking an old conversation between @macosta and David Vicente, he recommend to use AveRSS or MaxRSS.
- An alternative approach, not using Slurm, is to use the time command, but again this should be on the Auto-Model side. I think it is better to use Autosubmit if we can.
- The current implementation seems to rely in the epilog from the SIM job. This is written at the end of the job logs in Nord3 (I guess it uses MaxRSS), but not in MN4. I guess you only get this information in Nord3 @pechevar right?
The second doubt is about ASYPD and RSYPD. I always (or lately only because I just forgot) thought that ASYPD was RUN+QUEUE, and RSYPD was counting the full critical path. However, looking at the doc, ASYPD should also count stops and interruptions. RSYPD is not detailed in the documentation https://docs.google.com/document/d/12yWDwXsohf4G4MPeP6e3Eil4ZL-YeIN71dBcoWRliEg/edit and the implementation in the template doesn't help because it is just the same: https://earth.bsc.es/gitlab/es/auto-ecearth3/-/blob/trunk/templates/common/performance_metrics.tmpl.sh#L158 (I think we did this manually for CMIP6 @macosta ?).

In GitLab by @mcastril on Jul 14, 2020, 15:45

In the EC-Earth forum I found a deeper explanation of ASYPD/RSYPD by @macosta :

https://dev.ec-earth.org/issues/532

We are dividing this in two metrics:

ASYPD is the same that Philippe is suggesting, including queue and run time for the EC-Earth run script, but we are also including queue and run time for post processing, so we understand that it is a value indicating the complete process to generate outputs of 1 year. We are using Autosubmit logs to take an accumulative total time of Run jobs and the last Post, assuming that the others are running in parallel. Philippe approach should be feasible for different platforms, but some specific development would needed for each one. Another question appears here for our BSC approach, we are not including the resources of postprocessing for metrics as CHPSY or parallelization...

RSYPD is measuring the start date of the submission of the first job and the final date of the last job. This means that we are including the time needed when some job fails and the experiment stops, waiting for someone to rerun a particular chunk for example (quota problems, nodes failing...) you can surprise about how different are ASYPD and RSYPD. We are using Autosubmit logs.

In GitLab by @mcastril on Jul 14, 2020, 15:51

So some ideas here:

For ASYPD, if we counted all the CHUNK jobs depending on SIM, it should be very easy to do and generalizable. Of course we can count only the POST, but then it will not be usable by other models. Given that EC-Earth3 cmorizes the data in two independent jobs (CMORATM and CMOROCE) before the POST, I think that both CMOR jobs should be count as POST. In that case we could count all the remote jobs after SIM (CLEAN and TRANSFER run in the transfer nodes).
The RSYPD is not much complex after we define the ASYPD. What do we understand as "final date"? If it is the date of the POST, we should do the same as what we decide to do in the ASYPD.

In GitLab by @macosta on Jul 14, 2020, 15:59

Thank you Miguel for the research work! I will start for the ASYPD/RSYPD discussion.

Here we are improving the community definition so we have not relay on other papers. Our agreement should to include in ASYPD those jobs needed only to run and postprocess (prepare) the outputs. It could be CMOR jobs too but I would not introduce CLEAN and transfer since it is more machine dependent.

RSYPD as you say it is easier once we define ASYPD, which should include the complete workflow and the interruptions/variability of a machine.

Other thing that we could do it is to include an additional one. I mean, ASYPD could be CMOR+RUN+POST (only jobs to produce outputs). RSYPD the complete workflow including clean, transfers and other stuff. Finally, we could include something like FSYPD (Final SYPD), which should measure the time from the beginning to the end to include interruptions and variability.

What do you think?

In GitLab by @macosta on Jul 14, 2020, 16:03

COupling cost should be included by Sergi implementation (LUCIA) and probably provided by the automodel output as Data Ouput cost and Data intensity as you say.

For memory bloat, I agree that we could use MaxRSS if it is possible.

Finally I have an additional question, are you taking into account those metrics that we included in the STATS job and where they will be calculated and saved? For example, the execution time for each job separately, Max/MIN/AVG time for jobs as SIM or POST...

In GitLab by @mcastril on Jul 14, 2020, 16:17

Here we are improving the community definition so we have not relay on other papers. Our agreement should to include in ASYPD those jobs needed only to run and postprocess (prepare) the outputs. It could be CMOR jobs too but I would not introduce CLEAN and transfer since it is more machine dependent.

Perfect. As we run CLEAN and TRANSFER in local (or data transfer) nodes, they should not be counted.

Other thing that we could do it is to include an additional one. I mean, ASYPD could be CMOR+RUN+POST (only jobs to produce outputs). RSYPD the complete workflow including clean, transfers and other stuff. Finally, we could include something like FSYPD (Final SYPD), which should measure the time from the beginning to the end to include interruptions and variability.

I think it is important to count the transfers in some way, because in the end is part of the workflow and users normally work with the data locally. I don't know if this is for all the institutions (maybe some use the data remotely?).

What I find more difficult is to automatically identify what are post-processing and data-transfer jobs. We can consider that anything running in the cluster after the SIM is POST, as we said previously. But the rest of the jobs can combine clean, transfer, archive, diagnostics... We have to give it a think.

For Auto-MONARCH, in the DA workflow they are now running REDUCE + CALC_STATS remotely (that would be the POST) and then they run ARCHIVE (that would be the TRANSFER). But I think that sometimes REDUCE or CALC_STATS are run offline.

In GitLab by @mcastril on Jul 14, 2020, 16:19

@gmontane is the one that knows better about Auto-MONARCH and how we could converge.

If we don't find a way to do it automatically, we could maybe create an optional parameter in the jobs.conf , to tag any section (job) as SIM, POST or TRANSFER. In that case they would not be forced to call SIM to the simulation either.

In GitLab by @mcastril on Jul 14, 2020, 16:21

Finally I have an additional question, are you taking into account those metrics that we included in the STATS job and where they will be calculated and saved? For example, the execution time for each job separately, Max/MIN/AVG time for jobs as SIM or POST...

@wuruchi created a distributed database that will start to record that values for every job in the workflow. So yes.

https://earth.bsc.es/gitlab/es/autosubmit/issues/553

In GitLab by @gmontane on Jul 14, 2020, 18:53

In auto-monarch the REDUCE job is always run, and the CALC_STATS is only run in the DA case.

We can consider that anything running in the cluster after the SIM is POST

Both jobs can be run either in MN4 or Nord3 (as well as the POST job in auto-ecearth I guess), so I think that looking at the platform is not enough as it can be different than the one where the SIM runs. Right now I can't think of an easy common solution to know which are the POST jobs without adding more info in the jobs definition.

BSC-ES / autosubmit-gui

Performance metrics in Autosubmit/GUI #27