LLNL / ldms-plugins-llnl

Miscellaneous LDMS plugins from LLNL

The meaning of read_bytes_sum and write_bytes_sum for job stats plugin #5

Closed wooloo1121 closed 3 years ago

wooloo1121 commented 3 years ago

Hi,

Thank you in advance for reading this.

We're trying to analyze the data collected by the job stats samplers and are a little confused. As far as I can see, data are collected for each job and each OST. I assumed read_bytes_sum and write_bytes_sum are the total bytes read/written so far, so the values should always increase. But that is not the case: a value can drop from a larger number to a smaller one, or even to zero. So what do these values mean?

Thank you and look forward to your reply.

morrone commented 3 years ago

I would recommend checking the jobstats files on an OST and verifying that the ldms plugin is correctly reporting what it finds. Note that in lustre stats files (including jobstats), some fields have an additional three numbers at the end of the line, and some do not. For the ones that do not, the value in the ldms metric will be the first number on the line (in column 2). For lines that have the additional three number fields, the ldms metric value will come from the very last number on the line (the third of the three additional fields).

That will help us determine if there is a bug in the plugin.
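To make the rule above concrete, here is a minimal Python sketch of how one might pick the expected value from a single stats line when comparing by hand. It is not the plugin's actual code. The function name `metric_value` and the sample lines are hypothetical, and the sketch assumes a whitespace-separated layout in which the optional three extra numbers are plain non-negative integers at the end of the line.

```python
def metric_value(line: str) -> int:
    """Return the value expected in the LDMS metric for one stats line.

    Assumed layout (illustrative only): whitespace-separated fields, with
    the name in column 1 and the first number in column 2. Some lines end
    with three additional integer fields; others do not.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"not a stats line: {line!r}")

    # If the line ends with three integer fields beyond the leading
    # columns, the metric comes from the very last one.
    tail = tokens[-3:]
    if len(tokens) >= 5 and all(t.isdigit() for t in tail):
        return int(tokens[-1])

    # Otherwise the metric is the first number on the line (column 2).
    return int(tokens[1])


# Hypothetical sample lines, not copied from a real server.
with_extra = "read_bytes 10 samples [bytes] 4096 1048576 5242880"
without_extra = "open 5 samples [reqs]"

print(metric_value(with_extra))
print(metric_value(without_extra))
```

If the value computed this way from the file on the OST matches what the sampler reports at the same moment, the plugin is most likely passing the data through unchanged.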

If the plugin is working, one thing to look into is how long the lustre server is configured to cache data for a particular job ID. This information is stored in memory, so the jobstats can't necessarily be stored forever. At some point the server will discard jobstats for jobs that haven't seen any activity in some configured period of time.

So let's consider a job that only does I/O to a particular OST once every 2 hours. If the OST only caches jobstats data for 1 hour, then after an hour the stats will be thrown away, and the next time the job performs I/O the counters will all have started over at zero.
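For analysis, one way to cope with counters that start over is to treat any decrease as a reset and work with per-interval deltas instead of raw values. The sketch below is an assumption about how a consumer of the data might do that, not something the plugin provides. The function name `deltas_with_resets` and the sample values are made up for illustration.

```python
from typing import List


def deltas_with_resets(samples: List[int]) -> List[int]:
    """Convert cumulative counter samples into per-interval deltas.

    Assumption: a sample smaller than its predecessor means the server
    discarded the job's stats and the counter restarted from zero, so the
    new value itself is taken as the delta for that interval.
    """
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            deltas.append(cur - prev)
        else:
            # Counter went backwards: treat as a reset to zero.
            deltas.append(cur)
    return deltas


# Made-up write_bytes_sum samples for one job on one OST.
samples = [100, 250, 400, 0, 50, 30]
deltas = deltas_with_resets(samples)
print(deltas)       # [150, 150, 0, 50, 30]
print(sum(deltas))  # 380
```

Totals computed this way are only a lower bound: any I/O that happened after the last sample before a reset, but before the stats were discarded, is never observed.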

Let me know what you find!

wooloo1121 commented 3 years ago

@morrone Thank you so much for the reply. We checked, and the cause appears to be the Lustre server configuration.