GSI-HPC / lustre_exporter

Prometheus exporter for use with the Lustre parallel filesystem
GNU General Public License v3.0

Querying job_stats on large cluster(s) takes multiple seconds #30

Open mjurksa opened 1 year ago

mjurksa commented 1 year ago

Hello,

we have identified a performance issue with the lustre_exporter when querying metrics on large Lustre file systems with a significant number of jobstats. The problem seems to be related to the procfs.go code repeatedly accessing the same job_stats file in procfs, resulting in a delay of 4-5 seconds per query. I think it should be possible to open the file once and scan each line for the needed information, although I'm not very well versed in Go and don't know whether this would require a significant refactor of the code. Is this issue known? Are there any workarounds for it?
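For illustration, here is a minimal single-pass sketch in Go (not the exporter's actual procfs.go code; the path and the parsing below are simplified assumptions) that opens a job_stats file once and groups its lines by job ID, so each metric could then be taken from the in-memory map instead of re-reading the file:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseJobStats reads one job_stats file a single time and groups its raw
// stat lines by job ID. The real file is YAML-like; this sketch only splits
// on the "- job_id:" markers to illustrate the single-pass idea.
func parseJobStats(path string) (map[string][]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string][]string) // job ID -> raw stat lines
	var current string
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // job_stats files can be large
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "- job_id:"):
			current = strings.TrimSpace(strings.TrimPrefix(line, "- job_id:"))
		case current != "" && line != "":
			stats[current] = append(stats[current], line)
		}
	}
	return stats, sc.Err()
}

func main() {
	// Example path only; on an OSS the files live under
	// /proc/fs/lustre/obdfilter/<target>/job_stats.
	stats, err := parseJobStats("/proc/fs/lustre/obdfilter/lustre-OST0000/job_stats")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("parsed %d jobs\n", len(stats))
}
```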

gabrieleiannetti commented 1 year ago

Hi mjurksa,

how did you measure the delay of the metrics processing of the jobstats?

Which Lustre version and Lustre exporter version do you use?

Can you provide an overview of the jobstats count of your Lustre filesystem, so we can better compare? Since querying Prometheus for jobstats can quickly become impractical because of the high data point count, we could compare some snapshot numbers taken on the servers, as follows:

Currently our max count for an OSS is 10k jobids with a scrape time of 1-2s, and for an MDS it is 8.5k jobids with a scrape time of 3-5s. Just to have some numbers, even though these can surely change...

Checking the PromQL query scrape_duration_seconds{job="lustre"} in Prometheus, we see an average scrape duration of 2 seconds for the Lustre exporter. The slowest Lustre exporter runs on one of our MDS, with peaks of up to 14 seconds, where hardly any correlation to the metadata operation count in that time interval can be seen. I have cross-checked another MDS with more metadata operations, constantly at ~150k/s over multiple hours; there the Lustre exporter only had a runtime peak of 8 seconds at the start and then settled at roughly 3 seconds. The runtimes on the file servers are slightly better than on the MDS and do not show such peaks.

Indeed, the files are re-opened and re-read for each metric across multiple OSTs.
But those files are located under /proc/fs, which is kept in memory...

Yes, a workaround to just open and read the file once should not be a big deal.

Regardless of the implementation, I would consider decreasing the Lustre parameter job_cleanup_interval. We have set it to 300 for a 5-minute auto-clean interval.

We have set the exporter's scrape interval to 30 seconds.

Best Gabriele

mjurksa commented 1 year ago

Hey,

We have around 15,000-20,000 jobs running daily, depending on the day of the week and holidays. We have compiled the latest release of the lustre_exporter and are using lustre-server version 2.12.8. The job_cleanup_interval is set to 300 in our case, and the scrape interval is set to 60 seconds. If the files are kept in memory under /proc/fs, that would imply that interpreting the data multiple times is the issue. Still, it would be a huge improvement for us if the files were processed once instead of multiple times.

We measured the performance like this:

```
# time curl localhost:9169/metrics >/dev/null
real    0m4.519s
user    0m0.024s
sys     0m0.005s
```

Regards, Mantautas

gabrieleiannetti commented 1 year ago

Hi Mantautas,

I do not see the problem: if the scrape interval is set to 60 seconds and the exporter has a runtime of a few seconds, that is more than sufficient in your case.

Best Gabriele

mjurksa commented 1 year ago

Hello Gabriele,

well, yes, it works, but that is a waste of resources imho. We did some testing on an OSS where we have a lot of jobstats, and the query took about 11 seconds. If we had a failure on the partner OSS and its resources moved over, that time would likely double, or at least increase. The other issue is that if an OSS is carrying its partner's resources, it is already consuming a lot of resources, so saving resources would help exactly in that case. For small clusters this is a non-issue, but in our case, having >100 OSS with this large number of jobs means we potentially waste performance.

FYI: we also did some profiling, and it looks like 50% of the runtime is spent in the regex operations for jobstats.
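For reference, one common way to get such a CPU profile from a Go exporter (a sketch of the general technique, not necessarily how it was measured here) is to expose the standard net/http/pprof handlers, for example on a separate port, and sample while a scrape is running:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Hypothetical profiling-only listener; port 6060 is just an example.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A profile captured with go tool pprof http://localhost:6060/debug/pprof/profile during a scrape would then show how much CPU time ends up in regexp matching.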

Regards, Mantautas

gabrieleiannetti commented 1 year ago

Hi Mantautas,

what resources do you mean exactly, e.g. CPU utilization? File servers are usually not very CPU bound...

Can you give some numbers on the count of jobstats blocks to be processed? For instance, we are currently processing at most 8-10k jobstats blocks at peak on our file servers...

You have an OSS failover pair; in what way does the exporter impact the time to move resources from one OSS to the other?

Sure, with an increasing amount of jobstats to be processed, the exporter runtime will increase.
But I still do not see the critical aspect of wasted performance.

Do not get me wrong, I agree with you: wasting resources can hurt performance and is itself a cost. But a few seconds of performance gain versus the time needed to investigate, improve, and test the exporter is a tradeoff that has to be considered, and at the moment there is no critical issue on this from our side.

Interesting, how exactly did you profile the regex operations of the jobstats?
Indeed, 50% is huge. If that is the case, improving it could be considered.

Best Gabriele

adilger commented 10 months ago

You have an OSS failover pair; in what way does the exporter impact the time to move resources from one OSS to the other?

I think the expectation is that if there are twice as many OSTs running on an OSS due to failover, then there are twice as many job_stats files, and the parsing will take twice as long.

adilger commented 10 months ago

Note that it is also possible to consume the job_stats file in a more efficient manner if there are lots of jobs accumulating in the cache. Jobs will live in the cache longer than job_cleanup_interval if no consumer is reading the job_stats file, so if there is only a single consumer of the job_stats cache, then it should be possible to shorten job_cleanup_interval to be equal to (or slightly shorter than) the scrape interval. Then, the first time job_stats is read, any job that has been idle longer than job_cleanup_interval (snapshot_time < now - job_cleanup_interval) will be dropped from the cache. That should be OK if the consumer is processing the stats itself for each interval by taking the difference of the stats between scrapes, unless the stats are smaller (which indicates that the job was dropped from the cache).
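A sketch of the consumer-side bookkeeping described above (the names and the single counter are illustrative, not the exporter's actual code): keep the previously observed value per job and treat a smaller value as a cache eviction rather than a negative increment.

```go
package main

import "fmt"

// jobDeltas tracks the last observed value of one counter (e.g. write_bytes)
// per job ID, so per-scrape increases can be computed by the consumer.
type jobDeltas struct {
	prev map[string]uint64
}

func newJobDeltas() *jobDeltas {
	return &jobDeltas{prev: make(map[string]uint64)}
}

// delta returns the increase since the previous scrape for one job.
func (d *jobDeltas) delta(jobID string, current uint64) uint64 {
	last, seen := d.prev[jobID]
	d.prev[jobID] = current
	if !seen || current < last {
		// New job, or the job was dropped from the job_stats cache and
		// re-created: count the whole current value as the increase.
		return current
	}
	return current - last
}

func main() {
	d := newJobDeltas()
	fmt.Println(d.delta("job42", 100)) // first scrape: 100
	fmt.Println(d.delta("job42", 150)) // next scrape: 50
	fmt.Println(d.delta("job42", 30))  // smaller value: treated as eviction, 30
}
```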

Another alternative instead of waiting for job_cleanup_interval to automatically expire the jobs, is to clear out individual jobs from the cache explicitly by writing the job name into the respective job_stats file. This will drop the job from the cache immediately instead of having to re-process it on each interval for the next five or ten minutes. Note that reading the stats and then dropping them is a bit racy (there may have been some new RPCs arriving for that job after job_stats is read), so it would be prudent to only do this for jobs that have been idle for at least a short time (some seconds), and to drop each job as quickly as possible after it is read.
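A hedged sketch of how a Go consumer could drop a single job right after reading its stats, assuming (as described above) that writing the job ID into job_stats removes that entry; the path and job ID are placeholders:

```go
package main

import (
	"fmt"
	"os"
)

// dropJob removes one job from the job_stats cache by writing its job ID back
// into the job_stats file. This should only be done for jobs that have been
// idle for at least a few seconds, immediately after their stats were read.
func dropJob(jobStatsPath, jobID string) error {
	f, err := os.OpenFile(jobStatsPath, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, jobID)
	return err
}

func main() {
	// Placeholder path and job ID for illustration only.
	if err := dropJob("/proc/fs/lustre/obdfilter/lustre-OST0000/job_stats", "job42"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```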

At worst, if a job became active immediately after the stats were read, the amount of stats lost would be ((drop_time - read_time) / scrape_interval). In the non-optimal case (i.e. processing all stats then clearing them), the drop_time - read_time = scrape_time instead of dropping the stats as soon as they are read. However, if there are fewer jobs to be processed each time, then the scrape time will be reduced significantly, presumably by a factor of (job_cleanup_interval / scrape_interval) (e.g. 600s / 30s = 20x) since there will be proportionately fewer unused jobs in the cache, so scrape times could be in the sub-second range and the maximum amount of stats that could be lost would be < 1%.

A more complicated mechanism would be something like "write job_name:snapshot_time to the job_stats file" (probably with some better separator than :, maybe an unprintable control character that cannot appear in the job name but can still be generated by the shell), and only drop the job from the cache if snapshot_time has not changed. That would also need changes in the Lustre code (see lprocfs_jobstats_seq_write()), but it wouldn't be difficult to implement if it would improve the performance substantially.