apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Add metrics to track time compaction jobs are queued #4945

Open keith-turner opened 1 week ago

keith-turner commented 1 week ago

Is your feature request related to a problem? Please describe.

Knowing how long compaction jobs are queued would be useful for managing and tuning compactions in Accumulo and for detecting problems. Two metrics would be useful: one is the time a job spends in the queue between addition and removal, the other is statistics about the age of jobs currently in the queue. For example, if compactions are keeping up but jobs are spending on average 3 minutes in the queue, that would indicate some sort of problem that needs investigation. Ideally, when compactions are keeping up, the average time in the queue should be low.

Describe the solution you'd like

To implement this, compaction jobs themselves would not be the best place to attach a timer. This is because the system continually scans tablets looking for compaction jobs, and when it finds jobs for a tablet it replaces anything queued for that tablet with the new jobs. Therefore it would probably be better to track age-related statistics at the tablet level instead of the compaction job level.
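To make the replacement problem concrete, here is a small illustration, assuming each replacement job would carry a fresh timestamp. The class and method names are hypothetical and are not Accumulo code; times are plain seconds and `rescanTimes` is assumed to be non-empty.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Accumulo classes.
class ReplacementDemo {

  // Per-job timing: every rescan replaces the queued job, so the timestamp
  // attached to the job is reset each time.
  static long perJobAge(long[] rescanTimes, long pollTime) {
    long enqueuedAt = 0;
    for (long t : rescanTimes) {
      enqueuedAt = t; // the replacement job starts with a fresh timestamp
    }
    return pollTime - enqueuedAt;
  }

  // Per-tablet timing: the start time is recorded once for the tablet and
  // survives replacement of its queued jobs.
  static long perTabletAge(long[] rescanTimes, long pollTime) {
    Map<String, Long> startByTablet = new HashMap<>();
    for (long t : rescanTimes) {
      startByTablet.putIfAbsent("tablet1", t); // only the first rescan sets it
    }
    return pollTime - startByTablet.get("tablet1");
  }
}
```

With rescans at 0, 60 and 120 seconds and a poll at 180 seconds, the per-job age only covers the time since the last replacement, while the per-tablet age covers the whole time the tablet had work queued.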

A possible place to add this tracking would be CompactionJobPriorityQueue, and the following is a possible beginning of an implementation.

class CompactionJobPriorityQueue {
  // One timer per tablet that currently has jobs queued
  Map<KeyExtent, Timer> ages;

  void addJobToQueue(KeyExtent extent, CompactionJob job) {
    // Only start a timer if the tablet is not already tracked, so that
    // replacing a tablet's queued jobs does not reset its age
    ages.computeIfAbsent(extent, e -> Timer.startNew());
  }

  void addTimeInQueueStat(Duration elapsed) {
    // TODO update a Timer meter?  is that the best meter to use?
  }

  CompactionJob poll() {
    // extent here is the tablet of the job being polled
    var timer = ages.get(extent);
    addTimeInQueueStat(timer.elapsed());
    timer.restart();
    // TODO remove from ages if the number of jobs for this tablet goes to zero
  }

  void clear() {
    // not sure what to do about this, seems like clearing ages map will throw off state
  }

  // This could be used for the oldest in queue stat
  Duration getOldestInQueue() {
    // TODO return max elapsed time in the ages.values() to be used with a Gauge meter
    // TODO is there a better meter to use for the ages of things sitting in the queue?
  }
}
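For anyone who wants to experiment with the idea outside of Accumulo, below is a minimal, self-contained sketch of the per-tablet age tracking described above. It is not the actual implementation: `QueueAgeTracker`, its method names, the `remainingJobs` parameter and the injected nanosecond clock are all assumptions made for illustration, and it uses plain start timestamps in place of Accumulo's `Timer`. How `clear()` should behave is still an open question; dropping all tracked ages is just one option.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Accumulo. K stands in for KeyExtent.
class QueueAgeTracker<K> {
  private final Map<K, Long> startNanos = new HashMap<>();
  private final LongSupplier nanoClock; // e.g. System::nanoTime, injectable for tests

  QueueAgeTracker(LongSupplier nanoClock) {
    this.nanoClock = nanoClock;
  }

  // Called when jobs are queued for a tablet. An existing start time is kept,
  // so replacing a tablet's queued jobs does not reset its age.
  synchronized void jobsQueued(K extent) {
    startNanos.putIfAbsent(extent, nanoClock.getAsLong());
  }

  // Called when a job for the tablet is polled. Returns the time the tablet
  // waited, then restarts its clock, or stops tracking it if no jobs remain.
  synchronized Duration jobPolled(K extent, int remainingJobs) {
    long now = nanoClock.getAsLong();
    Long start = startNanos.get(extent);
    if (start == null) {
      return Duration.ZERO; // tablet was not tracked
    }
    if (remainingJobs > 0) {
      startNanos.put(extent, now);
    } else {
      startNanos.remove(extent);
    }
    return Duration.ofNanos(now - start);
  }

  // Largest age among tracked tablets; a candidate value for a gauge.
  synchronized Duration oldestInQueue() {
    long now = nanoClock.getAsLong();
    long oldest = 0;
    for (long start : startNanos.values()) {
      oldest = Math.max(oldest, now - start);
    }
    return Duration.ofNanos(oldest);
  }

  // One possible choice (an assumption): forget all tracked ages.
  synchronized void clear() {
    startNanos.clear();
  }
}
```

The duration returned by `jobPolled` is what would feed the time-in-queue meter, and `oldestInQueue` could be sampled for the age-of-queued-jobs stat. Passing the clock in keeps the sketch deterministic under test; in real use it would be constructed with `System::nanoTime`.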

Additional context

This could help assess the need for changes related to #4664 and the improvement those changes might make.

cshannon commented 1 week ago

I can take a look at this