apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
1.06k stars 445 forks source link

Improve/standardize rate limiting logic in Monitor #4894

Open DomGarguilo opened 2 weeks ago

DomGarguilo commented 2 weeks ago

This PR uses Guavas Suppliers.memoizeWithExpiration() for caching, replacing the manual caching logic. This allows us to:

DomGarguilo commented 2 weeks ago

Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.

That is a nice simplification.

Yea definitely nice to be able to remove some code. Wasn't sure if this would be an issue though since before, the maps would contain values that would only be aged off after they were 15 minutes old. And now, since the data is recreated once every minute, all of the returned data will only ever be that old. Not too sure what the implications of this are or what the old values were being used for.

keith-turner commented 2 weeks ago

Wasn't sure if this would be an issue though since before, the maps would contain values that would only be aged off after they were 15 minutes old.

I misread that code when I was reviewing. I thought they were aging off after 15 milliseconds, so did not think it was an issue. But its actually 15 minutes. It would be good to understand that change in behavior a bit better. So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

DomGarguilo commented 1 week ago

So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

Yea, so after looking at the code a bit more I think that the purpose of the age-off logic was to remove entries whose server was no longer present. The code that collects this data iterates through each server returned from context.instanceOperations().getTabletServers() and uses the HostAndPort of that server as the key in the map. So if there is new data for that server, the old data will just be replaced, but in the case where there is some data for a server and then that server dies, then it will not be replaced by anything. So after 15 mins, if the server did not come back up, those entries will be aged off. It seems like we want to keep this functionality so I'm going to work on adding this age-off functionality back.

keith-turner commented 1 week ago

@DomGarguilo for these these dead servers that hang around in the monitor for 15mins, do you know if their last contact time continues to increase in monitor display?

DomGarguilo commented 1 week ago

@DomGarguilo for these these dead servers that hang around in the monitor for 15mins, do you know if their last contact time continues to increase in monitor display?

Okay so i was a bit mistaken. The info for servers that have dies did stick around in the map and then it depends on what the calling code does with it. For example, ScansResource sets up the rest endpoint for scans and it filters the data to only return scans from tservers that are alive, where as for scan server scans, it will return all of the data in the map. This means that if a scan server dies, as long as the 15 min age-off window has not passed, scans that were running on that scan server will still be displayed. This is just one example of how this potentially stale data is used in the monitor. I am taking a look at the other usages of this data that is kept around. I am not sure why we want to keep this data around on the monitor for scan servers but not for tservers.

Edit:

I am also seeing this scenario where I start a script that scans a table on a running scan server, then once that scan appears on the active scans page in the monitor, I kill the scan server and eventually it looks like that scan from the script starts running on the tserver. At that point the active scans page shows two active scans, one on the dead scan server and that same scan on a tserver. This seems like a bug to me.

DomGarguilo commented 1 week ago

Regarding my previous comment:

I looked into the other spots where data was being pulled from these maps in the Monitor code (tserverScans, sserverScans, and allCompactions). Unless I am missing something, the only one that seems to use this stale data before it is aged off is the scan servers scenario outlined above. This leads me to believe that we might not need to keep this age off functionality unless it is deemed necessary for the scan server scans info.