catapult-project / catapult

Deprecated Catapult GitHub. Please instead use http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

[📍] Jobs stopped running #4435

Open dave-2 opened 6 years ago

dave-2 commented 6 years ago

https://pinpoint-dot-chromeperf.appspot.com/job/11c5879cc40000

Saw a cluster of these around 4/10/18, 11:30 am PDT. The log shows:

Request was aborted after waiting too long to attempt to service your request.

I think this could happen if the task queue rate is set too low. Right now the task queue rate is set at 1/s, so, for example, with 60 jobs each one polls only about once per minute, and with more than 60 jobs each one polls even less often.
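As a rough check on that arithmetic, here is a minimal sketch. It assumes the queue's rate is shared evenly across all active jobs and that each poll is one task; `poll_interval_seconds` and its parameters are hypothetical names for illustration, not anything in Pinpoint.

```python
def poll_interval_seconds(active_jobs: int, queue_rate_per_second: float = 1.0) -> float:
    """Approximate seconds between polls for a single job.

    Assumption: every active job gets an equal share of the queue's
    dispatch rate, and each poll consumes exactly one task.
    """
    if queue_rate_per_second <= 0:
        raise ValueError("queue rate must be positive")
    return active_jobs / queue_rate_per_second


if __name__ == "__main__":
    # At 1 task/s, the per-job interval grows linearly with the job count.
    for jobs in (30, 60, 120):
        print(jobs, "jobs ->", poll_interval_seconds(jobs), "s between polls")
```

Under these assumptions, raising the rate shortens the interval proportionally, so 120 jobs at 4 tasks/s would each poll roughly every 30 seconds.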

@anniesullie

dave-2 commented 6 years ago

What we really want is PubSub (#3900)

simonhatch commented 6 years ago

Hmm, why would the queue rate limit cause these requests to abort? Wouldn't they just sit in the queue until they're allowed to run?

dave-2 commented 6 years ago

I don't know. Do tasks sit in the task queue indefinitely? That seems unlikely, but I didn't find anything in the documentation about task expiration. If we continue seeing this error, we can increase the retry count as well.

simonhatch commented 6 years ago

I don't know about indefinitely. I don't recall seeing any kind of expiry on them either, and I know that in migrations we can end up with tens of thousands of tasks that may sit in the queue for hours or more.

This looks pretty similar to #4408. Yeah, I'd think adding some retries would probably help if we're still getting this error later.
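For reference, the retry idea could look something like the sketch below. This is a generic retry-with-backoff helper, not Pinpoint's code and not the task queue's own retry mechanism; `call_with_retries`, `max_attempts`, and `base_delay_seconds` are hypothetical names, and the exponential backoff is an assumption rather than something proposed in this thread.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay_seconds=1.0, sleep=time.sleep):
    """Call operation(), retrying on any exception up to max_attempts times.

    Waits base_delay_seconds, then double that, and so on between attempts.
    The last failure is re-raised so the caller still sees the error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

A real handler would likely catch only the specific transient error rather than every exception, so that genuine bugs are not retried.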