catapult-project / catapult

Deprecated Catapult GitHub. Please instead use http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

[📍] Job went off the rails #4393

Open dave-2 opened 6 years ago

dave-2 commented 6 years ago

https://pinpoint-dot-chromeperf.appspot.com/job/12bd964f440000
Error: The request failed with status code: 500

@simonhatch and I dug into the job a bit, and it looks like it has 2 Changes, each with 27590 Attempts.

dave-2 commented 6 years ago

Looks like there's an unusual edge case when the first few Attempts produce values, and then all the subsequent Attempts fail. The Job keeps running, hoping more Attempts will eventually produce enough values to establish confidence.

One discrepancy here is that _CompareValues() uses the number of values, not the number of attempts, to determine when to stop running. Just adjusting that number is probably enough here.
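
To make the edge case concrete, here is a minimal sketch of the two stopping rules. The helper names and the target count are hypothetical and only for illustration; this is not Pinpoint's actual _CompareValues() logic. Failed attempts are modeled as None.

```python
# Illustrative only: these helpers are hypothetical, not Pinpoint's real code.

def successful_values(attempts):
    """Return the values from attempts that succeeded (failures are None)."""
    return [value for value in attempts if value is not None]


def keep_running_by_values(attempts, target_count):
    """Stopping rule keyed on values: never satisfied if attempts keep failing."""
    return len(successful_values(attempts)) < target_count


def keep_running_by_attempts(attempts, target_count):
    """Stopping rule keyed on attempts: bounded even when every attempt fails."""
    return len(attempts) < target_count


# Three attempts produce values, then every later attempt fails.
attempts = [1.0, 1.1, 0.9] + [None] * 500

print(keep_running_by_values(attempts, 10))    # True: would schedule more work
print(keep_running_by_attempts(attempts, 10))  # False: the cap has been reached
```

With the value-based rule, the count of usable values stays at three no matter how many attempts are added, so the job never stops on its own.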

simonhatch commented 6 years ago

Managed to pull out a list of jobs from datastore that look like they were affected by this. I've deployed a new version of Pinpoint; I'll go over these and restart them.

[
  {
    "bug_id": 821024, 
    "attempts": 61350, 
    "job_id": "16c0a540c40000"
  }, 
  {
    "bug_id": 821411, 
    "attempts": 60620, 
    "job_id": "14f0e920c40000"
  }, 
  {
    "bug_id": 828412, 
    "attempts": 52450, 
    "job_id": "16f225e0c40000"
  }, 
  {
    "bug_id": 828411, 
    "attempts": 60940, 
    "job_id": "15b5603f440000"
  }, 
  {
    "bug_id": 828396, 
    "attempts": 62090, 
    "job_id": "14d8e580c40000"
  }, 
  {
    "bug_id": 828020, 
    "attempts": 59110, 
    "job_id": "11e9eabf440000"
  }, 
  {
    "bug_id": 828286, 
    "attempts": 15360, 
    "job_id": "12bcd990c40000"
  }, 
  {
    "bug_id": 828281, 
    "attempts": 58580, 
    "job_id": "16f24340c40000"
  }, 
  {
    "bug_id": 828025, 
    "attempts": 21520, 
    "job_id": "118035cf440000"
  }, 
  {
    "bug_id": 821024, 
    "attempts": 60790, 
    "job_id": "16977b00c40000"
  }, 
  {
    "bug_id": 818689, 
    "attempts": 61390, 
    "job_id": "1590d23f440000"
  }, 
  {
    "bug_id": 828025, 
    "attempts": 21700, 
    "job_id": "11905b9f440000"
  }, 
  {
    "bug_id": 828021, 
    "attempts": 51510, 
    "job_id": "14e88080c40000"
  }, 
  {
    "bug_id": 828017, 
    "attempts": 61350, 
    "job_id": "1492e73f440000"
  }, 
  {
    "bug_id": 828014, 
    "attempts": 55180, 
    "job_id": "12bd964f440000"
  }, 
  {
    "bug_id": 828011, 
    "attempts": 54540, 
    "job_id": "11d623cf440000"
  }, 
  {
    "bug_id": 821894, 
    "attempts": 47310, 
    "job_id": "12f72ecb440000"
  }, 
  {
    "bug_id": 825105, 
    "attempts": 54510, 
    "job_id": "1417f43d440000"
  }, 
  {
    "bug_id": 821024, 
    "attempts": 13620, 
    "job_id": "15ad39e3440000"
  }, 
  {
    "bug_id": 824180, 
    "attempts": 59710, 
    "job_id": "16d140b3440000"
  }, 
  {
    "bug_id": 821413, 
    "attempts": 60660, 
    "job_id": "16c95123440000"
  }, 
  {
    "bug_id": 825600, 
    "attempts": 16760, 
    "job_id": "17e7ab53440000"
  }, 
  {
    "bug_id": 818618, 
    "attempts": 58630, 
    "job_id": "1597413d440000"
  }, 
  {
    "bug_id": 825601, 
    "attempts": 17210, 
    "job_id": "148f2e33440000"
  }, 
  {
    "bug_id": 814225, 
    "attempts": 31850, 
    "job_id": "15c77505440000"
  }, 
  {
    "bug_id": 821110, 
    "attempts": 46500, 
    "job_id": "14927319440000"
  }, 
  {
    "bug_id": 820098, 
    "attempts": 57230, 
    "job_id": "14c585b9440000"
  }, 
  {
    "bug_id": 821396, 
    "attempts": 56630, 
    "job_id": "12e57211440000"
  }, 
  {
    "bug_id": 814222, 
    "attempts": 36220, 
    "job_id": "12a68231440000"
  }, 
  {
    "bug_id": 814704, 
    "attempts": 49070, 
    "job_id": "149cf8a1440000"
  }, 
  {
    "bug_id": 821772, 
    "attempts": 58610, 
    "job_id": "148504de440000"
  }, 
  {
    "bug_id": 806824, 
    "attempts": 56780, 
    "job_id": "12b35d76440000"
  }, 
  {
    "bug_id": 821410, 
    "attempts": 58730, 
    "job_id": "12bf058e440000"
  }, 
  {
    "bug_id": 821019, 
    "attempts": 36560, 
    "job_id": "16e7f076440000"
  }, 
  {
    "bug_id": 820467, 
    "attempts": 19260, 
    "job_id": "149b9a96440000"
  }, 
  {
    "bug_id": 687695, 
    "attempts": 58050, 
    "job_id": "129f75ca440000"
  }, 
  {
    "bug_id": 820088, 
    "attempts": 13100, 
    "job_id": "12d75bca440000"
  }, 
  {
    "bug_id": 820004, 
    "attempts": 39120, 
    "job_id": "12e69552440000"
  }, 
  {
    "bug_id": 814253, 
    "attempts": 52380, 
    "job_id": "12c4e692440000"
  }, 
  {
    "bug_id": 818618, 
    "attempts": 104760, 
    "job_id": "148fe1ec440000"
  }
]

simonhatch commented 6 years ago

I just restarted one to make sure it goes through before running the rest.

dave-2 commented 6 years ago

We should also delete all the extra attempts.

# Trim each attempt list down to its first 120 entries.
# (itervalues() is the Python 2 dict API.)
for attempts in job.state._attempts.itervalues():
  del attempts[120:]

simonhatch commented 6 years ago

OK, these are all cleaned up; they just need to be restarted.

simonhatch commented 6 years ago

OK, a bunch of these have been restarted. I actually just wrote a script to restart them, but only one per bug.
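
For reference, picking one job per bug from a list shaped like the one above could look roughly like this. The restart script itself isn't shown in this thread, so the function name and the "keep the first job seen for each bug" rule are assumptions; the sample entries are copied from the list.

```python
# Hypothetical sketch: the real restart script is not shown in this thread.

def one_job_per_bug(jobs):
    """Return job IDs, keeping only the first job seen for each bug_id."""
    seen_bugs = set()
    selected = []
    for job in jobs:
        if job["bug_id"] in seen_bugs:
            continue  # A job for this bug was already selected.
        seen_bugs.add(job["bug_id"])
        selected.append(job["job_id"])
    return selected


# Three entries from the affected-jobs list; bug 821024 appears twice.
jobs = [
    {"bug_id": 821024, "attempts": 61350, "job_id": "16c0a540c40000"},
    {"bug_id": 821411, "attempts": 60620, "job_id": "14f0e920c40000"},
    {"bug_id": 821024, "attempts": 60790, "job_id": "16977b00c40000"},
]

print(one_job_per_bug(jobs))  # ['16c0a540c40000', '14f0e920c40000']
```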