catapult-project / catapult

Deprecated Catapult GitHub mirror. Please use the http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing the source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

Speed up Job loading #4437

Closed — dave-2 closed this issue 6 years ago

dave-2 commented 6 years ago

JobState.AsDict() takes about half a second. A page like the stats page may need to load hundreds or thousands of Jobs, so it can take minutes to load.

I've found that ~80% of the time is in loading the repositories dict using namespaced_stored_object. Commit.AsDict() looks up repository_url twice, and takes 0.01 s - 0.02 s per lookup. 15 Commits × 2 lookups × 0.015 s = 0.45 s. Theoretically, this operation should always hit the in-context cache and take microseconds per lookup.
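If the lookup were cached per request, only the first access would pay the datastore cost and the arithmetic above would collapse to a single fetch. A minimal sketch of that idea in plain Python (the names `repository_url` and the counter are hypothetical stand-ins; the real code path goes through namespaced_stored_object, not shown here):

```python
import functools

# Hypothetical counter standing in for expensive datastore round trips.
lookup_count = {'n': 0}

@functools.lru_cache(maxsize=None)
def repository_url(name):
    # Each cache miss simulates one ~15 ms fetch; hits are free.
    lookup_count['n'] += 1
    return 'https://example.googlesource.com/' + name

# 15 Commits x 2 lookups each, but only the first lookup actually fetches.
for _ in range(15):
    repository_url('chromium')
    repository_url('chromium')

print(lookup_count['n'])  # prints 1
```

This is the behavior the in-context cache should already provide; the issue is that in practice it doesn't.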

We could either try to speed up namespaced_stored_object or create a Repository ndb entity and put the repositories into the Datastore directly.

@anniesullie @simonhatch

simonhatch commented 6 years ago

Any idea what in there is the slowdown? Guessing it's the deserialization on each access, in which case caching the deserialized data would be an easy route.
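The "cache the deserialized data" route could be as simple as a module-level dict keyed by namespace and key, so JSON deserialization happens at most once per process. A sketch under that assumption (`get_cached` and `_DESERIALIZED_CACHE` are hypothetical names, not the stored_object API):

```python
import json

# Hypothetical process-level cache of already-deserialized values.
_DESERIALIZED_CACHE = {}

def get_cached(namespace, key, serialized_blob):
    """Deserialize serialized_blob once, then serve repeat reads from memory."""
    cache_key = (namespace, key)
    if cache_key not in _DESERIALIZED_CACHE:
        _DESERIALIZED_CACHE[cache_key] = json.loads(serialized_blob)
    return _DESERIALIZED_CACHE[cache_key]

blob = '{"chromium": "https://chromium.googlesource.com/chromium/src"}'
repos = get_cached('default', 'repositories', blob)
print(repos['chromium'])  # prints https://chromium.googlesource.com/chromium/src
```

The usual caveat with process-level caches on App Engine is staleness across instances, which is why the follow-up discussion below leans toward a dedicated entity instead.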

dave-2 commented 6 years ago

I don't think it's the deserialization. Recall that there was a deeper investigation into stored_object in #3834. I believe it's partially due to the multiple fetches for the different PartEntitys (2x slowdown?), partially due to the namespace lookup (2x+ slowdown?), partially due to the overhead of using all Async methods (2x-3x slowdown), and partially due to caching behavior.

I think the easiest thing would be to create a Repository ndb entity. Something simple like this takes about 0.0003 s per lookup -- a 50x speedup.

import time

from google.appengine.ext import ndb


class Repository(ndb.Model):
  urls = ndb.StringProperty(repeated=True)


# Seed one entity, then time a keyed get (served by the in-context cache).
Repository(id='chromium',
           urls=['https://chromium.googlesource.com/chromium/src']).put()

start_time = time.time()
ndb.Key(Repository, 'chromium').get().urls[0]
print time.time() - start_time