catapult-project / catapult

Deprecated Catapult GitHub. Please instead use the http://crbug.com "Speed>Benchmarks" component for bugs, and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

Dashboard - Do we need both indexes for internal_only? #4440

Closed simonhatch closed 5 years ago

simonhatch commented 6 years ago

Do we need both of these indexes? Isn't it sufficient to declare one composite index with internal_only, and just not use that field when doing privileged queries?

@anniesullie @eakuefner

anniesullie commented 6 years ago

I think the composite one is the only one that is needed. But I don't know how to test.

simonhatch commented 6 years ago

I meant that index.yaml typically has two composite indexes for each query, one with internal_only and one without, but it's not clear whether we need the one that doesn't have internal_only.
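The doubled pattern looks roughly like this in index.yaml (property names here are illustrative, not copied from Catapult's actual file):

```yaml
# One composite index for privileged queries that don't filter on internal_only...
- kind: TestMetadata
  properties:
  - name: bot_name
  - name: suite_name
# ...and a second one for unprivileged queries that do.
- kind: TestMetadata
  properties:
  - name: internal_only
  - name: bot_name
  - name: suite_name
```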

anniesullie commented 6 years ago

Ah, I see. At least when I tried it last, the queries that didn't check internal_only would fail without their own index. I may have been holding it wrong?

simonhatch commented 6 years ago

Ahh ok, was wondering if they could somehow be collapsed instead of having 2 of each. Sounds like maybe they can't.

simonhatch commented 6 years ago

Had a random thought after reading Ben's alerts API CL utilizing the zigzag merge feature: we could maybe turn the datastore PreCallHook into a PostCallHook and filter out internal_only as a post step. That would let us remove this double-index thing.

benshayden commented 6 years ago

I would be concerned that filtering by internal_only post-hoc rather than in the query would artificially limit the number of results. The fetch limit is applied after the query filters but before the post-hoc filtering. You could run the query in a loop until you find enough matching entities, but that could significantly impact request latency.
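A pure-Python sketch of the concern (not Catapult code; the fake paging stands in for Datastore query pages): the fetch limit is applied before any post-hoc filtering, so a single page can come up short, and looping over pages to compensate costs extra round trips.

```python
# Fake datastore: 100 entities, every third one internal-only.
ENTITIES = [{"id": i, "internal_only": i % 3 == 0} for i in range(100)]

def fetch_page(offset, limit):
    """Stand-in for one Datastore query page: the limit is applied
    BEFORE any post-hoc filtering the caller does."""
    return ENTITIES[offset:offset + limit]

def fetch_external(limit):
    """Loop over pages, filtering internal_only post-hoc, until we
    have `limit` external entities or run out of data."""
    results, offset = [], 0
    while len(results) < limit and offset < len(ENTITIES):
        page = fetch_page(offset, limit)
        offset += len(page)
        results.extend(e for e in page if not e["internal_only"])
    return results[:limit]

# A naive post-hoc filter over one page of 10 returns fewer than 10
# results; the loop makes up the shortfall with more page fetches.
naive = [e for e in fetch_page(0, 10) if not e["internal_only"]]
looped = fetch_external(10)
print(len(naive), len(looped))  # -> 6 10
```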

Alternatively, you could restructure the indexes to take advantage of zigzag merge. This should significantly reduce the storage size of the indexes, and only increase query latency slightly.

```yaml
- kind: TestMetadata
  properties:
  - name: master_name
  - name: bot_name
  - name: suite_name
  - name: test_part1_name
  - name: test_part2_name
  - name: test_part3_name
  - name: test_part4_name
  - name: test_part5_name
  - name: key
- kind: TestMetadata
  properties:
  - name: internal_only
  - name: key
```

I'm actually not sure about that name: key part. I just noticed that list_tests orders by key. I'm also not sure if that actually has any effect. If the name: key part is not necessary, then you don't need to explicitly define an index over just internal_only because the datastore automatically provides single-property indexes as a built-in.

Alternatively2, you could use an inequality filter for internal_only and remove the index that doesn't include internal_only altogether. This would halve the index storage cost. The only downside is that queries cannot contain more than one inequality filter, so you wouldn't be able to use this trick for any other properties.

```python
query = query.filter(graph_data.TestMetadata.internal_only != None)
```

That would match both internal_only=True and internal_only=False.
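A small pure-Python sketch of why that inequality matches both values (the real filter runs inside Datastore, not in Python, and unset properties are absent from the index entirely):

```python
# Entities with internal_only set to True, False, or effectively unset.
entities = [
    {"name": "a", "internal_only": True},
    {"name": "b", "internal_only": False},
    {"name": "c", "internal_only": None},  # property effectively unset
]

# `internal_only != None` behaves like "is not None": both the True and
# False rows match; the unset row does not.
matched = [e["name"] for e in entities if e["internal_only"] is not None]
print(matched)  # -> ['a', 'b']
```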

Alternatively3, we could hold off refactoring the TestMetadata indexes until we can discuss how V2SPA's descriptor concept might affect the shape of that Model.

benshayden commented 6 years ago

Alternatively4, I think we can remove all TestMetadata indexes safely. Indexes are only necessary when ordering by something (other than key ascending, which is the default order) or using an inequality filter. When not ordering (or ordering by key ascending) and not using an inequality filter, it doesn't appear to require a manual index. I don't see any TestMetadata queries that order by something other than key ascending or use inequality filters. I can start sending out some CLs to remove one TestMetadata index at a time, and wait to vacuum them (one per weekend) starting the weekend after next, if that sounds ok? I can also start looking at the Row indexes and queries since those are the biggest.
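The rule of thumb above can be sketched as a hypothetical helper (this is not an App Engine API, and it ignores zigzag-merge subtleties; it only encodes the stated rule):

```python
def needs_manual_index(inequality_filters, order):
    """Rule of thumb: equality-only queries with the default
    key-ascending order are served by Datastore's built-in indexes;
    an inequality filter or a non-default sort order calls for a
    composite index in index.yaml."""
    return bool(inequality_filters) or order not in ([], ["key"])

# Equality-only TestMetadata queries, default order: no manual index.
print(needs_manual_index([], []))             # -> False
print(needs_manual_index([], ["key"]))        # -> False
# A Row query ordered by descending revision still needs one.
print(needs_manual_index([], ["-revision"]))  # -> True
print(needs_manual_index(["timestamp"], []))  # -> True
```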

benshayden commented 6 years ago

Row composite indexes:

  1. parent_test, revision, value
  2. parent_test, -revision, value
  3. parent_test, revision, timestamp, value
  4. parent_test, -revision
  5. parent_test, revision
  6. parent_test, -timestamp

Each of those 6 composite indexes costs about 2TB.
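For reference, entry 2 in that list would be spelled like this in index.yaml (ascending is the default direction, so only desc needs to be explicit):

```yaml
- kind: Row
  properties:
  - name: parent_test
  - name: revision
    direction: desc
  - name: value
```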

Row queries:

Recommendations:

simonhatch commented 6 years ago

Update from chat w/ Ben: Next steps, resave all suite-level entities and strip the monitored property, then retest the query from update_test_suites to see if we can remove the projection query.

benshayden commented 5 years ago

Moved to https://bugs.chromium.org/p/chromium/issues/detail?id=918191