catapult-project / catapult

Deprecated Catapult GitHub. Please instead use the http://crbug.com "Speed>Benchmarks" component for bugs, and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

Dashboard - Do we need both indexes for internal_only? #4440

Closed simonhatch closed 5 years ago

simonhatch commented 6 years ago

Do we need both of these indexes? Isn't it sufficient to declare one composite index with internal_only, and just not use that field when doing privileged queries?

@anniesullie @eakuefner

anniesullie commented 6 years ago

I think the composite one is the only one that is needed. But I don't know how to test.

simonhatch commented 6 years ago

I meant that index.yaml typically has two composite indexes for each query, one with internal_only and one without, but it's not clear whether we need the one that doesn't have internal_only.
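The doubled pattern looks roughly like this in index.yaml (property names here are illustrative, not copied from Catapult's actual file):

```yaml
# One composite index for privileged queries that don't filter on internal_only...
- kind: TestMetadata
  properties:
  - name: bot_name
  - name: suite_name
# ...and a second one for unprivileged queries that do.
- kind: TestMetadata
  properties:
  - name: internal_only
  - name: bot_name
  - name: suite_name
```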

anniesullie commented 6 years ago

Ah, I see. At least when I tried it last, the queries that didn't check internal_only would fail without their own index. I may have been holding it wrong?

simonhatch commented 6 years ago

Ahh ok, was wondering if they could somehow be collapsed instead of having 2 of each. Sounds like maybe they can't.

simonhatch commented 6 years ago

Had a random thought after reading Ben's alerts API CL utilizing the zigzag merge feature: we could maybe turn the datastore PreCallHook into a PostCallHook and filter out internal_only as a post step. That would let us remove this double-index thing.

benshayden commented 6 years ago

I would be concerned that filtering by internal_only post-hoc rather than in the query would artificially limit the number of results. The fetch limit is applied after the query filters but before the post-hoc filtering. You could run the query in a loop until you find enough matching entities, but that could significantly impact request latency.
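A pure-Python sketch of the concern (not Catapult code; the fake paging stands in for Datastore query pages): the fetch limit is applied before any post-hoc filtering, so a single page can come up short, and looping over pages to compensate costs extra round trips.

```python
# Fake datastore: 100 entities, every third one internal-only.
ENTITIES = [{"id": i, "internal_only": i % 3 == 0} for i in range(100)]

def fetch_page(offset, limit):
    """Stand-in for one Datastore query page: the limit is applied
    BEFORE any post-hoc filtering the caller does."""
    return ENTITIES[offset:offset + limit]

def fetch_external(limit):
    """Loop over pages, filtering internal_only post-hoc, until we
    have `limit` external entities or run out of data."""
    results, offset = [], 0
    while len(results) < limit and offset < len(ENTITIES):
        page = fetch_page(offset, limit)
        offset += len(page)
        results.extend(e for e in page if not e["internal_only"])
    return results[:limit]

# A naive post-hoc filter over one page of 10 returns fewer than 10
# results; the loop makes up the shortfall with more page fetches.
naive = [e for e in fetch_page(0, 10) if not e["internal_only"]]
looped = fetch_external(10)
print(len(naive), len(looped))  # -> 6 10
```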

Alternatively, you could restructure the indexes to take advantage of zigzag merge. This should significantly reduce the storage size of the indexes, and only increase query latency slightly.

```yaml
- kind: TestMetadata
  properties:
  - name: master_name
  - name: bot_name
  - name: suite_name
  - name: test_part1_name
  - name: test_part2_name
  - name: test_part3_name
  - name: test_part4_name
  - name: test_part5_name
  - name: key
- kind: TestMetadata
  properties:
  - name: internal_only
  - name: key
```

I'm actually not sure about that name: key part. I just noticed that list_tests orders by key. I'm also not sure if that actually has any effect. If the name: key part is not necessary, then you don't need to explicitly define an index over just internal_only because the datastore automatically provides single-property indexes as a built-in.

Alternatively2, you could use an inequality filter for internal_only and remove the index that doesn't include internal_only altogether. This would halve the index storage cost. The only downside is that queries cannot contain more than one inequality filter, so you wouldn't be able to use this trick for any other properties.

```python
query = query.filter(graph_data.TestMetadata.internal_only != None)
```

That would match both internal_only=True and internal_only=False.
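A small pure-Python sketch of why that inequality matches both values (the real filter runs inside Datastore, not in Python, and unset properties are absent from the index entirely):

```python
# Entities with internal_only set to True, False, or effectively unset.
entities = [
    {"name": "a", "internal_only": True},
    {"name": "b", "internal_only": False},
    {"name": "c", "internal_only": None},  # property effectively unset
]

# `internal_only != None` behaves like "is not None": both the True and
# False rows match; the unset row does not.
matched = [e["name"] for e in entities if e["internal_only"] is not None]
print(matched)  # -> ['a', 'b']
```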

Alternatively3, we could hold off refactoring the TestMetadata indexes until we can discuss how V2SPA's descriptor concept might affect the shape of that Model.

benshayden commented 6 years ago

Alternatively4, I think we can remove all TestMetadata indexes safely. Indexes are only necessary when ordering by something (other than key ascending, which is the default order) or using an inequality filter. When not ordering (or ordering by key ascending) and not using an inequality filter, it doesn't appear to require a manual index. I don't see any TestMetadata queries that order by something other than key ascending or use inequality filters. I can start sending out some CLs to remove one TestMetadata index at a time, and wait to vacuum them (one per weekend) starting the weekend after next, if that sounds ok? I can also start looking at the Row indexes and queries since those are the biggest.
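The rule of thumb above can be sketched as a hypothetical helper (this is not an App Engine API, and it ignores zigzag-merge subtleties; it only encodes the stated rule):

```python
def needs_manual_index(inequality_filters, order):
    """Rule of thumb: equality-only queries with the default
    key-ascending order are served by Datastore's built-in indexes;
    an inequality filter or a non-default sort order calls for a
    composite index in index.yaml."""
    return bool(inequality_filters) or order not in ([], ["key"])

# Equality-only TestMetadata queries, default order: no manual index.
print(needs_manual_index([], []))             # -> False
print(needs_manual_index([], ["key"]))        # -> False
# A Row query ordered by descending revision still needs one.
print(needs_manual_index([], ["-revision"]))  # -> True
print(needs_manual_index(["timestamp"], []))  # -> True
```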

benshayden commented 6 years ago

Row composite indexes:

  1. parent_test, revision, value
  2. parent_test, -revision, value
  3. parent_test, revision, timestamp, value
  4. parent_test, -revision
  5. parent_test, revision
  6. parent_test, -timestamp

Each of those 6 composite indexes costs about 2TB.
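For reference, entry 2 in that list would be spelled like this in index.yaml (ascending is the default direction, so only desc needs to be explicit):

```yaml
- kind: Row
  properties:
  - name: parent_test
  - name: revision
    direction: desc
  - name: value
```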

Row queries:

Recommendations:

simonhatch commented 6 years ago

Update from chat w/ Ben: Next steps, resave all suite-level entities and strip the monitored property, then retest the query from update_test_suites to see if we can remove the projection query.

benshayden commented 5 years ago

Moved to https://bugs.chromium.org/p/chromium/issues/detail?id=918191