TritonDataCenter / dragnet

event stream analysis
MIT License

indexes should be based on queries, not columns #22

Closed davepacheco closed 10 years ago

davepacheco commented 10 years ago

Currently, to create an index, users specify which columns of the raw data they want to index. The idea is that you can specify N dimensions, and we keep enough information to serve queries that filter or break out on any combination of those dimensions. Obviously, that gets large quickly as N grows, and in practice a single 32-bit Node process doesn't have enough heap space to store the aggregated results of more than a day's worth of Manta logs for a useful set of 7 or 8 fields.
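To make the blowup concrete: the worst-case number of distinct tuples an N-dimension index has to hold is the product of the per-field cardinalities. A minimal sketch, using made-up field names and cardinalities (not measured from real Manta logs):

```javascript
// Hypothetical per-field cardinalities -- illustrative numbers only,
// not measured from actual Manta log data.
var cardinalities = {
    req_method: 8,
    res_statusCode: 50,
    operation: 30,
    hostname: 100,
    remoteAddress: 5000,
    route: 40,
    latency_bucket: 20
};

// An index over N fields must be able to represent any combination of
// their values, so the worst case is the product of the cardinalities.
function worstCaseTuples(fields) {
    return fields.reduce(function (acc, f) {
        return acc * cardinalities[f];
    }, 1);
}

console.log(worstCaseTuples([ 'req_method', 'res_statusCode' ]));
// 400 tuples for 2 fields
console.log(worstCaseTuples(Object.keys(cardinalities)));
// 4,800,000,000,000 for all 7 -- multiplicative growth in N
```

Each additional field multiplies the worst case, which is why 7 or 8 fields overwhelm a single process while 2- or 3-field indexes stay small.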

The initial design of Dragnet was to support dashboards, reports, and threshold-based alarms -- all of which are more constrained because you know ahead of time which queries you're going to want to make. If instead of asking for an index on a set of columns, you gave a list of queries (each of which would only have up to three dimensions in practice), it would take a lot less memory to build an index database that could serve all of those queries.
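A sketch of what a query-based index definition might look like, and how one scan can serve every listed query. The config shape and field names here are hypothetical, not Dragnet's actual format:

```javascript
// Hypothetical index definition: a list of queries, each with at most
// a few breakdowns. Illustrative shape only.
var indexConfig = {
    queries: [
        { name: 'requests_by_operation', breakdowns: [ 'operation' ] },
        { name: 'errors_by_host', breakdowns: [ 'hostname', 'res_statusCode' ] },
        { name: 'latency_by_route', breakdowns: [ 'route', 'latency_bucket' ] }
    ]
};

// One pass over the raw data can feed all of the queries: the set of
// fields to extract is just the union of each query's breakdowns.
function fieldsToExtract(config) {
    var seen = {};
    config.queries.forEach(function (q) {
        q.breakdowns.forEach(function (b) { seen[b] = true; });
    });
    return Object.keys(seen).sort();
}

console.log(fieldsToExtract(indexConfig));
// [ 'hostname', 'latency_bucket', 'operation', 'res_statusCode', 'route' ]
```

The key property is that each query contributes only its own small product of dimensions, rather than all queries sharing one cross product over every field.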

You could argue that Dragnet already does this: you can already create N indexes by hand, one for each query you want to run (and that was kind of the original idea). The assumption behind this ticket is that it's useful for the user's notion of an index to cover a bunch of queries, and for a single scan of the data to produce all of the corresponding indexes. The flip side is that if the user later wants to add a new query, we may end up regenerating the indexes for all of the queries, but there are probably ways of making that more reasonable (e.g., updating the existing sqlite database instead of adding a new one).
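The update-in-place idea amounts to upsert semantics: merge newly aggregated rows into existing per-query tables rather than rebuilding them. A minimal in-memory sketch of those semantics (the real storage would be sqlite, not a map, and this is not Dragnet's actual schema):

```javascript
// Sketch of upsert-style merging: an index maps a serialized dimension
// tuple to an aggregated count. Merging a new batch updates existing
// tuples in place rather than rebuilding the whole index.
function upsertRows(index, rows) {
    rows.forEach(function (row) {
        var key = JSON.stringify(row.dims);
        index[key] = (index[key] || 0) + row.count;
    });
    return index;
}

var index = {};
upsertRows(index, [ { dims: { operation: 'get' }, count: 10 } ]);
upsertRows(index, [
    { dims: { operation: 'get' }, count: 5 },
    { dims: { operation: 'put' }, count: 3 }
]);
console.log(index);
// { '{"operation":"get"}': 15, '{"operation":"put"}': 3 }
```

With this shape, adding a new query only requires scanning data for that query's table; existing tables are merged into incrementally.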

davepacheco commented 10 years ago

There are a few other options for solving the memory usage problem (run 64-bit Node with a large heap; reduce the memory usage of aggregated tuples; or avoid storing data in memory and use a sqlite upsert mechanism instead), but none of them address the problem that the existing approach creates a ton of tuples, and that's always going to be painful. My napkin calculation suggests that the current approach could reasonably generate 276B rows on a day's data, while 20 or so specific-query indexes would generate about 90M rows.
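The gap in the napkin numbers comes from summing many small per-query products instead of materializing one large cross product. Illustrative arithmetic only; the cardinalities below are assumptions, not the figures behind the 276B/90M estimates:

```javascript
// Illustrative arithmetic: these cardinalities are assumed, not the
// ones behind the 276B / 90M napkin figures above.
function product(xs) {
    return xs.reduce(function (a, b) { return a * b; }, 1);
}

// One index over 8 fields: worst case is a single 8-way product.
var allFieldCards = [ 8, 50, 30, 100, 5000, 40, 20, 24 ];
var combined = product(allFieldCards);

// 20 query-specific indexes, each at most 3 dimensions: worst case is
// a *sum* of small products, which stays manageable.
var perQueryCards = [];
for (var i = 0; i < 20; i++) {
    perQueryCards.push([ 100, 50, 30 ]); // <= 150,000 rows per query
}
var separate = perQueryCards.reduce(function (acc, q) {
    return acc + product(q);
}, 0);

console.log(combined);  // 115,200,000,000,000 -- hopeless to materialize
console.log(separate);  // 3,000,000 -- easily within reach
```

Whatever the exact inputs, the asymmetry is structural: the combined index pays a product over all fields, the query list pays only a sum of 3-way products.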

davepacheco commented 10 years ago

I just integrated a pretty big rewrite to make Dragnet work this way.