Confirm-Solutions / confirmasaurus


Scale queries to 100 billion tiles. #329

Open tbenthompson opened 1 year ago

tbenthompson commented 1 year ago

New thoughts

CHUNKING. CHUNKING. CHUNKING. We can track the minimum of the bootstrapped lambda values for every chunk, and then cheaply select the chunks with the lowest minima.

This is somewhat inefficient because a selected chunk will contain some tiles that don't need to be refined or deepened, but the efficiency loss should be in the single-digit percent range. When we insert chunks, we can group tiles so that each chunk is very likely to either need refinement/deepening in its entirety or not at all.

If we simulate 100k tiles in a packet sent to a worker and split that packet into 100 chunks so that the tile grouping can be tight (the tiles in a group will generally either all need refinement/deepening or all not), then each chunk will hold 1000 tiles and the cost of all our queries will drop by a factor of 1000. A rough sketch of the bookkeeping is below.
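A minimal sketch of that bookkeeping, assuming each worker packet arrives as a numpy array of per-tile bootstrapped lambda minima (the names and chunk size here are hypothetical illustrations, not the actual confirmasaurus schema):

```python
import numpy as np

CHUNK_SIZE = 1000  # tiles per chunk (assumed)

def summarize_packet(tile_lams: np.ndarray):
    """Split a packet of per-tile lambda values into chunks and keep only
    the per-chunk minimum, which is all the selection query needs to scan."""
    n_chunks = int(np.ceil(tile_lams.shape[0] / CHUNK_SIZE))
    chunk_ids = np.arange(tile_lams.shape[0]) // CHUNK_SIZE
    chunk_min = np.array([tile_lams[chunk_ids == c].min() for c in range(n_chunks)])
    return chunk_ids, chunk_min

def select_chunks(chunk_min: np.ndarray, n_select: int):
    """Pick the chunks with the smallest minimum lambda. Every tile in a
    selected chunk gets loaded, even the few that don't need refinement."""
    return np.argsort(chunk_min)[:n_select]

# A 100k-tile packet becomes 100 chunk summaries, so the global selection
# query scans 1000x fewer rows than a per-tile query would.
packet = np.random.uniform(0.01, 0.05, size=100_000)
chunk_ids, chunk_min = summarize_packet(packet)
worst_chunks = select_chunks(chunk_min, n_select=5)
```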

Original thoughts

Scaling to 100 billion tiles requires another architectural change to what we're doing.

Tile selection:

Minimizing bootstrapped lambda*

The more severe problem is taking the minimum of the bootstrapped lambda values for each bootstrap. Taking a minimum over 50 different columns is a similar challenge to the ones above and the same solutions will help, but it's more difficult because of the sheer amount of data: 50 bootstraps * 8 bytes * 100e9 tiles = 40 TB. So the minimum query will require being smart about excluding large subsets of data and caching intermediate results, all while working with a pile of data that doesn't fit on a single disk… This problem really fits well with having a distributed database. Otherwise, we will need to fight with a lot of shit.
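As a sanity check on the data volume and on what the query actually computes, here is a tiny numpy illustration (the in-memory array is a toy stand-in; the real values live in the database):

```python
import numpy as np

n_tiles = 100e9      # target tile count
n_boot = 50          # bootstrap replicates per tile
bytes_per_value = 8  # float64
total_bytes = n_tiles * n_boot * bytes_per_value
print(f"{total_bytes / 1e12:.0f} TB")  # -> 40 TB

# On a toy slice, the query is just a column-wise minimum:
lam_boot = np.random.uniform(0.01, 0.05, size=(10_000, n_boot))
per_bootstrap_min = lam_boot.min(axis=0)  # one minimum per bootstrap
overall_min = per_bootstrap_min.min()
```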

Solutions:

  1. Clever table structuring in Clickhouse. One idea is to have two separate tables: one table of "important" tiles that might be refined in the next few steps, and another table of tiles that are likely to be unimportant because they are nowhere close to being the most important tiles (see the sketch after this list).
  2. Consider other distributed databases that might be more amenable to deletes and updates. DynamoDB comes to mind here. Replacing the database layer is not an insane amount of work.
  3. Embrace the imprecision of the Adagrid process. At every step, it's more important to be doing something useful than to be doing the optimally useful thing. So, having an imprecise calculation of bias and an imprecise tile selection process is okay. At the end of the day, two things matter: 1) having a good guess of the lambda** value that will come out of the final resimulation of the final grid, and 2) being deterministic/reproducible.
    • So this suggests that we could just treat adagrid as maxing out at 5 billion tiles or so. When we go above that, we can loop through the separate adagrid problems in sequence and improve each "subproblem" individually.
    • When we insert tiles into the database, we could group them into chunks of 1000 tiles or so that have generally similar lambda values. Then, we select groups of tiles by the minimum lambda of each group. This kind of technique should intuitively reduce the scale of all the queries by about a factor of the chunk size while mildly reducing efficiency.
  4. I'm sure there are more ideas!!
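Here is a rough sketch of the two-table structuring from item 1: a small "active" table of tiles that could plausibly be selected soon, and a large "inactive" table for everything else. The `db` object, table names, and columns are hypothetical stand-ins, not the actual confirmasaurus schema; the SQL uses ClickHouse-style ALTER ... DELETE mutations.

```python
ACTIVE = "tiles_active"
INACTIVE = "tiles_inactive"

def select_worst_tiles(db, n: int):
    """The hot-path selection query only ever touches the small table."""
    return db.query(
        f"SELECT id, min_lam FROM {ACTIVE} ORDER BY min_lam ASC LIMIT {n}"
    )

def rebalance(db, threshold: float):
    """Occasionally demote tiles whose lambda is far above the current
    minimum and promote inactive tiles that have drifted below the
    threshold. This is cheap because it runs rarely and in bulk."""
    db.query(f"INSERT INTO {INACTIVE} SELECT * FROM {ACTIVE} WHERE min_lam > {threshold}")
    db.query(f"ALTER TABLE {ACTIVE} DELETE WHERE min_lam > {threshold}")
    db.query(f"INSERT INTO {ACTIVE} SELECT * FROM {INACTIVE} WHERE min_lam <= {threshold}")
    db.query(f"ALTER TABLE {INACTIVE} DELETE WHERE min_lam <= {threshold}")
```

The design choice here is that the expensive min queries run against a table that is orders of magnitude smaller, while the bulk moves between tables happen infrequently and don't sit on the critical path.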
tbenthompson commented 1 year ago

James has a cool clustering variant of this idea: https://kevlar-talk.slack.com/archives/C03DEHDKGM6/p1680799432470429?thread_ts=1680797471.709839&cid=C03DEHDKGM6