Open ByronBecker opened 2 years ago
Thanks for this report, @ByronBecker!
I wonder if you could repeat the experiment with another existing base structure in place of the red-black tree.
If you are only inserting, for instance, it would be interesting to use lists and always prepend (a constant-time operation). Since that is guaranteed constant time, it would be an interesting comparison to see what limit, if any, you reach with that data structure in place of where you currently use the RBTree. WDYT?
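As a rough sketch of what that experiment would measure (a Python analogue with abstract step counts, not IC instructions; `prepend_cost` and `balanced_insert_cost` are illustrative names, not part of any library): list prepend does constant work per insert regardless of size, while any ordered structure pays at least one root-to-leaf walk, so the prepend numbers give a floor for per-insert cost.

```python
import math

def prepend_cost(n: int) -> int:
    # Prepending to a cons list touches one cell, independent of n.
    return 1

def balanced_insert_cost(n: int) -> int:
    # A balanced search tree walks one root-to-leaf path,
    # roughly log2(n) steps (ignoring rebalancing work).
    return max(1, math.ceil(math.log2(n + 1)))

for n in (1_000, 1_000_000):
    print(n, prepend_cost(n), balanced_insert_cost(n))
```

The gap between the two curves is exactly the overhead that ordering and rebalancing add on top of "just store the data".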
BTW -- I have been preoccupied with family life recently, but still plan to implement the imperative B-Tree asap. Of course, that comparison will be interesting too, when we can do it.
Right, I could totally test out the LinkedList. The only issue is that I also want performant lookups, since I expect the database to be more read-heavy than write-heavy. O(n) lookups would most likely hit the instruction limit once the data is large enough.
Yep, makes sense that linked lists will be impractical. But they would be useful (in the insertion-only workload) for seeing what the best possible performance could be, assuming you don't worry about the order as you insert. It's an experiment to get a lower bound for insertion instruction cost (no rebalancing, only storing data). OTOH, as we discussed elsewhere, this version is not actually useful for implementing CanDB, so it would only be to gather data and get this bound.
What Motoko GC are you using?
Now I understand a bit better what you were saying - that would be a good comparison to see also if the canister heap size has any impact on what should be an O(1) operation.
I've been using the default that ships with 0.10.0 - is that `--copying-gc`? Should I try to use this https://github.com/dfinity/motoko/pull/3231?
> I've been using the default that ships with 0.10.0 - is that `--copying-gc`?

Yep, we cannot make `--compacting-gc` the default until the replica supports DTS (deterministic time slicing).
@ByronBecker I can't see the code where you scale and presumably instantiate another canister. Is it possible that the spun up canisters aren't being provisioned with enough cycles before calling the class constructor?
@crusso This is where the canister gets spun up https://github.com/canscale/CanDB/blob/auto-scaling-poc/src/CanDB.mo#L173
```motoko
func scaleCanister(db: DB): async () {
  let indexCanister = actor(db.scalingOptions.autoScalingCanisterId): actor {
    createAdditionalCanisterForPK: shared (pk: Text) -> async Text
  };
  try {
    let newCanisterId = await indexCanister.createAdditionalCanisterForPK(db.pk);
    Debug.print("canister creation success for pk=" # db.pk # ", canisterId=" # newCanisterId)
  } catch (err) {
    db.scalingStatus := #not_started;
    Debug.print("error auto-scale creating canister in CanDB: " # Error.message(err));
  };
};
```
The line `let newCanisterId = await indexCanister.createAdditionalCanisterForPK(db.pk);` calls a method that it expects the index canister to have. I've created a templating tool that requires this method to be present on the canister responsible for auto-scaling. Here's how that `createAdditionalCanisterForPK` method is used.
Claudio, I just gave you access to both of these repos (candb-client-typescript & candb-quickstart-template). Others can request access if interested.

I can almost assure you that this is not a cycles issue, as the canister that I spun up from the index canister (the one receiving and inserting into the RBTree) has not run out of cycles (it had over 6T cycles before I spun it down).

Also, this happened before reaching the canister scaling limit I had set, a heap size of 1.25GB. The error occurred several times in a heap-size range of 290MB-500MB, eventually topping out at 490MB, where I was unable to insert a single row without hitting this limit.
Is it possible there's a bug in RBT code that causes it to go into an infinite loop that is broken by the cycles bound?
My guess was this:

> My guess is that this could either be due to inefficiencies in re-balancing the Red-Black Tree, or that the tree has gotten so deep that recursive operations are much more costly (by more than a linear factor) due to the call stack or some other factor.
I did a pretty large amount of testing for my RangeTree data structure (essentially a typed RBTree), and the code keeps everything ordered, but it remains to be seen how these data structures respond on the main network with large amounts of data (i.e. 1-2GB heap size). Much of the feedback I've seen is anecdotal.

I did not, however, test the node ordering to ensure that the tree stays balanced.
Tbh, I'm at a loss. The one thing I'm hoping is that Deterministic Time Slicing will free us of these performance cliffs.
According to Wikipedia, the RB-tree operations should be worst case O(log n), so I'm not sure what could be going wrong unless there's a bug.
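To put a number on that bound (a quick back-of-the-envelope sketch; `rb_max_height` is an illustrative name, not part of any library): a red-black tree holding n keys has height at most 2·log2(n + 1), so even at millions of entries an insertion only walks a few dozen nodes.

```python
import math

def rb_max_height(n: int) -> int:
    # Worst-case height of a red-black tree holding n keys:
    # h <= 2 * log2(n + 1) (the standard red-black tree bound).
    return math.floor(2 * math.log2(n + 1))

# A tree of ten million entries is at most ~46 levels deep, so the
# O(log n) search path itself is cheap; if the cycle limit is being
# hit, rebalancing bugs or GC work are likelier suspects.
print(rb_max_height(10_000_000))  # 46
print(rb_max_height(2**20 - 1))   # 40
```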
Is it possible you've got some expensive debug code still enabled by mistake, perhaps producing large text representation of the tree?
I do have the Debug code enabled and running (both locally and when deployed to the main network). When I run this locally, this is the logging I'm seeing:
```
[Canister xcrtx-5aaaa-aaaaa-aahia-cai] creating new comment canister with pk=comment_2021-07-15
[Canister xcrtx-5aaaa-aaaaa-aahia-cai] index prinicpal = xcrtx-5aaaa-aaaaa-aahia-cai
[Canister xcrtx-5aaaa-aaaaa-aahia-cai] index prinicpal = xcrtx-5aaaa-aaaaa-aahia-cai
[Canister xzups-hyaaa-aaaaa-aahkq-cai] received comments of size=235
[Canister xcrtx-5aaaa-aaaaa-aahia-cai] index prinicpal = xcrtx-5aaaa-aaaaa-aahia-cai
[Canister xcrtx-5aaaa-aaaaa-aahia-cai] index prinicpal = xcrtx-5aaaa-aaaaa-aahia-cai
[Canister xzups-hyaaa-aaaaa-aahkq-cai] received comments of size=500
...
... the same three-line pattern keeps repeating for each new message upload
```
Where `received comments of size=<xxx>` is what's logged for each upload batch (message).
I noticed that the implementation of put does an ignore so it won't wait for the update to complete. Is that deliberate? Could that somehow flood the canister with puts that don't get throttled by waiting for them.
BTW, since dfx 0.10.0, you should be able to run on a local replica with accurate cycle accounting and budgets, at least according to https://forum.dfinity.org/t/dfx-0-10-0-is-promoted/12870?u=claudio.
Have you tried?
> I noticed that the implementation of put does an ignore so it won't wait for the update to complete. Is that deliberate? Could that somehow flood the canister with puts that don't get throttled by waiting for them.

Definitely a possibility and a potential improvement. However, from my initial report:

> I even made a special csv with just 1 row at this point and tried to insert it into the Red-Black Tree on this 484MB heap sized canister and still received the same error
It could still definitely be the case that for the single row, the 3 subsequent calls at this point overwhelm the replica, but then why was I able to insert 100 rows in the prior round with this same code and no issue?
> BTW, since dfx 0.10.0, you should be able to run on a local replica with accurate cycle accounting and budgets, at least according to https://forum.dfinity.org/t/dfx-0-10-0-is-promoted/12870?u=claudio. Have you tried?
I am using 0.10.0. I made a very weak attempt but have not successfully gotten this working; I'll try again in the coming week.
Also, according to the definition of `ignore` in the language manual (https://internetcomputer.org/docs/current/developer-docs/build/languages/motoko/language-manual#ignore):

> The expression `ignore <exp>` evaluates `<exp>` (typically for some side-effect) but discards its value. Ignore is useful for evaluating an expression within a sequence of declarations when that expression has non-unit type (and the simpler declaration would be ill-typed). Then the semantics is equivalent to `let _ = <exp> : Any`.
This makes it sound like `ignore` would evaluate the full expression first (including any asynchronous behavior) and then discard its value.

Still, you make a good point that the following three versions of this function could have different behavior in terms of how they are executed by the caller:
```motoko
// 1)
public func put(db: DB, options: PutOptions): async () {
  ignore replace(db, options);
};

// 2)
public func put(db: DB, options: PutOptions): async () {
  ignore await replace(db, options);
};

// 3)
public func put(db: DB, options: PutOptions): async () {
  let _ = await replace(db, options);
};
```
I'm assuming you'd recommend option/version 3, but I'm curious how 2 vs. 3 would be interpreted (i.e. for 2, would `ignore` await the inner expression, essentially making 2 and 3 the same?).
2 and 3 are essentially equivalent and force the put to wait until the (async) replace has completed before returning to the caller.
1 differs from 2 and 3 in just initiating the replace, but not waiting for it to complete before returning (and discarding both the result and any errors thrown by replace).
In words:

- 1 says: queue the replace message (or function call) and return immediately, without waiting for that call to complete or propagating its errors.
- 2 and 3 say: queue the replace message and wait for the replace to complete, discarding the result but propagating any errors.
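To make the execution difference concrete, here is a rough Python asyncio analogue (an analogy only; IC message scheduling is not an asyncio event loop, and `put_v1`/`put_v3` are illustrative names): the fire-and-forget version returns before the inner update has run, while the awaited version returns only after it completes.

```python
import asyncio

results = []

async def replace(x):
    # stands in for the inner async update
    results.append(x)
    return x

async def put_v1(x):
    # analogue of version 1 (`ignore replace(...)`): schedule the
    # call and return immediately, discarding its result and any
    # errors it might raise
    asyncio.create_task(replace(x))

async def put_v3(x):
    # analogue of versions 2/3 (`let _ = await replace(...)`):
    # wait for the update to complete, then discard the result
    _ = await replace(x)

async def main():
    await put_v3(1)
    print(results)          # [1]: the update has already run
    await put_v1(2)
    print(results)          # still [1]: replace(2) is only scheduled
    await asyncio.sleep(0)  # yield so the scheduled task can run
    print(results)          # [1, 2]

asyncio.run(main())
```

This is also why version 1 can "flood" the callee: nothing in the caller's control flow waits for the previous update before issuing the next one.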
@ByronBecker Did this get resolved? I am certain it is the GC that causes the limit. At the heap size that you mentioned there are most likely sufficiently many objects on the heap. The GC touches them all and doing so reaches the cycle limit. You reached a point where GC is going to be triggered by the very next message, even if the message is short (a single insertion). It has nothing to do with the data structure. You will hit the same limit if you store your data in an array or some other way.
@timohanke
I've been communicating with the Motoko team over the past 8 months or so with the goal of resolving these issues. Deterministic Time Slicing is helping a lot to kick limitations down the road, and @luc-blaeser is doing some great work to improve the GC with his generational GC and an in-progress incremental GC.

We've come a long way since I first ran into this during Supernova, but there are still limits, and I'm hopeful that raised DTS limits and the new incremental GC will get us to the 4GB heap max.
@matthewhammer @crusso
I've recently moved from doing local testing to main network testing, and one of the large differences between testing locally and on the main network is limitations such as the instruction limit for single message execution.
The issue
On the main network, at a certain point (well below the canister heap size or canister memory limits), Red-Black Trees become unable to insert more records without hitting the instruction limit for single message execution.
The error.
Once I have around 290-500MB (heap size) memory in my canister, insertions of ~30-80KB per message will randomly start to return the following error
What I was doing at the time of the error
For example,
At this heap utilization, I was chunk-uploading 100 rows at a time (30-80KB) and then inserting in the manner shown in the "Some More Context" section below. All of a sudden, I received the error shown in "The issue" section above and was no longer able to perform insertions into the Red-Black Tree without receiving that error. I was able to upgrade the canister and update other stable variables, but nothing more could be inserted into the Red-Black Tree.
I even made a special csv with just 1 row at this point and tried to insert it into the Red-Black Tree on this 484MB heap sized canister and still received the same error
The script I run
My guess is that this could either be due to inefficiencies in re-balancing the Red-Black Tree, or that the tree has gotten so deep that recursive operations are much more costly (by more than a linear factor) due to the call stack or some other factor.
Some More Context
My database project, CanDB, is a Red-Black tree of type `<Text, AttributeValue>`, where `AttributeValue` is a stable primitive.

I'm testing out backfilling a large amount of data into CanDB, from which ~80KB csv chunks are formatted and uploaded to the following API endpoint, `processComments()`.

(For reference, the awaits in the `CanDB.put()` method are only ever called once in a canister's lifetime, and this is for scaling out a new canister. This means that every `CanDB.put()` call is synchronous except the one time it needs to signal that it needs to be scaled out.)

If you are interested, you can find the code for `CanDB.put()` here, as well as the `RangeTree` data structure that sits on top of a Red-Black tree.

Recommendations
I'm curious if digging into the recursive (performance) aspects of the functional data structures in motoko-base might find "easy-win" areas of improvement to avoid the message execution instruction limit.
Also, in addition to the local testing, it would be a good idea to do some main network testing with respect to the performance of imperative, mutable data structure implementations (Red-Black Tree, BTree) against their functional counterparts.
My takeaways
While I enjoy many of the functional aspects and guarantees of the implementations, if I'm releasing a database product the bottom line for me is a data structure that provides consistent performance.
In the coming weeks, I'm also going to test the limitations a bit more, as well as look to see if I can make certain optimizations like putting a short hash table or Trie at the top that might reduce any rebalancing or stack depth issues.