apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.95k stars 979 forks source link

DRILL-8511: Overflow appeared when the batch reached rows limit #2943

Open rymarm opened 2 months ago

rymarm commented 2 months ago

DRILL-8511: Overflow appeared when the batch reached rows limit

Description

The size-aware scan framework fails to end the batch.

Framework tries to reallocate the vector on batch end, due to a hidden, minor bug in BitColumnWriter - which in general is not notable, but in a specific case, when the initial vector allocation size limit is exceeded and a reader reaches the batch row size limit.

BitColumnWriter uses instead of a write index a value count and this causes unexpected vector reallocation (look at the changes).

Documentation

No changes required.

Testing

Manual tests

paul-rogers commented 2 months ago

@rymarm, thanks for finding this issue. I believe your fix is correct. The prepareWrite() call exists to fill any remaining bits with 0s (i.e. "fill empties."). The number to be filled is, indeed, valueCount - 1. So, I think this is a good fix.

I would feel more confident if we had a unit test. That this bug exists suggests that, despite the rather excessive number of tests I wrote way back when, I somehow omitted this case. Perhaps a reason is that the bit vector is special: it uses code different than the other vectors, and so needs its own tests. Tests for the fixed-width vectors are in TestFixedWidthWriter and TestFillEmpties. Perhaps you can whip up a TestBitVector that does a few tests, including for the special case you uncovered. In particular, it would be good to make sure that the following cases work:

This gives us 9 cases, some that will force a resize, some that should just fit in the initial allocation.

Then, read the bits out of the vector, verifying that the "empty" bits are the defaultValue (the "default default" is 0).

Why write tests? It is less embarrassing for a test to point out a problem than to find it in a production system. I apologize that I somehow missed doing the tests in the first place.

The second concern is whether this same error was made for other writers. The answer seems to be "no": the other writers use a different structure for setValueCount(). As the comment in BitColumnWriter says, it is the only vector that defers to the mechanism in the vector's own mutator. The result is that the bit vector code differs from the other vectors which handle the entire write process themselves.

rymarm commented 1 month ago

@paul-rogers thank you so much for the review! I found and fixed one mistake which the failed tests highlighted already and will write additional tests for the cases you described.

I agree that test-driven development is much safer and better than a simple fix without test coverage of the affected place. It is just hard to me to write tests for Drill, they looks complicated and not isolated to me. But I'll write them.

paul-rogers commented 1 month ago

@rymarm, thanks for looking at the tests. While writing tests for Drill can be hard, we developed a bunch of tools for particular cases. In the case of the vector writers, those tests I referenced above should show you what to do.

What I did, back when I wrote this stuff, is to run the tests in my IDE. I've not had much luck in recent years running the whole unit test suite locally. But, I can still generally run individual unit tests. A handy way to proceed is to first create a new file that has the test frameworks included. Then, write one simple test that just writes to the vector and reads back the data. At that point, it should be clear how to create other cases.

It would be great to temporarily revert your change and create a case that fails, then restore the change and ensure things work.

Please let me know if creating the tests takes more than a small amount of time. I can perhaps help out by creating a basic bit vector use case (if I can remember how all that code works!)

cgivre commented 1 month ago

Just an FYI, but the Github actions are not working well at the moment. The two that use Hadoop 2 seem to have some connectivity issue with Hive. @jnturton is working on it.