attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0

Consider allowing internal "repeat<count>" kind for indexed collections #3743

Open willhite opened 6 years ago

willhite commented 6 years ago

I'm building up taxi data and importing data in column major order. When lists get longer than ~33 million entries, I consistently get a panic when calling db.WriteValue on a newly concatenated list.

The stack was produced by the code in #3742, which hasn't landed in master yet.

    Error Trace:    try.go:99
                try.go:58
                store.go:137
                value_store.go:243
                value_store.go:297
                value_store.go:217
                importer.go:213
                struct.go:210
                importer.go:215
                proc.go:185
                asm_amd64.s:2197
Error:          Expected false

    goroutine 1 [running]:
    github.com/attic-labs/noms/go/d.PanicIfFalse(0xc420192500)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/d/try.go:58 +0xa0
    github.com/attic-labs/noms/go/nbs.(*NomsBlockStore).Put(0xc4201925a0, 0xacc61c6e45a54e62, 0xf8e01704be5f5980, 0x5b6b469e, 0xc5bb29a000, 0x10b07606, 0x20000000)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/nbs/store.go:137 +0x137
    github.com/attic-labs/noms/go/types.(*ValueStore).bufferChunk.func1(0xacc61c6e45a54e62, 0xf8e01704be5f5980, 0xc45b6b469e, 0xacc61c6e45a54e62, 0xf8e01704be5f5980, 0x5b6b469e, 0xc5bb29a000, 0x10b07606, 0x20000000)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/types/value_store.go:243 +0x69
    github.com/attic-labs/noms/go/types.(*ValueStore).bufferChunk(0xc449416000, 0x179c900, 0xc4d9769700, 0xacc61c6e45a54e62, 0xf8e01704be5f5980, 0x5b6b469e, 0xc5bb29a000, 0x10b07606, 0x20000000, 0x1)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/types/value_store.go:297 +0x443
    github.com/attic-labs/noms/go/types.(*ValueStore).WriteValue(0xc449416000, 0x179c900, 0xc4d9769700, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/types/value_store.go:217 +0x2aa
    main.main.func3(0xc50c80e378, 0x8, 0x179ca80, 0xc5542ba630)
        /home/ec2-user/go/src/github.com/attic-labs/noms/samples/go/csv/csv-import/importer.go:213 +0x399
    github.com/attic-labs/noms/go/types.Struct.IterFields(0x1793500, 0xc449416000, 0xc443d94454, 0x789, 0x789, 0xc47685bb50)
        /home/ec2-user/go/src/github.com/attic-labs/noms/go/types/struct.go:210 +0x117
    main.main()
        /home/ec2-user/go/src/github.com/attic-labs/noms/samples/go/csv/csv-import/importer.go:215 +0x15fc

cmasone-attic commented 6 years ago

I won't be back to look at this until Tuesday, so hopefully @rafael-atticlabs can look before then?

On Wed, Sep 27, 2017 at 1:50 AM Dan Willhite notifications@github.com wrote:

Assigned #3743 https://github.com/attic-labs/noms/issues/3743 to @cmasone-attic https://github.com/cmasone-attic.


ghost commented 6 years ago

I suspect that what's happening here is that you are trying to write a single chunk which is larger than memTableSize. Can you try putting some debug code here:

https://github.com/attic-labs/noms/blob/master/go/nbs/store.go#L153

like

    nbs.mt = newMemTable(nbs.mtSize)
    retval := nbs.mt.addChunk(h, data)
    if !retval {
      fmt.Println("Maybe chunk is too big", nbs.mtSize, len(data))
    }

willhite commented 6 years ago

@rafael Yikes. I just ran it with the debug code. I got the same stack, and this printed out:

    Maybe chunk is too big 268435456 280000006
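
(For scale, since the debug line prints nbs.mtSize and then len(data): 268435456 bytes is exactly 256 MiB of memTable, while the chunk is 280000006 bytes, roughly 267 MiB, so it can never fit even in a freshly emptied memTable.)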

ghost commented 6 years ago

Yup. Changing title and making P0.

ghost commented 6 years ago

More explanation: For columnar storage of tables, it's possible to encode a giant list (e.g. 1 billion elements) of the same value (e.g. "yellow"). The adaptive compression does too good a job on this: it results in a chunk that compresses down to, e.g., a couple of megs, but whose uncompressed size is beyond the memTable limit in nbs.

This can only occur with Lists. The proposal here is to push the compression "up", specifically for this narrow case (lots of repeated, equal values). This prevents the decoded size from becoming a problem.

In theory, it's possible to create a degenerate case (e.g. <1, 2, 1, 2, ... 1, 2>) which would still fail, but I think it's extremely unlikely.
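
To make the compression ratio concrete, here's a tiny standalone Go sketch, roughly the shape of the failing import (40 million copies of a short value, ~280 MB laid out flat). snappy is used purely as a stand-in compressor for illustration; the exact codec isn't the point, since the memTable limit applies to the uncompressed encoding:

    package main

    import (
        "bytes"
        "fmt"

        "github.com/golang/snappy"
    )

    func main() {
        // ~280 MB of the same short value, uncompressed.
        raw := bytes.Repeat([]byte("yellow,"), 40000000)
        compressed := snappy.Encode(nil, raw)
        fmt.Printf("uncompressed: %d bytes, compressed: %d bytes\n",
            len(raw), len(compressed))
        // The compressor handles the repeats fine; the problem is that the
        // uncompressed encoding has to fit in a single memTable.
    }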

ghost commented 6 years ago

@arv, can you think about what the implementation of this would look like?

ghost commented 6 years ago

FWIW, I believe that various columnar DB engines do something like this, for example to avoid decoding a massive column while computing AVG, when the answer can be read off directly once you know the column is just a million copies of the value 3.
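
As a tiny, self-contained illustration of that trick (nothing here is noms-specific; run and average are made-up names): aggregate over (value, count) runs without ever materialising the column.

    package sketch

    // run is one (value, count) pair of a run-length-encoded column.
    type run struct {
        value float64
        count int64
    }

    // average computes AVG over the runs directly, without decoding them.
    func average(runs []run) float64 {
        var sum float64
        var n int64
        for _, r := range runs {
            sum += r.value * float64(r.count)
            n += r.count
        }
        return sum / float64(n)
    }

    // average([]run{{value: 3, count: 1000000}}) == 3, in one loop iteration.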

arv commented 6 years ago

The main issue is with the offsets, or, to be more precise, how to map an index to a location in the buffer.

I'll have to think about this a bit but it most likely involves some binary searching...

ghost commented 6 years ago

Yeah, that was my thought as well. One approach would be to have a "bit" on the sequence which says whether the sequence has any repeats. If so, locating an item by index goes down a different code path.
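
Roughly what I'd picture for that bit plus the separate lookup path. This is only a sketch; indexedSeq, runTable and getItem are made-up names, not the real sequence API:

    package sketch

    import "sort"

    // Value stands in for types.Value.
    type Value interface{}

    // runTable is consulted only when the sequence has repeats.
    type runTable struct {
        ends   []uint64 // ends[i] is the logical index just past run i (sorted)
        values []Value  // one stored item per run
    }

    type indexedSeq struct {
        hasRepeats bool
        items      []Value   // normal path: one entry per logical index
        runs       *runTable // repeat path
    }

    func (s *indexedSeq) getItem(idx uint64) Value {
        if !s.hasRepeats {
            return s.items[idx]
        }
        // Different code path: binary search for the run containing idx,
        // i.e. the first run whose end exceeds idx.
        i := sort.Search(len(s.runs.ends), func(i int) bool { return s.runs.ends[i] > idx })
        return s.runs.values[i]
    }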

ghost commented 6 years ago

Thinking about this more, there are three main places that are kind of "at risk" because of a change like this:

- Accessing sequence items by index
- The sequence cursor
- The sequence chunker

I think one way to think about this is to make it an internal implementation detail of indexed_sequence, one that users of the API are unaware of. I think this would involve changing sequence_cursor's and sequence_chunker's use of the sequence API a bit.

kalman commented 6 years ago

SGTM, but it would be really nice if normal sequences weren't affected by this change.

One way to add this "bit" would be for the count field to be negative -n, signifying that there are n items in the sequence but a "repetition index" is to follow, which itself is a list of (offset, count) tuples prefixed by the number of tuples.

For example, say we're serializing ["a", "b", "c"], which has no repeats. That would be:

[
  5 /* kind = list */,
  0 /* level */,
  3 /* count */,
  3 /* kind = string */, "a", 3, "b", 3, "c",
]

Now say that we're serializing ["a", "b", "b", "c", "c"] which has repeats. That would be:

[
  5  /* kind = list */,
  0  /* level */,
  -5 /* count, but with repeats */,
  2  /* number of repeat entries */,
  1  /* offset */, 2 /* count */, 2, 2,
  3  /* kind = string */, "a", 3, "b", 3, "c",
]

(It's unclear to me whether this logic should apply to meta sequences as well. The reason would only be for consistency, since meta sequences are highly unlikely to benefit from this. OTOH the code complexity is my worry, but I haven't looked into it deeply yet.)
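
For what it's worth, reading that header back could look roughly like the following. This is only a sketch: the toy decoder reads from a slice of ints rather than the real binary value decoder, and all names are made up.

    package sketch

    // decoder is a toy reader over the numeric prefix of the layouts above.
    type decoder struct {
        toks []int64
        pos  int
    }

    func (d *decoder) next() int64 { v := d.toks[d.pos]; d.pos++; return v }

    type repeatEntry struct{ offset, count uint64 }

    type listHeader struct {
        level   uint64
        length  uint64        // logical element count, always positive
        repeats []repeatEntry // nil when the count field was non-negative
    }

    func readListHeader(d *decoder) listHeader {
        _ = d.next() // kind = list
        h := listHeader{level: uint64(d.next())}
        count := d.next()
        if count >= 0 {
            h.length = uint64(count)
            return h
        }
        // Negative count: -count logical items, then the repetition index.
        h.length = uint64(-count)
        n := int(d.next())
        h.repeats = make([]repeatEntry, n)
        for i := range h.repeats {
            h.repeats[i] = repeatEntry{offset: uint64(d.next()), count: uint64(d.next())}
        }
        return h
    }

Feeding it the numeric prefix of the second example, {5, 0, -5, 2, 1, 2, 2, 2}, gives length 5 with repeat entries (1, 2) and (2, 2).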

Solutions to the risks that @rafael-atticlabs mentions (I agree, this is what needs to change):

ghost commented 6 years ago

I think I understand the encoding you are proposing. What's the advantage of this encoding relative to adding a Repeat<Count> kind?

kalman commented 6 years ago

We wouldn't need to reason about every place we switch on Kind. Changing the encoding seemed more targeted / less risky (plus it works for repeated blob values, for what it's worth). It seems more like an encoding trick than a feature of the type system.

OTOH adding a Kind is neat. I'll think for a bit...

kalman commented 6 years ago

E.g. it seems like people might be doing things like if someValue.Kind() == ListKind to detect a list, and this would break, unless I'm misunderstanding?

kalman commented 6 years ago

(had an offline convo with @rafael-atticlabs and started implementing the Repeat<Count> thing)

OK, I got enough of the way through to make a couple of informed observations, but basically, rechunking is subtle. Say that you have:

ABBBBBC

and encode that repeated BBBBB some way. Then add a B in the middle. Using our current encoding this rewinds the chunker to the start of the chunk (A), adds all those Bs, then the new B, then the rest of the Bs and the C. If the new B creates a new chunk boundary, then... etc.

However, using some kind of "repeat" encoding could affect more than just the modification position:

  1. If we maintain an index, then the start of every chunk will change, which will affect the rest of the chunk, which may require re-writing the index, thus changing the chunking... argh! Possible solution: don't include the index as part of the hash, but this seems weird and unprecedented.
  2. If we don't maintain an index and do everything inline, then the encoding becomes interesting. I think you're basically forced to always encode a count before each value, even if it's 1, i.e. 1A5B1C in the above example. Modification is then just incrementing or decrementing a number, which is a localised change and much easier to reason about (see the sketch below).
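
A rough sketch of that inline shape and the localised edit (illustrative names only, string values for simplicity):

    package sketch

    // rlePair is one inline (count, value) element, the "1A5B1C" shape from
    // option 2: every value carries a count, even when it's 1.
    type rlePair struct {
        count uint64
        value string
    }

    // insertAt adds one copy of v at logical index idx. When v matches the run
    // it lands in, the edit is just a count increment, i.e. a localised change.
    func insertAt(runs []rlePair, idx uint64, v string) []rlePair {
        var seen uint64
        for i := range runs {
            end := seen + runs[i].count
            if idx <= end && runs[i].value == v {
                runs[i].count++ // e.g. 5B becomes 6B when adding a B to ABBBBBC
                return runs
            }
            if idx < end {
                break // inserting a different value mid-run would split it; omitted here
            }
            seen = end
        }
        // Splitting / appending cases omitted; the common case above is the point.
        return runs
    }

Starting from {1 A}{5 B}{1 C} and inserting a B at index 3 just bumps the middle count to 6; no neighbouring data moves.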

Proposal follows in the next comment...

kalman commented 6 years ago

This is starting to feel weird (complex to reason about, complex to implement, overly specific) - so proposal:

Bail on this bug, and ask that users of the Noms API be aware that a huge number of repeated values might cause an issue, and that they need to mitigate this. E.g. they could maintain this "repeat" style encoding themselves. Or they could maintain an out-of-line index (something we can't do at all). Etc.

Is this something that we can update our CSV samples to do? It's basically only things that are generated by csv-invert, right?

ghost commented 6 years ago

After some digging, I'm pretty sure that doing this is required if we want to support very long lists of repeated values and not have updates early in the repeated sequence cause cascading chunk writes all the way to the end.

kalman commented 6 years ago

Which part?
