Bookworm-project / BookwormDB

Tools for text tokenization and encoding

Use random instead of sequential bookids. #103

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

A niche case for some extensions is to use a bookworm word count index as an input to a batched machine learning process. I sometimes want to train a neural network on word counts, for example.

A SELECT * FROM master_bookcounts query will provide this data, sorted by bookid. Most large text corpora have some pre-specified order. For example, Chronicling America is arranged by newspaper, and then by year within each newspaper. This makes learning less effective, since each batch should be as representative of the whole corpus as possible. A gradient descent algorithm might lurch quite far off in the wrong direction, or even (gasp) settle into a poor local minimum.

If bookids are randomly assigned rather than assigned sequentially, there's a workaround: the primary table keys will let SELECT * FROM master_bookcounts ORDER BY bookid be delivered nearly as fast as a table scan, but the rows will now come back in a random order.

Since bookids are arbitrary and internal, I don't see a downside here. It's possible, though, that index construction might take longer if the tables aren't sorted to begin with. So that's something to watch out for.

Random assignment might create problems for additions of new bookids down the road. To leave room, the random ids should be drawn from the full range [1, 16777215], the maximum bookid supported under the standard scheme, not just from [1, catalog_length].
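
For concreteness, here's a rough sketch of what the consuming side could look like once bookids are random. This isn't code from the repo: the connection details, the wordid/count column names, and the training hook are placeholders, and pymysql's streaming SSCursor just stands in for whatever client you'd actually use.

```python
import pymysql
import pymysql.cursors

# Placeholder connection; host and credentials are illustrative only.
conn = pymysql.connect(host="localhost", user="bookworm",
                       password="bookworm", db="bookworm")

BATCH_SIZE = 4096

def train_on_batch(rows):
    """Hypothetical hook: feed one shuffled batch to the model."""
    ...

# Because bookids were assigned at random, walking the primary key in order
# returns the documents in an effectively shuffled sequence, at roughly the
# speed of a table scan; each fetched batch is then a rough random sample.
with conn.cursor(pymysql.cursors.SSCursor) as cur:  # unbuffered, streaming
    cur.execute("SELECT bookid, wordid, count "
                "FROM master_bookcounts ORDER BY bookid")
    while True:
        rows = cur.fetchmany(BATCH_SIZE)
        if not rows:
            break
        train_on_batch(rows)
```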

organisciak commented 8 years ago

Seems unnecessarily complex for a very specific use. I've been training based on seeded hashes; could something like that be implemented for random but predictable ordering? With your approach, would you check for id collisions, or just make the id space large enough that collisions can reasonably be expected not to happen?
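
A minimal sketch of what I mean by seeded-hash ordering (the keyed blake2b and the string ids below are just illustrative, nothing bookworm-specific): sort the document identifiers by a hash keyed on the seed, which gives a random-looking but fully reproducible order without touching id assignment at all.

```python
import hashlib

def hash_order(doc_ids, seed="epoch-0"):
    """Return doc_ids in a deterministic pseudo-random order for this seed."""
    key = seed.encode("utf-8")
    return sorted(doc_ids,
                  key=lambda d: hashlib.blake2b(str(d).encode("utf-8"),
                                                key=key).digest())

# Re-running with the same seed reproduces the order; a new seed reshuffles.
shuffled = hash_order(["doc-a", "doc-b", "doc-c"], seed="epoch-3")
```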

bmschmidt commented 8 years ago

The numbers here are small enough that hashing isn't needed for assignment.

Essentially, right now, the books are assigned an id by popping off the list [1,2,3,...,16777215]. I was thinking of just materializing that full list in memory and doing a Fisher-Yates shuffle or something on it before running the id assignment. That would guarantee no collisions.

It should be possible to run that shuffle with a seed, though I'm not sure what the benefit of that would be.
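
Roughly, the change would look like the sketch below: materialize the full id range, shuffle it in place (Python's random.shuffle is a Fisher-Yates shuffle), and pop ids off as before. The seed value and the helper name are placeholders rather than existing code.

```python
import random

MAX_BOOKID = 16777215  # 2**24 - 1, the largest bookid the current scheme allows

rng = random.Random(42)                   # optional seed, for reproducibility
id_pool = list(range(1, MAX_BOOKID + 1))  # the full range, not 1..catalog_length
rng.shuffle(id_pool)                      # in-place Fisher-Yates shuffle

def next_bookid():
    """Hypothetical helper: hand out the next pre-shuffled id (no collisions possible)."""
    return id_pool.pop()
```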

If using a four-byte integer (as they do in Chicago, I've heard), the assignment would have to check for collisions against the already-assigned documents. That's OK, because id assignment has never been a thread-safe (or thread-demanding) process.
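
For the four-byte case, a sketch of that collision check (again just illustrative; where the set of already-used ids would live is an open question): draw candidates at random and retry on a hit, which stays cheap while the id space is sparse and is fine in a single-threaded assignment step.

```python
import random

MAX_BOOKID_4BYTE = 2**32 - 1

def assign_random_bookid(used_ids, rng=None):
    """Draw random ids until one is unused; collisions are rare while the space is sparse."""
    rng = rng or random.Random()
    while True:
        candidate = rng.randint(1, MAX_BOOKID_4BYTE)
        if candidate not in used_ids:
            used_ids.add(candidate)
            return candidate

# e.g.: used = set(); new_id = assign_random_bookid(used)
```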