Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Cleaning database doesn't strip out fixed lengths for categorical variables. #37

Closed: bmschmidt closed this issue 10 years ago

bmschmidt commented 10 years ago

Say you test-build your bookworm with 100 files, and there are 50 Library of Congress subjects in those hundred files.

CreateDatabase.py will assign LCSH__id a TINYINT type when it builds a fast lookup table.

But then you build it with the whole thing: 1000 files, say. And now there are 300 Library of Congress subjects.

CreateDatabase.py doesn't drop all the original tables when it loads your new data. Now you need 300 LCSH__id identifiers, but you've been locked into a format that only allocates 128 spots for them.
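To make the failure concrete, here's a minimal sketch of the kind of type-sizing logic involved; the function name and thresholds are illustrative, not the actual CreateDatabase.py code. The type that fits the test build gets baked into the lookup table and is never widened when the full build needs more ids.

```python
# Hypothetical sketch, not the actual CreateDatabase.py logic: pick the
# narrowest MySQL integer type that can hold the category ids seen so far.

def smallest_int_type(n_categories):
    """Return a MySQL integer type wide enough for n_categories distinct ids."""
    if n_categories <= 127:        # signed TINYINT tops out at 127
        return "TINYINT"
    if n_categories <= 32767:      # signed SMALLINT
        return "SMALLINT"
    if n_categories <= 8388607:    # signed MEDIUMINT
        return "MEDIUMINT"
    return "INT"

print(smallest_int_type(50))   # TINYINT  -- what the 100-file test build locks in
print(smallest_int_type(300))  # SMALLINT -- what the full 1000-file build would need
```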

Solution: Drop the tables altogether on every build? Dynamically rebuild them based on the new information? It's a tricky call, I think.
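For the "dynamically rebuild" option, one possible shape is to check the existing column's type against the new category count and widen it when it no longer fits. This is a sketch assuming a MySQL backend and a DB-API cursor; the function and capacity table here are hypothetical, not existing BookwormDB code.

```python
# Hypothetical sketch of the "dynamically rebuild" option, not BookwormDB code.
# Signed upper bounds for the MySQL integer types, narrowest first.
INT_CAPACITY = {"tinyint": 127, "smallint": 32767,
                "mediumint": 8388607, "int": 2147483647}

def widen_if_needed(cursor, table, column, n_categories):
    """Widen table.column to an integer type that can hold n_categories ids."""
    cursor.execute(
        "SELECT DATA_TYPE FROM information_schema.COLUMNS "
        "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = %s AND COLUMN_NAME = %s",
        (table, column),
    )
    current_type = cursor.fetchone()[0]
    if n_categories > INT_CAPACITY.get(current_type, 0):
        new_type = next(t for t, cap in INT_CAPACITY.items() if cap >= n_categories)
        # Identifiers can't be parameterized, so table/column must be trusted here.
        cursor.execute("ALTER TABLE {} MODIFY {} {}".format(table, column, new_type.upper()))
```

Widening in place only fixes the column width, though; it doesn't address whatever is already sitting in the old tables, which is part of what makes dropping them outright attractive.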

bmschmidt commented 10 years ago

I've just added a new target to the Makefile: in addition to make clean there's now make pristine. Not a complete fix, but good enough for now.
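For the gist of the difference between the two targets, here is a minimal sketch; the paths, variable, and recipes below are placeholders, not the actual BookwormDB Makefile. The point is that clean only removes intermediate build files, while pristine also drops the database, so the lookup tables (and their integer widths) are recreated from scratch on the next build.

```makefile
# Hypothetical sketch, not the real BookwormDB Makefile.
clean:
	rm -rf texts/encoded metadata/processed

pristine: clean
	mysql -e "DROP DATABASE IF EXISTS $(DB_NAME)"
```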