Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Allow add_metadata methods for unigrams #68

Open bmschmidt opened 9 years ago

bmschmidt commented 9 years ago

What we can already do

Currently, it's extremely easy to add a new field pegged to an existing metadata variable: you just create a tsv file like the following (pretend the multiple spaces are tabs) and save it as continentLookup.tsv:

country      continent
USA          North America
UK           Europe
Japan        Asia
...

And then run `python OneClick.py supplementMetadataFromTSV continentLookup.tsv`

After that runs, you'll have a new queryable field called "continent" that can be used for anything any other metadata field can do. It is assigned to every document whose existing country value appears in the table you created.
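
As a rough illustration of what that step amounts to, here is a minimal Python sketch of the join. It is not how OneClick.py is implemented: the `documents` list, the `supplement_metadata` function, and the in-memory TSV string are all hypothetical stand-ins for data that would really live in the database.

```python
import csv
import io

# Hypothetical stand-ins for document metadata already in the database.
documents = [
    {"id": 1, "country": "USA"},
    {"id": 2, "country": "Japan"},
    {"id": 3, "country": "Brazil"},  # has no row in the lookup file
]

# Same content as continentLookup.tsv above, with real tabs.
lookup_tsv = "country\tcontinent\nUSA\tNorth America\nUK\tEurope\nJapan\tAsia\n"


def supplement_metadata(docs, tsv_text):
    """Attach the TSV's second column to each doc that matches on its first column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    anchor_field, new_field = next(reader)  # header row names both fields
    mapping = {row[0]: row[1] for row in reader if len(row) >= 2}

    supplemented = []
    for doc in docs:
        doc = dict(doc)  # copy so the input list is left untouched
        key = doc.get(anchor_field)
        if key in mapping:
            doc[new_field] = mapping[key]
        supplemented.append(doc)
    return supplemented


result = supplement_metadata(documents, lookup_tsv)
```

Documents whose anchor value is missing from the lookup (the third one here) simply do not get the new field.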

What this looks like for words

For words, the analogue is pretty straightforward: you should be able to upload a list whose first column is either word or lowercase, and whose second column attaches a new score or label to each word.

So you could upload something like, for instance,

word         continent_token
USA          North America
UK           Europe
Japan        Asia
...

And then run a search for "continent_token" that would aggregate the word counts for every token matching whatever continent you specified.
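
To make the intended aggregation concrete, here is a small Python sketch. It only illustrates the idea, not an existing Bookworm feature: the `unigram_counts` rows, the `continent_token` dictionary, and the `aggregate_by_word_field` function are hypothetical.

```python
from collections import Counter

# Hypothetical word-level lookup, as it might be loaded from the uploaded list.
continent_token = {"USA": "North America", "UK": "Europe", "Japan": "Asia"}

# Hypothetical unigram counts as (document id, word, count) rows.
unigram_counts = [
    (1, "USA", 4),
    (1, "Japan", 1),
    (2, "UK", 3),
    (2, "USA", 2),
    (2, "the", 50),  # not in the lookup, so it is ignored
]


def aggregate_by_word_field(counts, word_field):
    """Sum word counts under the label that the lookup assigns to each word."""
    totals = Counter()
    for _doc_id, word, count in counts:
        label = word_field.get(word)
        if label is not None:
            totals[label] += count
    return totals


totals = aggregate_by_word_field(unigram_counts, continent_token)
```

Restricting to a single continent is then just a matter of reading one key, for example `totals["Asia"]`.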

This is particularly powerful in a case like simpleminded sentiment analysis:

word         token_sentiment
yes          positive
no           negative
ugly         negative
...

or the creation of constrained vocabularies from something like wordnet. (Some of the PLOSone research by physicists on Ngrams takes this kind of approach.)

It would also let users combine a single restriction with a grouping on "unigram" to return relatively rich vector data without risking a tsv file millions of lines long. You could simply request everything that a master list judges to be a "color," for example, rather than having to use the API to cook up your own list of colors every time, if that's determined to be a useful category. This could be quite useful for placenames mentioned in text because, e.g., it would allow you to return a map not just of metadata but of the placenames mentioned.

One possible extension

The topic modeling algorithm currently does something similar to this, but also adds yet another layer: it actually precalculates the totals for each topic, which makes aggregate analysis feasible in user time. The SQL API then needs to know which table to look things up in, which would be another feature upgrade for it.

For the sentiment analysis use case in particular, this would be quite useful.
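
A minimal Python sketch of that precalculation idea, applied to a sentiment lookup, might look like the following. It is an assumption about the general approach rather than a description of the topic modeling code: the `token_sentiment` dictionary, the `unigram_counts` rows, and both functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical word-level sentiment lookup.
token_sentiment = {"yes": "positive", "no": "negative", "ugly": "negative"}

# Hypothetical unigram counts as (document id, word, count) rows.
unigram_counts = [
    (1, "yes", 3),
    (1, "no", 1),
    (2, "ugly", 2),
    (2, "yes", 1),
    (2, "the", 40),  # not in the lookup, so it is ignored
]


def precalculate_totals(counts, word_field):
    """Build a (doc_id, label) -> count table once, at load time."""
    table = defaultdict(int)
    for doc_id, word, count in counts:
        label = word_field.get(word)
        if label is not None:
            table[(doc_id, label)] += count
    return dict(table)


# Computed once; later queries read this small table instead of rescanning words.
precomputed = precalculate_totals(unigram_counts, token_sentiment)


def total_for(label, doc_ids):
    """Answer an aggregate query from the precomputed table only."""
    return sum(precomputed.get((doc_id, label), 0) for doc_id in doc_ids)
```

The trade-off is one extra table per word-level field in exchange for queries that touch one row per document and label rather than one row per word.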