BookwormDB is the main code repository for the Bookworm project. Given simply formatted files and metadata, it creates an efficient and easily queryable MySQL database that can make full use of all the metadata and lexical data in the original source. It also includes a powerful API for asking a variety of unigrammatic queries about that data.
A quick walkthrough is included below; other documentation is at bookworm.culturomics.org and in the Bookworm Manual on this repository.
Installation is tested on Ubuntu and OS X. It may work on other Unixes, but will probably not work on Windows.
Navigate to the repository folder in a terminal and type
pip install .
Then type
bookworm --help
to confirm the executable has worked. If this doesn't work, file a bug report.
Next, type
bookworm config mysql
for some interactive prompts to allow Bookworm to edit MySQL databases on your server. (Note that this makes some other changes to your MySQL configuration files; you may want to copy them first if you're using the server for other things.)
The master branch is regularly tested on Travis; you are generally best off installing the latest version.
This builds a database and implements the Bookworm API on a particular set of texts.
Some basic, widely appealing visualizations of the data are possible with the Bookworm web app, which runs on top of the API.
A more wide-ranging set of visualizations is available built on top of D3 in the Bookworm D3 package. If you're looking to develop on top of Bookworm, that presents a much more flexible set of tools.
Here are a couple of Bookworms built using BookwormDB:
We're working on Docker containerization. Help appreciated. Contact bs 145 at nyu dot edu, no spaces involved.
You must have a MySQL database set up that you can log into with admin access, probably with a my.cnf file at ~/.my.cnf. Depending on your platform, this can be a little tricky to set up.
Bookworm will automatically create a select-only user that handles web queries, preventing any malicious actions through the API.
There is a command, bookworm config mysql, that will interactively update certain settings in your global my.cnf. It may need to be run with admin privileges.
Bookworm by default tries to log on with admin privileges with the following preferences:
[client]
host = 127.0.0.1
user = root
password = ''
But it also looks in several locations (~/etc/my.cnf, ~/etc/.my.cnf, and /etc/bookworm/admin.cnf) for other passwords. (I don't have an empty root password on my local MySQL server!) For each of those files that exists, in that order, it updates the host, user, and password with the values it finds, so later files override earlier ones.
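The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the override logic, not Bookworm's actual implementation: the function name resolve_client_settings is hypothetical, and the sketch assumes each file is a plain INI file with a [client] section.

```python
import configparser
import os

# Defaults and search locations as listed in the text above.
DEFAULTS = {"host": "127.0.0.1", "user": "root", "password": ""}
CANDIDATE_FILES = ["~/etc/my.cnf", "~/etc/.my.cnf", "/etc/bookworm/admin.cnf"]


def resolve_client_settings(paths=CANDIDATE_FILES, defaults=DEFAULTS):
    """Hypothetical helper: later files override earlier ones, key by key."""
    settings = dict(defaults)
    for path in paths:
        parser = configparser.ConfigParser(
            allow_no_value=True, strict=False, interpolation=None
        )
        try:
            # read() returns the list of files it could open and parse.
            read_ok = parser.read(os.path.expanduser(path))
        except configparser.Error:
            # Skip files that are not plain INI (an assumption of this sketch).
            continue
        if not read_ok or not parser.has_section("client"):
            continue
        for key in settings:
            if parser.has_option("client", key):
                settings[key] = parser.get("client", key)
    return settings


if __name__ == "__main__":
    print(resolve_client_settings())
```

Comparing the printed result with the output of bookworm config mysql-info is one way to check which file is supplying a given value.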
The command bookworm config mysql-info shows you what password and host it's trying to use.
In addition to the username and password, the host matters as well. Depending on setup, 'localhost' and '127.0.0.1' mean different things to MySQL: the former connects through a Unix socket file, the latter through a TCP port. Depending on exactly how you're invoking mysql, you may need to use one or the other to communicate. For instance, your root account might not have login privileges through 127.0.0.1, just at localhost; it depends on exactly how the server is invoked.
To debug MySQL permissions issues, type
mysql -u $USER -h 127.0.0.1 -p
at the prompt and enter your password. Once you have confirmed that this brings up a mysql prompt that can grant privileges, copy those credentials into a file at ~/.my.cnf (or, if you're able, /etc/bookworm/admin.cnf) in the format given by bookworm config mysql-info (or the block above).
This distribution also includes two files, general_api.py and SQLapi.py, which together constitute an implementation of the API for Bookworm, written in Python. It primarily implements the API on a MySQL database now, but includes classes for more easily implementing it on top of other platforms (such as Solr).
It is used with the Bookworm GUI and can also be used as a standalone tool to query data from your database.
To run the API in its most basic form, type
bookworm query $string
where $string is a JSON-formatted query. In general, query performance will be faster over Bookworm's API process, which you can start by typing
bookworm serve
and querying over port 10012.
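Writing the JSON query by hand on the command line is error-prone because of shell quoting. A minimal sketch of one way to build the command, assuming the query fields used in the example URL later in this walkthrough (database, method, format, groups, counttype); the helper name build_query_command is hypothetical:

```python
import json
import shlex


def build_query_command(database, groups, counttypes):
    """Return a shell-safe `bookworm query` command line for a JSON query."""
    query = {
        "database": database,
        "method": "data",
        "format": "csv",
        "groups": groups,
        "counttype": counttypes,
    }
    # shlex.quote wraps the JSON so the shell passes it as a single argument.
    return "bookworm query " + shlex.quote(json.dumps(query))


if __name__ == "__main__":
    print(build_query_command("txtlab450", ["date", "language"], ["TextCount", "WordCount"]))
```

The printed line can be pasted into a terminal in the bookworm's directory.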
While the point of the command-line tool bookworm is generally to create a Bookworm, the point of the API is to retrieve results from it.
For a more interactive explanation of how the GUI works, see the Vega-Bookworm project sandbox.
These are some instructions on how to build a bookworm.
We'll use a collection of 450 novels in 3 languages:
Piper, Andrew (2016): txtlab Multilingual Novels. figshare.
wget https://ndownloader.figshare.com/files/3686805
wget https://ndownloader.figshare.com/files/3686778
unzip 3686778
For this set, a simple Python script suffices to build the two needed files from the txtlab files. Paste this into parse.py.
import pandas as pd
import json

output = open("input.txt", "w")
catalog = open("jsoncatalog.txt", "w")

for book in pd.read_csv("3686805").to_dict(orient="records"):
    try:
        fulltext_lines = open(f"2_txtalb_Novel450/{book['filename']}").readlines()
        # Bookworm reserves newline and tab characters, so they are stripped before writing.
        fulltext = "\f".join(fulltext_lines)
        fulltext = fulltext.replace("\r", " ").replace("\n", " ").replace("\t", " ")
        # Use the numeric id as the unique identifier shared by both output files.
        book['filename'] = str(book['id'])
        output.write(f"{book['filename']}\t{fulltext}\n")
        book['searchstring'] = book['title'] + ' ' + book['author']
        catalog.write(json.dumps(book) + "\n")
    except FileNotFoundError:
        # This dataset has errors!
        continue

output.close()
catalog.close()
python parse.py
Create a bookworm.cnf file in the current directory. (This isn't always necessary; usually Bookworm can just infer the database name from your current directory.)
printf "[client]\ndatabase=txtlab450\n" > bookworm.cnf
bookworm init
bookworm build all
input.txt
In this format, each line consists of the file's unique identifier, followed by a tab, followed by the full text of that file. Note that you'll have to strip out all newlines and returns from original documents. In the event that an identifier is used twice, behavior is undefined.
By changing the makefile, you can also do some more complex substitutions. (See the metadata parsers for an example of a Bookworm that directly reads hierarchical, bzipped directories without decompressing first).
jsoncatalog.txt, with one JSON object per line (the "newline-delimited JSON" format).
The keys represent shared metadata for each file; the values represent the entry for that particular document. There should be no newline or tab characters in this file.
In addition to the metadata you choose, two fields are required:
A searchstring field that contains valid HTML which will be served to the user to identify the text.
A filename field that includes a unique identifier for the document (linked to the filename or the identifier, depending on your input format).
Note that the python script above does both of these at once.
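Because duplicate identifiers lead to undefined behavior and stray tabs break the format, it can be worth checking both files before building. The following is a minimal sketch of such a check, not part of Bookworm itself; the function name validate_inputs is hypothetical, and it assumes the identifier in input.txt matches the filename field in jsoncatalog.txt, as in the script above.

```python
import json


def validate_inputs(text_path="input.txt", catalog_path="jsoncatalog.txt"):
    """Return a list of human-readable problems found in the two input files."""
    problems = []
    text_ids = set()
    with open(text_path, encoding="utf-8") as texts:
        for n, line in enumerate(texts, start=1):
            # Each line should be: identifier, one tab, then the full text.
            identifier, tab, _body = line.rstrip("\n").partition("\t")
            if not tab or not identifier:
                problems.append(f"{text_path}:{n}: expected 'id<TAB>text'")
                continue
            if identifier in text_ids:
                problems.append(f"{text_path}:{n}: duplicate id {identifier!r}")
            text_ids.add(identifier)

    catalog_ids = set()
    with open(catalog_path, encoding="utf-8") as catalog:
        for n, line in enumerate(catalog, start=1):
            if "\t" in line:
                problems.append(f"{catalog_path}:{n}: contains a tab character")
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"{catalog_path}:{n}: not valid JSON")
                continue
            if not isinstance(record, dict):
                problems.append(f"{catalog_path}:{n}: expected a JSON object")
                continue
            for required in ("searchstring", "filename"):
                if required not in record:
                    problems.append(f"{catalog_path}:{n}: missing {required!r}")
            catalog_ids.add(str(record.get("filename")))

    # Every text should have a matching catalog entry.
    for missing in sorted(text_ids - catalog_ids):
        problems.append(f"no catalog entry for text id {missing!r}")
    return problems


if __name__ == "__main__":
    import os

    if os.path.exists("input.txt") and os.path.exists("jsoncatalog.txt"):
        for problem in validate_inputs():
            print(problem)
```

An empty result means the files passed these particular checks; it does not guarantee that the build will succeed.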
Now create a file called field_descriptions.json, which is used to define the type of each variable in jsoncatalog.txt.
Currently, you do have to include a searchstring definition in this file, but you should not include a filename definition.
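As a rough illustration, the file can be generated from Python. The layout below is an assumption that should be checked against the Bookworm Manual: a JSON array with one object per field, using the keys field, datatype, type and unique. The language and date fields are taken from the example query in this walkthrough, and the helper name write_field_descriptions is hypothetical.

```python
import json

# Assumed layout: one object per metadata field. The key names and values
# here are assumptions to verify against the Bookworm Manual.
FIELDS = [
    {"field": "searchstring", "datatype": "searchstring", "type": "text", "unique": True},
    {"field": "language", "datatype": "categorical", "type": "character", "unique": True},
    {"field": "date", "datatype": "time", "type": "integer", "unique": True},
]


def write_field_descriptions(fields, path="field_descriptions.json"):
    """Write the field list, enforcing the two rules stated in the text above."""
    names = [entry["field"] for entry in fields]
    if "searchstring" not in names:
        raise ValueError("a searchstring definition is required")
    if "filename" in names:
        raise ValueError("filename must not be defined in this file")
    with open(path, "w", encoding="utf-8") as out:
        json.dump(fields, out, indent=2)
```

Calling write_field_descriptions(FIELDS) in the bookworm's directory writes the file; the two checks only enforce the searchstring and filename rules, not the rest of the schema.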
For a first run, you just want to use bookworm init to create the entire database. (If you want to rebuild parts of a large bookworm, the metadata for example, that is also possible.)
bookworm init
This will walk you through the process of choosing a name for your database.
Then to build the bookworm, type
bookworm build all
Depending on the total number and average size of your texts, this could take a while. Sit back and relax.
Finally, you want to implement the API and see some results.
Type
bookworm serve
to start a process on port 10012 that responds to queries. This daemon must run continuously.
Then you can access query results over HTTP. Try visiting this page in a web browser:
http://localhost:10012/?q={%22database%22:%22txtlab450%22,%22method%22:%22data%22,%22format%22:%22csv%22,%22groups%22:[%22date%22,%20%22language%22],%22counttype%22:[%22TextCount%22,%22WordCount%22]}
Once this works, you can use various libraries to query the endpoint, or create an HTML page that builds off the endpoint. See the (currently underdeveloped) Bookworm-Vega repository for some examples.
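For instance, the same query can be issued from Python using only the standard library. This is a sketch under a few assumptions: the server started by bookworm serve is running on localhost:10012, and the csv format returns a header row followed by data rows. The helper names are hypothetical.

```python
import csv
import io
import json
import urllib.parse
import urllib.request


def build_query_url(query, host="localhost", port=10012):
    """URL-encode a query dict into the ?q= parameter used by the endpoint."""
    return f"http://{host}:{port}/?q=" + urllib.parse.quote(json.dumps(query))


def parse_csv(text):
    """Parse CSV text into a list of dicts (assumes the first row is a header)."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(query, **kwargs):
    """Send the query to a running Bookworm server and return parsed rows."""
    with urllib.request.urlopen(build_query_url(query, **kwargs)) as response:
        return parse_csv(response.read().decode("utf-8"))


QUERY = {
    "database": "txtlab450",
    "method": "data",
    "format": "csv",
    "groups": ["date", "language"],
    "counttype": ["TextCount", "WordCount"],
}
```

With the server running, fetch_rows(QUERY) should return one dict per date and language combination.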
Serving from localhost:10012 won't work especially well in production contexts. Heavy-duty web servers do rate limiting and other things that the gunicorn process Bookworm uses doesn't handle.
One strategy is to serve the web site (using bookworm-vega or something else) over port 80, while passing all cgi-requests through to port 10012 where the bookworm server handles them. (Note that this may disable other cgi services on that particular server.)
This means it's possible to run the bookworm server anywhere, and then just forward the connection to your server using ssh tunnels. (Note that doing so may be inefficient, because it adds an extra layer of packet encoding. I'm open to better solutions here).
The steps for Apache are:
First start the Bookworm API server (bookworm serve). Then disable CGI and enable the proxy modules:
sudo a2dismod cgi
sudo a2enmod proxy proxy_ajp proxy_http rewrite deflate headers proxy_balancer proxy_connect proxy_html
Finally, add the following to your Apache site configuration:
<Proxy *>
  Order deny,allow
  Allow from all
</Proxy>
ProxyPreserveHost On
<Location "/cgi-bin">
  ProxyPass "http://127.0.0.1:10012/"
  ProxyPassReverse "http://127.0.0.1:10012/"
</Location>