Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

bookworm build all #91

Closed tpmccallum closed 8 years ago

tpmccallum commented 8 years ago

What follows is a log of my latest build. Most of the issues were resolved along the way, and I am recording the sequence of events in case it helps others.

I do have one specific issue that is still unresolved; you can see it if you scroll down to the very end of this page.

I get the following error when running bookworm build all:

make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
cat: .bookworm/metadata/jsoncatalog.txt: No such file or directory
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/texts/textids.dbm
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
cat: .bookworm/metadata/jsoncatalog.txt: No such file or directory
bookworm -l WARNING -d mccallum prep preDatabaseMetadata
Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 234, in prep
    getattr(self,args.goal)()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 267, in preDatabaseMetadata
    Bookworm = bookwormDB.CreateDatabase.BookwormSQLDatabase()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 144, in __init__
    self.setVariables(originFile=variableFile)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 161, in setVariables
    self.variableSet = variableSet(originFile=originFile, anchorField=anchorField, jsonDefinition=jsonDefinition,db=self.db)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/variableSet.py", line 476, in __init__
    self.jsonDefinition = json.loads(open(jsonDefinition,"r").read())
IOError: [Errno 2] No such file or directory: '.bookworm/metadata/field_descriptions_derived.json'
make[1]: *** [.bookworm/metadata/catalog.txt] Error 1
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make: *** [.bookworm/targets/encoded] Error 2

tpmccallum commented 8 years ago

I had the following setup, as per the documentation < https://github.com/Bookworm-project/BookwormDB/blob/master/README.md >:

folder/
    | field_descriptions.json
    | jsoncatalog.txt
    | input.txt

But I was getting errors like this:

IOError: [Errno 2] No such file or directory: '.bookworm/metadata/field_descriptions.json'

It also looked like the build was trying to cat a file that did not exist yet:

cat .bookworm/metadata/jsoncatalog.txt

My solution was to move the files the build was complaining about into the directory where it was looking for them (in this case .bookworm/metadata):

mv mccallum/field_descriptions.json .bookworm/metadata/
mv mccallum/jsoncatalog.txt .bookworm/metadata/

I left the input.txt file in the directory, as per the documentation:

folder/
    | input.txt

This really helped and the build ran for a while.
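
The missing-file errors above could be caught before starting a long build. The following is a minimal sketch, not part of BookwormDB: the two paths come from the error messages in this thread, while the helper name `missing_build_inputs` and its parameters are hypothetical.

```python
from pathlib import Path

# Paths taken from the error messages in this thread; the helper is hypothetical.
REQUIRED_METADATA = (
    ".bookworm/metadata/jsoncatalog.txt",
    ".bookworm/metadata/field_descriptions.json",
)


def missing_build_inputs(project_dir, required=REQUIRED_METADATA):
    """Return the required files that do not exist under project_dir."""
    root = Path(project_dir)
    # Preserve the order of `required` so the report is predictable.
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `missing_build_inputs(".")` from the build directory returns an empty list when both metadata files are in place, and otherwise lists whatever still needs to be moved.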

tpmccallum commented 8 years ago

This ran for quite some time and then appeared to fail while looking for the input.txt file:

Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 97, in tokenize
    raise IOError("Unable to find an input.txt or input.sh file in a default location")
IOError: Unable to find an input.txt or input.sh file in a default location
touch .bookworm/targets/encoded
bookworm -l WARNING -d mccallum prep database_wordcounts
ERROR:root:Query failed: 
DROP TABLE IF EXISTS words

Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 234, in prep
    getattr(self,args.goal)()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 377, in database_wordcounts
    Bookworm.load_word_list()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 204, in load_word_list
    db.query("""DROP TABLE IF EXISTS words""")
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 103, in query
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 75, in connect
    cursor.execute("CREATE DATABASE IF NOT EXISTS %s" % self.dbname)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.InternalError: (13, "Can't get stat of './mccallum' (Errcode: 13)")
make: *** [.bookworm/targets/database_wordcounts] Error 1

tpmccallum commented 8 years ago

if args.process=="text_stream":
    if args.file is None:
        for file in ["input.txt",".bookworm/texts/input.txt","../input.txt",".bookworm/texts/raw","input.sh"]:
            if os.path.exists(file):
                args.file = file
                break
        if args.file is None:
            # One of those should have worked.
            raise IOError("Unable to find an input.txt or input.sh file in a default location")

The manager.py file < https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/manager.py > seemed to look in several places for the input.txt file. I moved mine to .bookworm/texts/input.txt, so my final folder and file arrangement was:

BookwormDB/
    .bookworm/
        metadata/
            | jsoncatalog.txt
            | field_descriptions.json
        texts/
            | input.txt
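
The lookup order in the manager.py excerpt quoted above can be reproduced as a standalone function, which makes it easy to check which location a build would pick up. This is a sketch under assumptions: the candidate list is copied from the excerpt, but the function name `find_input` and its `base_dir` parameter are hypothetical (the original resolves paths against the current working directory).

```python
import os

# Same candidates, in the same order, as the manager.py excerpt in this thread.
DEFAULT_INPUT_LOCATIONS = [
    "input.txt",
    ".bookworm/texts/input.txt",
    "../input.txt",
    ".bookworm/texts/raw",
    "input.sh",
]


def find_input(base_dir=".", candidates=DEFAULT_INPUT_LOCATIONS):
    """Return the first candidate that exists relative to base_dir, else None."""
    for candidate in candidates:
        if os.path.exists(os.path.join(base_dir, candidate)):
            return candidate
    return None
```

Because the first match wins, an input.txt in the build directory takes precedence over one in .bookworm/texts/ if both exist.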

tpmccallum commented 8 years ago

I ran into an issue with the tmp table being full. I saw that this has already been addressed here: < https://github.com/Bookworm-project/BookwormDB/issues/83 >. I followed the advice (from Ben) about increasing values in the MySQL configuration, and everything seemed to continue well.

tpmccallum commented 8 years ago

The build completed successfully with the following output:

make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/texts/textids.dbm
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
bookworm -l WARNING -d mccallum prep preDatabaseMetadata
bookworm -l WARNING -d mccallum prep text_id_database
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/catalog.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
make[1]: `.bookworm/metadata/catalog.txt' is up to date.
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
bookworm -l WARNING -d mccallum tokenize text_stream | parallel --block-size 100M -u --pipe bookworm -l WARNING -d mccallum tokenize encode
touch .bookworm/targets/encoded
bookworm -l WARNING -d mccallum prep database_wordcounts
touch .bookworm/targets/database_wordcounts
bookworm -l WARNING -d mccallum prep database_metadata
/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py:100: Warning: Data truncated for column 'searchstring' at row 7576939
  cursor.execute(sql)
touch .bookworm/targets/database_metadata
touch .bookworm/targets/database

The database is somewhat populated:

mysql> show tables;
+---------------------+
| Tables_in_mccallum  |
+---------------------+
| API_settings        |
| catalog             |
| fastcat             |
| masterTableTable    |
| masterVariableTable |
| master_bigrams      |
| master_bookcounts   |
| nwords              |
| words               |
| wordsheap           |
+---------------------+
10 rows in set (0.00 sec)

For example, catalog is full of entries, but I notice that the words table holds only a single blank row. Any suggestions?

mysql> select * from words;
+--------+------+-------+----------+------+
| wordid | word | count | casesens | stem |
+--------+------+-------+----------+------+
|      1 |      |     0 |          | NULL |
+--------+------+-------+----------+------+

tpmccallum commented 8 years ago

Here are a few more counts from the database:

mysql> select count(*) from nwords;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

mysql> select * from nwords;
+--------+--------+
| bookid | nwords |
+--------+--------+
|      1 |      6 |
|      2 |      8 |
|      4 |      5 |
|      6 |      5 |
|      7 |      1 |
|      9 |      1 |
+--------+--------+
6 rows in set (0.00 sec)

mysql> select count(*) from fastcat;
+----------+
| count(*) |
+----------+
|  8605524 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from catalog;
+----------+
| count(*) |
+----------+
|  8605524 |
+----------+
1 row in set (0.00 sec)

Any help would be appreciated.

tpmccallum commented 8 years ago

I re-ran the build with a smaller set of data and found that the output was the same, except that the output for the larger dataset included these lines:

/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py:100: Warning: Data truncated for column 'searchstring' at row 7576939
  cursor.execute(sql)

Interestingly, the warning echoes the line

cursor.execute(sql)

which is line 100 of the CreateDatabase.py file:

def query(self, sql):
    """
    Billy defined a separate query method here so that the common case of a connection being
    timed out doesn't cause the whole shebang to fall apart: instead, it just reboots
    the connection and starts up nicely again.
    """
    logging.debug(" -- Preparing to execute SQL code -- " + sql)
    try:
        cursor = self.conn.cursor()
        cursor.execute(sql)  # line 100, the line echoed in the warning

tpmccallum commented 8 years ago

I think this may have been an issue with running the build over a network. I ran everything again using nohup and it worked perfectly.

bmschmidt commented 8 years ago

Sorry to not get back to you while this was going on, but glad it worked out.

It may be worth putting some of these SELECT COUNT(*) FROM [...] queries into the code somewhere, because they do help trace what's failing.
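
One way such a check might look is sketched below. It is not BookwormDB code: the function names are hypothetical, and sqlite3 stands in for MySQL only so that the example runs on its own; any DB-API 2.0 connection (such as one from MySQLdb) should work the same way. Since the words table in this thread held a single blank row rather than zero rows, the check uses a minimum row count instead of testing for emptiness.

```python
import sqlite3


def table_row_counts(conn, tables):
    """Return {table: row count} using any DB-API 2.0 connection."""
    counts = {}
    cursor = conn.cursor()
    for table in tables:
        # Table names cannot be bound as parameters, so pass trusted names only.
        cursor.execute("SELECT COUNT(*) FROM " + table)
        counts[table] = cursor.fetchone()[0]
    return counts


def tables_below(counts, minimum=1):
    """Return the sorted names of tables with fewer than `minimum` rows."""
    return sorted(name for name, n in counts.items() if n < minimum)


if __name__ == "__main__":
    # Stand-in database: sqlite3 in memory, not the MySQL database from the thread.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE catalog (bookid INTEGER)")
    conn.execute("CREATE TABLE words (wordid INTEGER)")
    conn.executemany("INSERT INTO catalog VALUES (?)", [(1,), (2,), (3,)])
    conn.execute("INSERT INTO words VALUES (1)")
    counts = table_row_counts(conn, ["catalog", "words"])
    print(counts, tables_below(counts, minimum=2))
```

Running a check like this after each build stage would show at a glance which table failed to load.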

A disconnect in the middle of a command could definitely cause problems. It seems like there must have been traces of a partial build keeping words from getting loaded in. It's very helpful to have this documentation in there. Two notes to add to the record:

  1. As you found, running the processes under nohup, tmux or screen on a remote server is a good idea. The last two also keep the error output visible, which is useful until we add an option to write it to a file.
  2. Sometimes partial builds will lead to an incomplete wordcount file. Executing bookworm build pristine can be very useful in these cases; it just nukes all database and local files so you can start over.
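
For builds launched from a script rather than a terminal, a rough analogue of the nohup approach is to start the command in its own session and send its output to a log file. This is a minimal sketch under assumptions: the function name `run_detached` and the log file name are hypothetical, `start_new_session` is POSIX-only, and the sketch is not equivalent to nohup in every detail.

```python
import subprocess


def run_detached(command, log_path):
    """Start command in its own session, appending stdout and stderr to log_path."""
    log = open(log_path, "ab")
    try:
        process = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
            start_new_session=True,    # POSIX only: separate session from the terminal
        )
    finally:
        # The child holds its own copy of the file descriptor, so closing is safe.
        log.close()
    return process
```

For the build in this thread, the call would be something like `run_detached(["bookworm", "build", "all"], "bookworm_build.log")`; warnings and tracebacks then end up in the log even if the SSH connection drops.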

tpmccallum commented 8 years ago

Thanks Ben, good advice. I am also really glad this all worked. I saw the pristine function but have not used it yet; it sounds very useful. Is there some way to get in contact with you (private email)? I am working on a privately funded project and a doctoral-level qualification, both using Bookworm, and I have a very exciting set of data (16 million entries) to show you. Tim

tpmccallum commented 8 years ago

You can email me at contact@timothymccallum.com.au