Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Issue with the way stringprocessor creates words #67

Closed: keien closed this issue 10 years ago

keien commented 10 years ago

On my current version of handling-duplicates (604c4fef5c18723aecd24a3a953be37949a83cf9), there's a problem with the way stringprocessor.py creates sentences and words, and the way readerwriter handles duplicates.

In the original stringprocessor, tokenize returned a Sentence that already had a list of words associated with it, but this was not ideal because the word_in_sentence association objects didn't have their extra fields (position, tag, etc.) set, and updating them on the fly was a hassle. The other option would be to remove the original words from the sentence and create new words, but that doesn't make much sense.
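To make the association-object problem concrete, here is a rough sketch in plain Python 3 (not SQLAlchemy, and not the project's actual models). Only `position`, `tag`, and `add_word` come from the description above; the class names `Word`, `WordInSentence`, and `Sentence`, the `word_links` attribute, and the sample tags are hypothetical. The point is that each word/sentence link carries its own fields, so a bare list of words on the sentence cannot hold them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Hypothetical stand-in for the Word model.
    text: str


@dataclass
class WordInSentence:
    # Hypothetical association object: one occurrence of a word in a sentence.
    word: Word
    position: int
    tag: str


@dataclass
class Sentence:
    # Hypothetical stand-in for the Sentence model.
    text: str
    word_links: List[WordInSentence] = field(default_factory=list)

    def add_word(self, word: Word, position: int, tag: str) -> WordInSentence:
        # Create the link together with its extra fields, so nothing
        # has to be patched onto the association afterwards.
        link = WordInSentence(word=word, position=position, tag=tag)
        self.word_links.append(link)
        return link

    @property
    def words(self) -> List[Word]:
        # Convenience view that hides the association objects.
        return [link.word for link in self.word_links]


sentence = Sentence(text="The cat sat")
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
for position, (text, tag) in enumerate(tagged):
    sentence.add_word(Word(text), position, tag)
```

If the words are attached through `sentence.words` alone, there is nowhere to put `position` or `tag`, which is the hassle described above.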

I tried changing the end of tokenize_from_raw to use sentence.add_word, but this meant that I would have to move the duplication handling into stringprocessor, which I don't want to do. So now I just have it pass in the list of words through sentence.tagged_words, which is not ideal, but not terrible either. However, I now get this error, which I have no idea what to do about:

Traceback (most recent call last):
  File "run_pipeline.py", line 18, in <module>
    collection_processor.process(collection_dir, structure_file, extension, False)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 61, in process
    self.parse_documents()
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 179, in parse_documents
    document_parser.parse_document(doc)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/parser/documentparser.py", line 51, in parse_document
    parse_products = self.parser.parse(sentence.text)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/stringprocessor.py", line 81, in parse
    dependencies, tokenize_from_raw(parsed, sent)[0].tagged)
  File "<string>", line 4, in __init__
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/state.py", line 196, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/home/keien/dev/wordseer_flask/app/models/parseproducts.py", line 33, in __init__
    self.words = pos_tags
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 220, in __set__
    instance_dict(instance), value, None)
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 975, in set
    lambda adapter, i: adapter.adapt_like_to_iterable(i))
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 991, in _set_iterable
    new_values = list(adapter(new_collection, iterable))
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/ext/associationproxy.py", line 589, in __iter__
    for member in self.col:
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/ext/associationproxy.py", line 497, in <lambda>
    col = property(lambda self: self.lazy_collection())
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/sqlalchemy/ext/associationproxy.py", line 453, in __call__
    "stale association proxy, parent object has gone out of "
sqlalchemy.exc.InvalidRequestError: stale association proxy, parent object has gone out of scope

I think it'd be easier if we changed the tokenize_from_raw implementation to just return a dictionary, like the way we handled dependencies. Do you think there'd be issues with that?
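Roughly what I have in mind, as a sketch only. The shape of `parsed` here (a dict with a "sentences" list whose "words" entries are `(surface, attributes)` pairs) and the keys of the returned dictionaries are assumptions for illustration, not the actual parser output or the final implementation. The idea is that stringprocessor returns plain data and creates no model objects, so the duplicate handling can stay in readerwriter.

```python
def tokenize_from_raw(parsed, text):
    """Return one plain dictionary per sentence instead of model objects.

    `parsed` is assumed (hypothetically) to look like:
    {"sentences": [{"words": [("The", {"PartOfSpeech": "DT", "Lemma": "the"})]}]}
    """
    sentences = []
    for raw_sentence in parsed["sentences"]:
        words = []
        for position, (surface, attrs) in enumerate(raw_sentence["words"]):
            words.append({
                "word": surface,
                # Fall back to the lowercased surface form if no lemma is given.
                "lemma": attrs.get("Lemma", surface.lower()),
                "tag": attrs.get("PartOfSpeech", ""),
                "position": position,
            })
        sentences.append({"text": text, "words": words})
    return sentences


# Hypothetical parser output for a two-word sentence.
example_parsed = {
    "sentences": [
        {"words": [
            ("The", {"PartOfSpeech": "DT", "Lemma": "the"}),
            ("cat", {"PartOfSpeech": "NN"}),
        ]}
    ]
}
result = tokenize_from_raw(example_parsed, "The cat")
```

The caller would then look up or create each Word itself and build the word_in_sentence rows with `position` and `tag` already filled in.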

abendebury commented 10 years ago

Hmm, no, I don't expect issues with that. We might as well do it that way since we're handling dependencies like that in the same file; it's less of a headache.

abendebury commented 10 years ago

Is this still an issue?

keien commented 10 years ago

Fixed in 1478aff214c83287b74925dc1424cf2400fe2d29