lxml error on tweets dataset

keien commented 10 years ago

Any ideas?

(venv)keien:wordseer_flask$ python run_pipeline.py 
Traceback (most recent call last):
  File "run_pipeline.py", line 18, in <module>
    collection_processor.process(collection_dir, structure_file, extension, False)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/collectionprocessor.py", line 55, in process
    docstruc_filename, filename_extension)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/collectionprocessor.py", line 135, in extract_record_metadata
    filename))
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 39, in extract
    units = self.extract_unit_information(self.document_structure, doc)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 70, in extract_unit_information
    nodes = get_nodes_from_xpath(xpath, parent_node)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 289, in get_nodes_from_xpath
    if len(xpath.strip()) == 0 or nodes in nodes.xpath("../" + xpath):
  File "lxml.etree.pyx", line 2115, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57669)
  File "xpath.pxi", line 370, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:146579)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144977)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144832)
lxml.etree.XPathEvalError: Invalid expression

abendebury commented 10 years ago

Hard to say without the dataset. Where can I get it?

keien commented 10 years ago

Just pull from sentence_error_handling

abendebury commented 10 years ago

Looks like a structure file issue, not sure exactly what is causing it though.

abendebury commented 10 years ago

Perhaps there is some difference between the top level elements in the tweets structure file and the shakespeare structure file?

keien commented 10 years ago

These files are both from Hassan's tagger.

keien commented 10 years ago

Found the error: Hassan's tagger generates xpaths that end with \, which lxml doesn't like.

@jannah is it possible to fix this?
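
Until the tagger is fixed, a pipeline-side workaround could be to clean each xpath before handing it to lxml. This is only a minimal sketch: `clean_xpath` is a hypothetical helper, not something that exists in structureextractor.py, and it assumes the only problem is the trailing backslash described above.

```python
def clean_xpath(xpath):
    """Return the xpath without surrounding whitespace or trailing backslashes.

    Hypothetical helper: assumes the tagger's only defect is a trailing "\\".
    """
    return xpath.strip().rstrip("\\")


# Example with the sentence xpath from the tweets structure file,
# with a backslash appended to mimic the tagger output.
print(clean_xpath("/tweets/tweet/text/text()\\"))
```

Calling it right before the `xpath()` call in `get_nodes_from_xpath` would leave well-formed expressions untouched.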

keien commented 10 years ago

Also, the top-level tweets unit in the Twitter dataset JSON doesn't have a structureName, which also errors out. I don't know if this is because I didn't name it or if it's a bug, but is it possible to set a default value or something? Or should we do a check to see if it's there?
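
If we go with a check on our side, it could look roughly like this. It is a sketch under assumptions: each unit is a dict loaded from the structure file JSON, and both `get_structure_name` and the `"document"` fallback are made-up names for illustration.

```python
DEFAULT_STRUCTURE_NAME = "document"  # hypothetical fallback, not from the codebase


def get_structure_name(unit, default=DEFAULT_STRUCTURE_NAME):
    """Return the unit's structureName, or a default if it is missing or blank."""
    name = unit.get("structureName")
    if isinstance(name, str) and name.strip():
        return name.strip()
    return default


# A unit without a structureName falls back to the default.
print(get_structure_name({"name": "tweets"}))
# A unit that has one keeps it.
print(get_structure_name({"name": "sentence", "structureName": "text"}))
```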

jannah commented 10 years ago

I will look at those tomorrow


keien commented 10 years ago

I will keep dumping bugs here as I find them for now.

The metadata in the structure file has an attribute called "attr" whose value is an empty string, and lxml is not happy about that either when it tries to read it. Is it possible to leave "attr" out entirely when it has no value? (I notice the shakespeare structure file does not have an "attr" in the metadata.)
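
As a stopgap on the pipeline side, the empty "attr" could be dropped after loading the structure file. A rough sketch, assuming the metadata is a list of dicts as in the JSON; `drop_empty_attrs` and the sample entries are hypothetical.

```python
def drop_empty_attrs(metadata):
    """Return copies of the metadata entries without empty-string "attr" keys."""
    cleaned = []
    for entry in metadata:
        entry = dict(entry)  # copy so the loaded structure is not mutated
        if entry.get("attr") == "":
            del entry["attr"]
        cleaned.append(entry)
    return cleaned


# Made-up entries for illustration only.
sample = [{"name": "author", "attr": ""}, {"name": "id", "attr": "id"}]
print(drop_empty_attrs(sample))
```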

keien commented 10 years ago

Also, apparently I have no idea how to use the tagger: I got the pipeline to run by manually fixing the structure file, but I ended up with a bunch of gibberish that all got stored as properties of the top-most document. Could you create a mapping for the Twitter document that should work?

jannah commented 10 years ago

I created a file for you... (_1)


keien commented 10 years ago

Thanks.

Another bug: because

            {
               "name":"sentence",
               "dataType":"string",
               "structureName":"text",
               "tag":"text",
               "combine":false,
               "units":[

               ],
               "xpaths":[
                  "/tweets/tweet/text/text()"
               ],
               "type":"subunit",
               "id":"tagger-id-tweetstweettext",
               "metadata":[

               ]
            }

has a units declaration, our pipeline assumes it has children and does not check for sentences. This seems like a bug, as even units with children could contain sentences. Either we need to not have a units declaration at the sentence level, or we need to change how we process this.
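
If we change the processing instead, one option is to treat a unit as a parent only when its units list is non-empty. This is just a sketch of that check; `has_child_units` is a hypothetical name, and the dict below is a trimmed copy of the declaration above.

```python
def has_child_units(unit):
    """True only if the unit declares at least one child unit.

    An absent key and an empty list are treated the same way.
    """
    return bool(unit.get("units"))


# Trimmed copy of the sentence-level unit from the structure file.
sentence_unit = {
    "name": "sentence",
    "structureName": "text",
    "units": [],
    "xpaths": ["/tweets/tweet/text/text()"],
}

print(has_child_units(sentence_unit))
```

This does not cover units that have children and also contain sentences; that case would still need the sentence check to run regardless of the result.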

keien commented 10 years ago

After manually fixing the above, I managed to run it and have the document structure with all the sentences and everything come out properly. However, the properties are messed up; for each tweet, all of its properties have the same value, which seems to be this:

value = unicode(etree.tostring(node.getparent(), encoding="utf-8", method="text")).strip()

from structureextractor.py line 280. I'm not exactly sure what's causing this.
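
One possible explanation, which I have not confirmed against the pipeline: if several property nodes share the same parent, serializing the parent as text gives every one of them the same string. The sketch below shows that effect with the standard library's ElementTree rather than lxml, on a made-up tweet; `text_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up tweet, for illustration only.
tweet = ET.fromstring(
    "<tweet><user>alice</user><lang>en</lang><text>hello world</text></tweet>"
)


def text_of(element):
    """Concatenate all text inside the element, like a text-method serialization."""
    return "".join(element.itertext()).strip()


# Taking the text of each node itself gives distinct values.
own_values = {child.tag: text_of(child) for child in tweet}
# Taking the text of the shared parent gives the same value for every property.
parent_values = {child.tag: text_of(tweet) for child in tweet}

print(own_values)
print(parent_values)
```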

jannah commented 10 years ago

Everything is fixed except the issue in the last comment, which I am not sure about. I will open a separate issue for it and close this one.

lxml error on tweets dataset #117