get_nodes_from_xpath only works when passed on a node from etree.fromstring, not etree.parse

Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.

http://wordseer.berkeley.edu/

42 stars 16 forks source link

get_nodes_from_xpath only works when passed on a node from etree.fromstring, not etree.parse #130

Closed keien closed 10 years ago

keien commented 10 years ago

For the personals set, only the documents who went to the except block of

        try:
            doc = etree.parse(infile)
        except(etree.XMLSyntaxError):
            file_string = None
            with open(infile) as file:
                file_string = "".join([line for line in file])

            file_string = file_string.replace("&", "&amp;")

            # If parsing still doesn't work, skip it
            try:
                doc = etree.fromstring(file_string)
            except(etree.XMLSyntaxError) as e:
                print("XML Error: " + str(e))
                return documents

had self.extract_unit_information(self.document_structure, doc) return a non-empty list, because get_nodes_from_xpath(xpath, parent_node) returns an empty list when passed in the result of etree.parse as parent_node.

abendebury commented 10 years ago

Which except block do you mean? Also, I am not sure it's a good idea to load the entire file into memory.

keien commented 10 years ago

The first one. And I agree; like I said on 8671f0dd3fc8610b794d512db124cdf7ba44a8ac, we shouldn't have to deal with invalid XML files. But we do need to figure out why etree.parse doesn't work.

The personals dataset is in the repository so you should be able to replicate the behavior.

abendebury commented 10 years ago

Looks like run_pipeline.py is broken on removing-readerwriter?

keien commented 10 years ago

Yeah; either you can revert the changes to the two files I pointed out on https://github.com/Wordseer/wordseer_flask/commit/fd583f1d0c51da2fb55794355e10fddbbd7a7401, or I could push out my current version which removes all readerwriter dependencies but probably breaks more unit tests

keien commented 10 years ago

I would recommend you just change the two lines of code in https://github.com/Wordseer/wordseer_flask/commit/c5408a4ae32f38b75e492d6756b337f9c3e44d54 (collectionprocessor line 21 and run_pipeline line 14)

abendebury commented 10 years ago

Could be another xpath issue. Looking at the code, it's possible that you'd get an xpath like case, which doesn't return anything.

https://github.com/Wordseer/wordseer_flask/blob/removing-readerwriter/tests/data/personals/structure.json#L3

Perhaps changing it to /case would fix our issues.

abendebury commented 10 years ago

Looking at other files, that's exactly how their root xpath is.

keien commented 10 years ago

This is relevant to #123; we need to decide on what the xpath is going to look like and how lxml will handle it

keien commented 10 years ago

Otherwise though, you're right; changing the outer xpath to /case fixes this.

abendebury commented 10 years ago

The shakespeare xpaths seem to work okay, I guess that's what they should look like.

keien commented 10 years ago

but it uses relative xpath, which Hassan doesn't want

keien commented 10 years ago

I'm going to close this since we solved the issue and we're discussing the xpath issue elsewhere