keien closed this 10 years ago
Which `except` block do you mean? Also, I am not sure it's a good idea to load the entire file into memory.
The first one. And I agree; like I said on 8671f0dd3fc8610b794d512db124cdf7ba44a8ac, we shouldn't have to deal with invalid XML files. But we do need to figure out why `etree.parse` doesn't work.
The personals dataset is in the repository so you should be able to replicate the behavior.
Looks like `run_pipeline.py` is broken on `removing-readerwriter`?
Yeah; either you can revert the changes to the two files I pointed out on https://github.com/Wordseer/wordseer_flask/commit/fd583f1d0c51da2fb55794355e10fddbbd7a7401, or I could push out my current version, which removes all readerwriter dependencies but probably breaks more unit tests.
I would recommend you just change the two lines of code in https://github.com/Wordseer/wordseer_flask/commit/c5408a4ae32f38b75e492d6756b337f9c3e44d54 (collectionprocessor line 21 and run_pipeline line 14)
Could be another xpath issue. Looking at the code, it's possible that you'd get an xpath like `case`, which doesn't return anything.
Perhaps changing it to `/case` would fix our issues.
Looking at other files, that's exactly how their root xpath is.
This is relevant to #123; we need to decide on what the xpath is going to look like and how lxml will handle it
Otherwise though, you're right; changing the outer xpath to `/case` fixes this.
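For anyone reading this later, here is a minimal sketch of the difference between the two xpaths, assuming lxml and a document whose root element is `case`. The `SAMPLE` document and the `nodes_for` helper are made up for illustration; this is not the project's `get_nodes_from_xpath`.

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for one document whose root element is <case>.
SAMPLE = b"<case><title>Example</title></case>"


def nodes_for(xpath, parent_node):
    """Illustrative helper: evaluate xpath against parent_node."""
    return parent_node.xpath(xpath)


# etree.parse returns an ElementTree wrapping the root <case> element.
tree = etree.parse(BytesIO(SAMPLE))

# Relative path: looks for <case> children of the root, and there are none.
relative = nodes_for("case", tree)

# Absolute path: starts at the document, so it matches the root itself.
absolute = nodes_for("/case", tree)

print(len(relative))
print([node.tag for node in absolute])
```

With the relative form the root element can never match itself, which would explain the empty result; the absolute form selects the root directly.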
The shakespeare xpaths seem to work okay; I guess that's what they should look like.
but it uses relative xpath, which Hassan doesn't want
I'm going to close this since we solved the issue and we're discussing the xpath issue elsewhere
For the personals set, only the documents that went to the `except` block had `self.extract_unit_information(self.document_structure, doc)` return a non-empty list, because `get_nodes_from_xpath(xpath, parent_node)` returns an empty list when passed the result of `etree.parse` as `parent_node`.