Closed keien closed 10 years ago
Hard to say without the dataset. Where can I get it?
Just pull from sentence_error_handling
Looks like a structure file issue, not sure exactly what is causing it though.
Perhaps there is some difference between the top-level elements in the tweets structure file and the shakespeare structure file?
these files are both from hassan's tagger
Found the error: Hassan's tagger generates xpaths that end with \, which lxml doesn't like.
@jannah is it possible to fix this?
Also, the top-level tweets from the twitter dataset JSON doesn't have a structureName, which also errors out. I don't know if this is because I didn't name it or if it's a bug, but is it possible to set a default value or something? Or should we do a check to see if it's there?
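For reference, the two problems above could also be guarded against on the pipeline side. This is only a rough sketch: clean_unit and default_name are hypothetical names, the unit layout is assumed to match the structure file entries in this thread (xpaths, units, tag, structureName), and stripping a trailing slash as well as a trailing backslash is an assumption.

```python
def clean_unit(unit, default_name="unnamed"):
    """Return a copy of a structure-file unit that is safer to hand to lxml.

    Hypothetical helper: the key names follow the structure file entries
    quoted in this thread; everything else is an assumption.
    """
    cleaned = dict(unit)

    # Drop trailing backslashes (and slashes) from each xpath. If stripping
    # would leave an empty string, keep the original expression untouched.
    cleaned["xpaths"] = [xp.rstrip("\\/") or xp for xp in unit.get("xpaths", [])]

    # Fall back to the tag, then to a default, when structureName is
    # missing or empty.
    if not cleaned.get("structureName"):
        cleaned["structureName"] = cleaned.get("tag") or default_name

    # Apply the same cleanup to nested units.
    cleaned["units"] = [clean_unit(child, default_name)
                        for child in unit.get("units", [])]
    return cleaned
```

Fixing this in the tagger output is probably still the better option; the helper only keeps the pipeline from erroring out in the meantime.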
I will look at those tomorrow
Regards, Hassan Jannah
I will keep dumping bugs here as I find them for now.
The metadata in the structure file has an attribute called "attr" which is an empty string, which lxml is also not happy about when it tries to read it. Is it possible to not have it there at all (I notice the shakespeare structure file does not have an "attr" in the metadata) if it does not receive a value?
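Until the tagger stops emitting it, something like the following could strip the empty value before the structure file is used. It is a sketch under assumptions: drop_empty_attr is a hypothetical name, and each metadata entry is assumed to be a JSON object that may carry an "attr" key.

```python
def drop_empty_attr(metadata):
    """Return copies of the metadata entries without empty "attr" values.

    Hypothetical helper: assumes every entry is a dict, as in the
    "metadata" lists of the structure file.
    """
    cleaned = []
    for entry in metadata:
        entry = dict(entry)  # copy so the caller's data is not modified
        if entry.get("attr") == "":
            del entry["attr"]
        cleaned.append(entry)
    return cleaned
```

Entries with a non-empty "attr", or with no "attr" at all, pass through unchanged.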
Also, apparently I have no idea how to use the tagger: I managed to get it to run by manually fixing the structure file, but I ended up with a bunch of gibberish that all got stored as properties of the top-most document. Can you maybe create a mapping for the twitter document that should work?
I created a file for you... (_1)
Regards, Hassan M. Jannah
Thanks.
Another bug: because
    {
        "name": "sentence",
        "dataType": "string",
        "structureName": "text",
        "tag": "text",
        "combine": false,
        "units": [],
        "xpaths": [
            "/tweets/tweet/text/text()"
        ],
        "type": "subunit",
        "id": "tagger-id-tweetstweettext",
        "metadata": []
    }
has a units declaration, our pipeline assumes it has children and does not check for sentences. This seems like a bug, as even units with children could contain sentences. Either we need to not have a units declaration at the sentence level, or we need to change how we process this.
After manually fixing the above, I managed to run it and have the document structure with all the sentences and everything come out properly. However, the properties are messed up; for each tweet, all of its properties have the same value, which seems to be this:
value = unicode(etree.tostring(node.getparent(), encoding="utf-8", method="text")).strip()
from structureextractor.py line 280. I'm not exactly sure what's causing this.
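One possible explanation, not confirmed anywhere in this thread: if a property xpath selects an element rather than a text node, getparent() returns the enclosing tweet, and serializing that with method="text" gives the concatenated text of the whole tweet, which would be identical for every property. The sketch below reproduces that behaviour on a made-up document; the element names and the parent_text helper are assumptions, and encoding="unicode" stands in for the Python 2 unicode() call.

```python
from lxml import etree

# Made-up stand-in for one tweet; the real dataset layout may differ.
doc = etree.fromstring(
    "<tweets><tweet><id>42</id><lang>en</lang>"
    "<text>hello world</text></tweet></tweets>"
)


def parent_text(node):
    # Python 3 spelling of the quoted line: serialize the parent as text.
    return etree.tostring(
        node.getparent(), encoding="unicode", method="text"
    ).strip()


# XPaths that select elements: the parent is the whole <tweet>.
id_element = doc.xpath("/tweets/tweet/id")[0]
lang_element = doc.xpath("/tweets/tweet/lang")[0]

# XPath that selects a text node: the parent is just <id>.
id_text_node = doc.xpath("/tweets/tweet/id/text()")[0]

print(parent_text(id_element))    # text of the whole tweet
print(parent_text(lang_element))  # same text again
print(parent_text(id_text_node))  # only the id value
```

If the generated property xpaths do not end in text() (or select attributes differently from what the extractor expects), that would match the symptom; checking the xpaths in the mapping file would confirm or rule this out.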
Everything is fixed except the last comment which I am not sure about. I will open a separate issue for it and close this one.
Any ideas?