Revise parser to make html cleanup optional

b-cube / semantics-preprocessing

initial text preprocessors for the triplestore and feature classification

Revise parser to make html cleanup optional #16

Open roomthily opened 9 years ago

roomthily commented 9 years ago

Related: #3 encoding problems.

So there's a parsing pathway for the NLP pipeline (clean everything) and a pipeline to the triplestore (text from the node, untouched).

Tasks:

[x] unicode escape cruft removal
[ ] add the cleanup steps (html tag removal, unicode escape removal) as options to the xml parser - we may not want to strip out the html tags for the triplestore
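
A rough sketch of what the optional cleanup could look like, assuming the parser pulls text from each node with `xml.etree`. The function name `extract_text`, the two flag names, and both regex patterns are hypothetical stand-ins, not the repo's actual API or its definition of "cruft":

```python
import re
from xml.etree import ElementTree as etree

# Assumed patterns: anything tag-shaped, and literal \uXXXX or \n, \t, \r escapes.
TAG_PATTERN = re.compile(r"<[^>]+>")
ESCAPE_PATTERN = re.compile(r"\\u[0-9a-fA-F]{4}|\\[ntr]")


def extract_text(xml_string, strip_html=True, strip_escapes=True):
    """Return the text of every non-empty node, cleaned according to the flags."""
    root = etree.fromstring(xml_string)
    texts = []
    for elem in root.iter():
        text = elem.text
        if text is None or not text.strip():
            continue
        if strip_html:
            text = TAG_PATTERN.sub(" ", text)
        if strip_escapes:
            text = ESCAPE_PATTERN.sub(" ", text)
        if strip_html or strip_escapes:
            # Collapse the whitespace left behind by the substitutions.
            text = " ".join(text.split())
        texts.append(text)
    return texts
```

With both flags on, this would be the NLP pathway; with both off, the node text comes back untouched for the triplestore.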

roomthily commented 9 years ago

See the rawresponse class - it goes from the solr response to xml as a string parsable by etree. Note that the html tag removal can't happen here - it runs against the xml text blocks instead. The same is likely true of any encoding issues related to the unicode escape.

So: basic text cleanup just to get a parsable string, and then the two other cleanup tasks (html tag removal and the unicode escape handling) run against the xml.

roomthily commented 9 years ago

Note: the CDATA wrapper for raw_content is not part of the newer nutch plugin/extension/etc. So the removal is there but likely unnecessary.
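
For reference, a minimal sketch of that kind of removal, assuming the older output wrapped the content as `<![CDATA[ ... ]]>`. The helper name `unwrap_cdata` is hypothetical and this is not the repo's actual code; the point is only that it is a no-op when no wrapper is present:

```python
import re

# Assumed wrapper shape: <![CDATA[ ... ]]>, possibly spanning multiple lines.
CDATA_PATTERN = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(text):
    """Replace each CDATA section with its inner content; unchanged if none found."""
    # A function replacement avoids backslashes in the content being
    # interpreted as regex replacement escapes.
    return CDATA_PATTERN.sub(lambda match: match.group(1), text)
```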

roomthily commented 9 years ago

We are only stripping out the unicode escape cruft if it precedes the initial XML tag - we just want an etree-parsable string.
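
Roughly, and assuming "cruft" means anything sitting before the first `<` in the response string, the step could be sketched like this. The helper name `strip_leading_cruft` is hypothetical, not the rawresponse class's actual method:

```python
from xml.etree import ElementTree as etree


def strip_leading_cruft(raw):
    """Drop everything before the first '<' so etree can parse the string."""
    start = raw.find("<")
    if start == -1:
        raise ValueError("no XML tag found in response")
    # Anything after the first tag is left alone on purpose.
    return raw[start:]
```

Something like `etree.fromstring(strip_leading_cruft(raw))` should then parse, while escapes inside the text blocks are left for the later cleanup step.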