Open roomthily opened 9 years ago
See the rawresponse class: it takes the solr response to an XML string that etree can parse. Note that the HTML tag removal can't happen here; it runs against the XML text blocks instead. The same is likely true of any encoding issues related to the unicode escape.
So: basic text cleanup just to get a parsable string, and then the two other cleanup tasks run against the XML.
Note: the CDATA wrapper for raw_content is not part of the newer nutch plugin/extension/etc. So the removal is there but likely unnecessary.
We are only stripping out the unicode escape cruft if it precedes the initial XML tag - we just want an etree-parsable string.
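A rough sketch of that minimal cleanup, for discussion only. The function names here are hypothetical (not the actual rawresponse methods), and it assumes the cruft is anything before the first `<` and that a CDATA wrapper, if present, encloses the whole payload:

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def strip_leading_cruft(raw):
    """Drop everything before the first '<' (assumed to be escape cruft)."""
    start = raw.find("<")
    if start == -1:
        raise ValueError("no XML tag found in response")
    return raw[start:]


def unwrap_cdata(raw):
    """Remove an outer CDATA wrapper if there is one; otherwise a no-op."""
    text = raw.strip()
    if text.startswith(CDATA_OPEN) and text.endswith(CDATA_CLOSE):
        return text[len(CDATA_OPEN):-len(CDATA_CLOSE)]
    return text


def to_parsable_xml(raw):
    """Only enough cleanup to get a string etree can parse."""
    return unwrap_cdata(strip_leading_cruft(raw))


# Literal backslash-u text ahead of the XML, as it might come back from solr
raw = "\\ufeff\\n<![CDATA[<root><a>hi</a></root>]]>"
root = ET.fromstring(to_parsable_xml(raw))
```

Nothing inside the XML is touched here, so any escapes in the text blocks are still there for the later cleanup steps.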
Related: #3 encoding problems.
So there's a parsing pathway for the NLP pipeline (clean everything) and a pipeline to the triplestore (text from the node, untouched).
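To make the two pathways concrete, here is a minimal sketch. The function names and the sample element are made up for illustration, and the regex tag stripping is a stand-in for whatever HTML cleanup the NLP side actually ends up using:

```python
import html
import re
import xml.etree.ElementTree as ET

# Crude tag matcher; an assumption, not a full HTML parser
TAG_RE = re.compile(r"<[^>]+>")


def text_for_triplestore(node):
    """Triplestore pathway: the node text exactly as parsed, untouched."""
    return node.text or ""


def text_for_nlp(node):
    """NLP pathway: unescape entities, drop HTML tags, collapse whitespace."""
    text = html.unescape(node.text or "")
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


# Hypothetical record: HTML markup escaped inside an XML text block
doc = ET.fromstring(
    "<doc><abstract>&lt;p&gt;Sea  surface "
    "&lt;b&gt;temperature&lt;/b&gt;&lt;/p&gt;</abstract></doc>"
)
node = doc.find("abstract")
untouched = text_for_triplestore(node)
cleaned = text_for_nlp(node)
```

Both functions read from the same parsed node, so the cleanup only ever affects the NLP copy.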
Tasks: