collective / collective.solr

Solr search engine integration for Plone
https://pypi.org/project/collective.solr/

[Plone5.2-rc2/Python3.6/c.solr8.0.0a1] Parsing error xmlSAX2Characters: huge text node #239

Open NicolasGoeddel opened 4 years ago

NicolasGoeddel commented 4 years ago

There is a problem with parsing huge XML output from Solr's extraction handler. When I index a PDF file with nearly 3000 pages of text, Solr extracts that text and returns an XML response that is handled by collective.solr.indexer.BinaryAdder. The problem is etree.parse(response), which fails on very large text nodes. I guess it needs to be changed to etree.iterparse(), but that is a bigger change.

It would be nicer if collective.solr extracted and indexed a binary object in a single step. I don't know if this is possible with Solr's API. At the moment collective.solr extracts all the text of a binary blob using Solr, saves that text into a dictionary and sends it back to Solr to index it. That does not look very efficient to me. Maybe you know of a simple change that does both things together without the intermediate step.
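To make the two flows concrete, here is a rough sketch of the request URLs involved. It assumes the extraction goes through Solr's ExtractingRequestHandler at /update/extract, whose reference documentation lists the extractOnly, extractFormat, literal.<field> and fmap.<field> parameters. The base URL, the UID field and both helper names are made up for illustration, and the single-step variant is untested against collective.solr.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL, for illustration only.
SOLR_BASE = "http://localhost:8983/solr/plone"


def extract_only_url(path):
    """Two-step flow: Solr returns the extracted text without indexing it."""
    params = {
        "stream.file": path,      # file that Solr reads from its own disk
        "extractOnly": "true",    # return the text instead of indexing it
        "extractFormat": "text",
        "wt": "xml",
    }
    return "%s/update/extract?%s" % (SOLR_BASE, urlencode(params))


def extract_and_index_url(path, uid):
    """Single-step flow (assumption): Solr extracts and indexes in one request."""
    params = {
        "stream.file": path,
        "literal.UID": uid,                   # hypothetical unique-key field
        "fmap.content": "SearchableText",     # map extracted body to this field
        "commit": "true",
    }
    return "%s/update/extract?%s" % (SOLR_BASE, urlencode(params))


print(extract_only_url("/tmp/mypdf.pdf"))
print(extract_and_index_url("/tmp/mypdf.pdf", "abc123"))
```

In the second variant the extracted text never travels back to Plone, so the huge XML response would not need to be parsed at all. Whether the remaining catalog fields can all be passed as literal.* parameters is an open question.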

For your information, this is the full warning:

2019-09-17 17:04:24,067 WARNING [collective.solr.indexer:178][waitress] Parsing error xmlSAX2Characters: huge text node, line 160970, column 47 (<string>, line 160970) @ /bfd-db/content/mypdf.pdf.

NicolasGoeddel commented 4 years ago

I was able to solve the problem using etree.iterparse(). I modified collective.solr.indexer.BinaryAdder.__call__() inside the try block, directly after the call to conn.doPost, like so:

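        try:
            response = conn.doPost(
                url, encodedPost.to_string(), headers
            )

            # Stream the response instead of building the whole tree at once.
            # huge_tree=True disables libxml2's security limits, including
            # the one on very long text content.
            context = etree.iterparse(response, huge_tree=True)

            data["SearchableText"] = u""
            for event, elem in context:
                # Collect the text of direct children of <response>.
                parent = elem.getparent()
                if parent is not None and parent.tag == "response":
                    if elem.text is not None:
                        data["SearchableText"] += elem.text.strip()

        except SolrConnectionException as e:
        ....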
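The loop can also be tried outside Plone. The sketch below is a stdlib-only stand-in, not the patch itself: it swaps lxml for xml.etree.ElementTree.iterparse, which has neither huge_tree nor getparent(), so the parent tag is tracked with a small stack. The function name collect_response_text and the sample XML are made up for illustration, and the sketch says nothing about libxml2's text-node limit.

```python
import io
import xml.etree.ElementTree as ET


def collect_response_text(stream):
    """Stream-parse XML and join the text of direct children of <response>."""
    parts = []
    stack = []  # tags of the currently open elements, outermost first
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        # "end" event: the element is complete, so elem.text is final.
        stack.pop()
        parent_tag = stack[-1] if stack else None
        if parent_tag == "response" and elem.text and elem.text.strip():
            parts.append(elem.text.strip())
        elem.clear()  # drop content that is no longer needed
    return u" ".join(parts)


# Hypothetical response shape, shortened for the example.
sample = (
    b"<response>"
    b"<lst name='responseHeader'><int name='status'>0</int></lst>"
    b"<str name='doc.pdf'> page one </str>"
    b"<str name='other'>page two</str>"
    b"</response>"
)
print(collect_response_text(io.BytesIO(sample)))  # prints "page one page two"
```

Unlike the snippet above, this version gathers the pieces in a list and joins them with a space, so words from neighbouring elements do not run together and the string is not rebuilt on every iteration.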
tisto commented 4 years ago

@NicolasGoeddel thanks for reporting this and providing a fix. This is highly appreciated. I'd be more than happy to review and merge a PR if you would care to open one. :)

NicolasGoeddel commented 4 years ago

I will take a look at how PRs work. I have never done one. It seems I have to fork first, create a branch and so on.

tisto commented 4 years ago

@NicolasGoeddel awesome! Yes, you can fork the repo and then open a pull request, or check out the repository from the collective. For the latter option, I would have to add you to the Plone collective. I'd be more than happy to do so if you are OK with it.