edisona / amcat

Automatically exported from code.google.com/p/amcat

Unable to serialize (some) HtmlElement objects #625

Closed. GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Pickle turns out to be unable to serialize HtmlElement objects, so they cannot
be sent over the wire. Error below. Possibly because the object isn't
implemented entirely in Python.

More info: 
http://librelist.com/browser//python.pptx/2013/2/25/serialize-and-de-serialize-python-pptx-objects/

A solution I can think of: stop yielding HtmlElement objects in _scrape_unit.
The only reason they're passed down the scraper is to convert them with
html2text, so why not pass a string instead? This would mean changing all the
scrapers, though.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/task/trace.py", line 228, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/task/trace.py", line 415, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/amcat/amcat/amcat/tasks.py", line 31, in scrape_unit_save_unit
    articles = [scraper._postprocess_article(article) for article in articles]
  File "/home/amcat/amcat/amcat/scraping/scraper.py", line 191, in _postprocess_article
    article = super(DatedScraper, self)._postprocess_article(article)
  File "/home/amcat/amcat/amcat/scraping/scraper.py", line 152, in _postprocess_article
    article = article.create_article()
  File "/home/amcat/amcat/amcat/scraping/document.py", line 109, in create_article
    value = self._convert(value)
  File "/home/amcat/amcat/amcat/scraping/document.py", line 143, in _convert
    return "\n\n".join(map(self._convert, val))
  File "/home/amcat/amcat/amcat/scraping/document.py", line 130, in _convert
    for js in val.cssselect("script"):
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1535, in tostring
    doctype=doctype)
  File "lxml.etree.pyx", line 2875, in lxml.etree.tostring (src/lxml/lxml.etree.c:55745)
  File "serializer.pxi", line 93, in lxml.etree._tostring (src/lxml/lxml.etree.c:90090)
  File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:10567)
AssertionError: invalid Element proxy at 76315632

Original issue reported on code.google.com by Toon.Alfrink@gmail.com on 25 Oct 2013 at 10:35

GoogleCodeExporter commented 9 years ago
An HtmlElement can be serialized just fine by converting it back to a string.
That string can be parsed again later. Doesn't seem unreasonable to me?

Original comment by Martijn....@gmail.com on 27 Oct 2013 at 2:10
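
The round trip suggested in this comment can be sketched as follows, assuming Python 3 with lxml installed. `dump_element` and `load_element` are hypothetical names, and the script-removal step only loosely mirrors the `cssselect("script")` loop visible in the traceback:

```python
import pickle

from lxml import html


def dump_element(element):
    """Serialize an element to pickled HTML text (sender side)."""
    return pickle.dumps(html.tostring(element, encoding="unicode"))


def load_element(payload):
    """Unpickle the HTML text and parse it again (receiver side)."""
    return html.fromstring(pickle.loads(payload))


original = html.fromstring("<div><p>Body text</p><script>track()</script></div>")
restored = load_element(dump_element(original))

# The restored element is a regular HtmlElement, so the receiver can
# clean it up as before, for example by dropping <script> nodes.
for script in list(restored.iter("script")):
    script.drop_tree()

text = restored.text_content()
```

Note that the re-parsed element is detached from its original document, so anything that relied on its parent or siblings would have to be resolved before serializing.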

GoogleCodeExporter commented 9 years ago
Good point.

Original comment by Toon.Alfrink@gmail.com on 28 Oct 2013 at 10:27

GoogleCodeExporter commented 9 years ago

Original comment by Toon.Alfrink@gmail.com on 30 Oct 2013 at 12:02