Forum-Informationsfreiheit / OffenesParlament


Petitions: Handle Exceptions for Petitions Scraper properly #21

Open lyrixderaven opened 8 years ago

lyrixderaven commented 8 years ago

Currently, the petitions scraper still throws the occasional exception, for instance:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.parlament.gv.at/PAKT/VHG/XXV/BI/BI_00058/index.shtml> (referer: None)
Traceback (most recent call last):
 File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File "/vagrant/offenesparlament/op_scraper/scraper/parlament/spiders/petitions.py", line 174, in parse
   petition_creators = self.parse_creators(response)
 File "/vagrant/offenesparlament/op_scraper/scraper/parlament/spiders/petitions.py", line 442, in parse_creators
   creators = PETITION.CREATORS.xt(response)
 File "/vagrant/offenesparlament/op_scraper/scraper/parlament/resources/extractors/petition.py", line 54, in xt
   parl_id = creator_sel.xpath("//a/@href").extract()[0].split("/")[2]
IndexError: list index out of range

While it's ok that some things don't work out when scraping, we need to catch all exceptions; otherwise Django Reversion aborts the database commit, and nothing that was scraped ends up being saved.
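One way to guard against the IndexError in parse_creators is to check the xpath result before indexing into it. A minimal sketch; the helper name is hypothetical and not part of the scraper:

```python
def extract_parl_id(hrefs):
    """Return the parl_id from the first href, or None.

    Mirrors the failing line
        creator_sel.xpath("//a/@href").extract()[0].split("/")[2]
    but degrades gracefully when the page has no matching links
    or the href has fewer than three path segments.
    """
    if not hrefs:
        return None
    parts = hrefs[0].split("/")
    return parts[2] if len(parts) > 2 else None
```

parse_creators could then skip any creator for which this returns None instead of raising.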

lyrixderaven commented 8 years ago

Still getting quite a few exceptions:

INFO:celery.redirected:2015-12-06 16:05:59 [scrapy] ERROR: Spider error processing <GET http://www.parlament.gv.at/PAKT/VHG/XXV/BI/BI_00003/index.shtml> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/vagrant/offenesparlament/op_scraper/scraper/parlament/spiders/petitions.py", line 127, in parse
    reference = self.parse_reference(response)
  File "/vagrant/offenesparlament/op_scraper/scraper/parlament/spiders/petitions.py", line 462, in parse_reference
    law__legislative_period=llp, law__parl_id=reference[1])
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 679, in filter
    return self._filter_or_exclude(False, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 697, in _filter_or_exclude
    clone.query.add_q(Q(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/query.py", line 1301, in add_q
    clause, require_inner = self._add_q(where_part, self.used_aliases)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/query.py", line 1328, in _add_q
    current_negated=current_negated, connector=connector, allow_joins=allow_joins)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/query.py", line 1144, in build_filter
    lookups, parts, reffed_aggregate = self.solve_lookup_type(arg)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/query.py", line 1030, in solve_lookup_type
    _, field, _, lookup_parts = self.names_to_path(lookup_splitted, self.get_meta())
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/query.py", line 1386, in names_to_path
    "Choices are: %s" % (name, ", ".join(available)))
FieldError: Cannot resolve keyword 'law' into field. Choices are: _slug, category, category_id, creators, description, documents, id, keywords, law_ptr, law_ptr_id, laws, legislative_period, legislative_period_id, opinions, parl_id, petition_signatures, press_releases, redistribution, reference, reference_id, references, references_id, signable, signature_count, signing_url, source_link, status, steps, title
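The FieldError occurs because the filter in parse_reference uses the lookup law__legislative_period while, according to the error message, the model exposes law_ptr but no law relation. Independent of finding the correct lookup name, the scraper could also shield Reversion from such errors by wrapping the query. A sketch with a hypothetical wrapper, not the project's actual code:

```python
import logging

log = logging.getLogger(__name__)


def safe_filter(queryset, **lookups):
    """Run queryset.filter(**lookups); on any error (e.g. a
    FieldError from a bad lookup name) log it and return an
    empty list, so one bad page cannot abort the whole scrape."""
    try:
        return list(queryset.filter(**lookups))
    except Exception as e:
        log.warning("filter %r failed: %s", lookups, e)
        return []
```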

lyrixderaven commented 8 years ago

Found another one when scraping through the admin-scraper:

ERROR:celery.worker.job:Task op_scraper.tasks.scrape[2bf9be25-7142-4ddd-88a7-675dce4c370c] raised unexpected: TypeError('coercing to Unicode: need string or buffer, dict found',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/vagrant/offenesparlament/op_scraper/tasks.py", line 24, in scrape
    process.start()
  File "/usr/local/lib/python2.7/dist-packages/reversion/revisions.py", line 290, in __exit__
    self._context_manager.end()
  File "/usr/local/lib/python2.7/dist-packages/reversion/revisions.py", line 176, in end
    in manager_context.items()
  File "/usr/local/lib/python2.7/dist-packages/reversion/revisions.py", line 175, in <genexpr>
    for obj, data
  File "/usr/local/lib/python2.7/dist-packages/reversion/revisions.py", line 616, in <lambda>
    version_data = lambda: adapter.get_version_data(instance, self._revision_context_manager._db)
  File "/usr/local/lib/python2.7/dist-packages/reversion/revisions.py", line 109, in get_version_data
    "object_repr": force_text(obj),
  File "/usr/local/lib/python2.7/dist-packages/django/utils/encoding.py", line 92, in force_text
    s = six.text_type(s)
TypeError: coercing to Unicode: need string or buffer, dict found

Seems to me that one of the objects' representations (__unicode__ or __repr__) returns a dictionary instead of a string. It must be one of 'your' objects, though, since this does not occur with the other scrapers.
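If that diagnosis is right, the fix is to make sure the model's __unicode__ returns text. A self-contained sketch of the failure mode; the Petition stand-in here is illustrative, not the real op_scraper model:

```python
class Petition(object):  # stand-in, not the real op_scraper model
    def __init__(self, parl_id, title):
        self.parl_id = parl_id
        self.title = title

    def __unicode__(self):
        # A buggy implementation returning a dict, e.g.
        #     return {'id': self.parl_id}
        # makes six.text_type(obj) raise exactly the
        # "coercing to Unicode" TypeError seen above.
        return u"%s: %s" % (self.parl_id, self.title)

    __str__ = __unicode__  # keep py2/py3 behaviour consistent
```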

Please try running your scraper from the admin interface against an empty/pristine database. If everything worked, the petitions should have been saved. If you run your scraper that way and there are no petitions in the DB afterwards, check the logs (in /var/log/celery_worker.*) for stacktraces that prevent Django Reversion or the scraper itself from properly saving the petitions.

Horrendus commented 8 years ago

Just found that now. My pull request fixes the FieldError; looking into the second error.

Horrendus commented 8 years ago

Are you sure the second error is only related to Petitions? I just scraped laws_initatives (or pre_laws) and had the same error at the end. Also, it seems to be related to the kwargs of the scraper and not to the individual scraped objects.