Datenschule / jedeschule-scraper

MIT License

update Sachsen crawler #52

Closed MartinGer closed 3 years ago

MartinGer commented 4 years ago

Updated the crawler to work with the updated website.

cyroxx commented 4 years ago

Started to review and merge this PR.

At the moment, I have identified two issues:

This one:

Traceback (most recent call last):
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/pipelines/school_pipeline.py", line 6, in process_item
    school = spider.normalize(item)
  File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/spiders/sachsen.py", line 233, in normalize
    phone=list(item.get('phone_numbers').values())[0],
IndexError: list index out of range
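
The `IndexError` above is what `list(...)[0]` raises when `phone_numbers` is an empty dict, i.e. a school page with no phone entries. A minimal sketch of a guard, assuming `item` is dict-like as in the traceback; the helper name `first_phone_number` is hypothetical and not part of the scraper:

```python
def first_phone_number(item):
    """Return the first value in item['phone_numbers'], or None if there is none.

    Hypothetical helper: assumes `item` supports .get() and that
    'phone_numbers' maps labels to phone number strings.
    """
    # Treat a missing key or a None value like an empty mapping.
    phone_numbers = item.get('phone_numbers') or {}
    # next() with a default avoids indexing into an empty list.
    return next(iter(phone_numbers.values()), None)
```

In `normalize`, the call could then read `phone=first_phone_number(item)`, which leaves the phone field as `None` for schools without a listed number.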

And this one:

Traceback (most recent call last):
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 114, in _get_form
    form = forms[formnumber]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/spiders/sachsen.py", line 29, in parse_schoolist
    callback=self.parse_school)
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 49, in from_response
    form = _get_form(response, formname, formid, formnumber, formxpath)
  File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 117, in _get_form
    (formnumber, response))
IndexError: Form number 2611 not found in <200 https://schuldatenbank.sachsen.de/index.php?id=25&id=25&feld1=01&begriff1=&bedingung=and&feld2=02&begriff2=>
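
The second traceback shows `FormRequest.from_response` being asked for form number 2611 on a page that has fewer `<form>` elements than that. One possible guard is to count the forms first and skip indexes that are out of range. The sketch below uses only the standard library to illustrate the check; the names `count_forms` and `valid_form_numbers` are hypothetical, and Scrapy's own parser (lxml) may count forms differently on malformed HTML:

```python
from html.parser import HTMLParser


class _FormCounter(HTMLParser):
    """Counts <form> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'form':
            self.count += 1


def count_forms(html):
    """Return the number of <form> elements found in the HTML string."""
    counter = _FormCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def valid_form_numbers(html, requested):
    """Keep only the zero-based form indexes that exist on the page."""
    total = count_forms(html)
    return [n for n in requested if 0 <= n < total]
```

Inside the spider, the same bound could presumably be derived from the response itself (for example by counting the nodes matched by `//form`) before building each request, so that a changed page layout is logged and skipped instead of raising.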

cyroxx commented 3 years ago

Closing this PR, work will be continued in #71.