Closed MartinGer closed 3 years ago
I started reviewing this PR with the aim of merging it.
So far I have identified two issues.
The first one:
Traceback (most recent call last):
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/pipelines/school_pipeline.py", line 6, in process_item
school = spider.normalize(item)
File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/spiders/sachsen.py", line 233, in normalize
phone=list(item.get('phone_numbers').values())[0],
IndexError: list index out of range
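The `IndexError` above means `item.get('phone_numbers')` returned an empty mapping for at least one school, so `list(...)[0]` had nothing to index. A minimal sketch of a guard follows; the helper name `first_value` is hypothetical and not part of the repository, and whether `None` is the right fallback for a missing phone number is an assumption.

```python
from typing import Any, Mapping, Optional


def first_value(mapping: Optional[Mapping[str, Any]], default: Any = None) -> Any:
    """Return the first value of `mapping`, or `default` if it is None or empty.

    Hypothetical helper, not part of jedeschule-scraper.
    """
    if not mapping:
        return default
    # next() with a default never raises, unlike list(...)[0].
    return next(iter(mapping.values()), default)
```

In `normalize`, the failing line could then read `phone=first_value(item.get('phone_numbers'))`, which also covers the case where the key is missing entirely.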
And this one:
Traceback (most recent call last):
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 114, in _get_form
form = forms[formnumber]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/media/Daten_Linux/Entwicklung/jedeschule-scraper/jedeschule/spiders/sachsen.py", line 29, in parse_schoolist
callback=self.parse_school)
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 49, in from_response
form = _get_form(response, formname, formid, formnumber, formxpath)
File "/home/cyroxx/.virtualenvs/jedeschule/lib/python3.7/site-packages/scrapy/http/request/form.py", line 117, in _get_form
(formnumber, response))
IndexError: Form number 2611 not found in <200 https://schuldatenbank.sachsen.de/index.php?id=25&id=25&feld1=01&begriff1=&bedingung=and&feld2=02&begriff2=>
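The second traceback shows `FormRequest.from_response` being asked for form number 2611 on a page that has fewer forms than that. One possible way to avoid the crash is to check how many forms the response actually contains before building requests. The sketch below uses only the standard library; `count_forms` and `valid_form_numbers` are hypothetical names, and the count may differ from Scrapy's own lxml-based lookup on malformed HTML.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class FormCounter(HTMLParser):
    """Counts opening <form> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.count += 1


def count_forms(html: str) -> int:
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def valid_form_numbers(html: str, candidates: Iterable[int]) -> List[int]:
    """Keep only the zero-based indexes that refer to an existing form."""
    total = count_forms(html)
    return [n for n in candidates if 0 <= n < total]
```

In `parse_schoolist`, the spider could filter its form indexes with `valid_form_numbers(response.text, ...)` and log the ones that were dropped, instead of letting the exception abort the callback.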
Closing this PR, work will be continued in #71.
Updated the crawler to work with the updated website.