adriancast / Scrapyd-Django-Template

Basic setup to run ScrapyD + Django and save the scraped data in Django models. You can be up and running in just a few minutes.

Accessing and viewing scraped data stored in database #1

Open kryptc opened 5 years ago

kryptc commented 5 years ago

I ran your code according to the instructions in the readme. I can view the responses in the logs directory in scrapy_app but when I open up my sqlite3 prompt, there are no tables or databases that have been created in spite of there being an sqlite database configured. How do you access and manipulate the scraped data? Currently, I can't verify if data has been added to my database or not.

adriancast commented 5 years ago

Hi @kryptc, I will try to update the readme this evening.

Cheers

adriancast commented 5 years ago

I just cloned the repo again and installed everything from scratch.

I followed these steps to make it work:

  1. First of all, I installed the requirements for the repo. I had some problems with that, but I think they are only related to my machine. At this point I have a virtualenv with Django and ScrapyD installed.

  2. Once you have everything installed, you need to set up the database for Django. You can do this by running the following in the terminal:

     python manage.py migrate

     To access the data in the database you will also need to create a superuser. You can do that by running:

     python manage.py createsuperuser

  3. It is time to start Django and ScrapyD. To run Django:

     python manage.py runserver

     To run ScrapyD:

     cd scrapy_app
     scrapyd

     At this point you will have both services running.

    [screenshot 2018-11-07 at 20:34:42]

The terminal on the left is running Django; the one on the right is running ScrapyD. By default the Django admin is served at http://127.0.0.1:8000/admin/ and ScrapyD runs at http://127.0.0.1:6800/ .

At this point you will have the ScrapyD admin ready to work. [screenshot 7]
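If you prefer to check from a script that both services answer before scheduling anything, here is a minimal sketch in Python. It assumes the default addresses mentioned above; the helper name `is_up` is made up for this example, and `daemonstatus.json` is ScrapyD's status endpoint.

```python
import urllib.error
import urllib.request

# Default addresses from the steps above; adjust if you changed the ports.
SERVICES = {
    "Django admin": "http://127.0.0.1:8000/admin/",
    "ScrapyD": "http://127.0.0.1:6800/daemonstatus.json",
}


def is_up(url, timeout=3.0):
    """Return True if something answers HTTP at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, unknown scheme, etc.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name}: {'up' if is_up(url) else 'not reachable'}")
```

This only tells you that a server is listening on each port, not that the project is configured correctly.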

  4. Now you need to schedule the spiders so that they crawl the data. You schedule them by sending a POST request to the ScrapyD service:

curl http://localhost:6800/schedule.json -d project=default -d spider=toscrape-css
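The same request can be sent from Python if you would rather not use curl. This is a minimal sketch using only the standard library; the function names `build_schedule_request` and `schedule_spider` are made up for this example, while the URL, project and spider names are the ones from the curl command above.

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # same host and port as the curl command


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build the POST request for schedule.json (hypothetical helper)."""
    # curl -d sends form-encoded data, so we do the same here.
    data = urllib.parse.urlencode({"project": project, "spider": spider})
    return urllib.request.Request(
        f"{base_url}/schedule.json",
        data=data.encode("ascii"),
        method="POST",
    )


def schedule_spider(project, spider, base_url=SCRAPYD_URL, timeout=10.0):
    """Send the request and return ScrapyD's JSON reply as a dict."""
    request = build_schedule_request(project, spider, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print(schedule_spider("default", "toscrape-css"))
    except OSError as exc:
        # URLError is a subclass of OSError; raised when ScrapyD is not running.
        print("Could not reach ScrapyD:", exc)
```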

Once the spiders have run, the data will be saved in the Django models. Remember that you can see the data at http://127.0.0.1:8000/admin/ using the superuser you created before. It will look something like this: [screenshot 6]
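Besides the admin, you can also confirm from outside Django that rows were saved by querying the SQLite file directly. A minimal sketch, assuming the database file is Django's default `db.sqlite3` in the directory you run it from (the path and the helper names `list_tables` and `count_rows` are assumptions for this example, not something defined in this repo):

```python
import sqlite3
from pathlib import Path

DB_PATH = "db.sqlite3"  # assumption: Django's default SQLite file name


def list_tables(db_path):
    """Return the names of all tables in the SQLite file (hypothetical helper)."""
    # sqlite3.connect() silently creates an empty file for a missing path,
    # which looks exactly like "no tables", so check first.
    if not Path(db_path).exists():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Return the number of rows in `table` (hypothetical helper)."""
    # Table names cannot be bound as parameters, so only accept known tables.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    finally:
        conn.close()
    return count


if __name__ == "__main__":
    if Path(DB_PATH).exists():
        for table in list_tables(DB_PATH):
            print(table, count_rows(DB_PATH, table))
    else:
        print(f"{DB_PATH} not found; run this from the Django project directory")
```

If the list is empty, the migrations have probably not been applied to that file, or you are looking at a different file from the one Django uses.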

At this point you are ready to go! I really hope this helps you a little. If you have more questions, just ask them in this ticket.

Cheers

kryptc commented 5 years ago

Thanks very much! I figured out there was something wrong with my sqlite. It works now.

CanderKage commented 5 years ago

Thanks for the boilerplate. It was helpful in figuring out how the two frameworks interact.

hanspruim commented 4 years ago

Hi,

First of all, thanks for sharing your work.

I keep getting an error message on the last curl step. Do you have any clue what could be wrong? I followed all the steps as described, but can't start the spider. Any help is appreciated.

[image]

Just to be sure, both instances are running and I can access them on port 8000 and 6800.

Feedback from scrapyd: [image]

Edit: Well, let me correct this. The spider is running and quotes are being stored in the database. It is just not possible to view the 'jobs' page in scrapyd: [image]

I get the following error when I run the curl command:

2020-05-17T09:59:13+0000 [_GenericHTTPChannelProtocol,11,127.0.0.1] Unhandled Error
    Traceback (most recent call last):
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2284, in allContentReceived
        req.requestReceived(command, path, version)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 946, in requestReceived
        self.process()
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 235, in process
        self.render(resrc)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 302, in render
        body = resrc.render(self)
    --- <exception caught here> ---
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/webservice.py", line 21, in render
        return JsonResource.render(self, txrequest).encode('utf-8')
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 21, in render
        return self.render_object(r, txrequest)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 29, in render_object
        txrequest.setHeader('Content-Length', len(r))
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1314, in setHeader
        self.responseHeaders.setRawHeaders(name, [value])
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in setRawHeaders
        for v in self._encodeValues(values)]
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in <listcomp>
        for v in self._encodeValues(values)]
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
        return b' '.join(headerComponent.splitlines())
    builtins.AttributeError: 'int' object has no attribute 'splitlines'

2020-05-17T09:59:13+0000 [twisted.web.server.Request#critical]
    Traceback (most recent call last):
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1755, in dataReceived
        finishCallback(data[contentLength:])
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2171, in _finishRequestBody
        self.allContentReceived()
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2284, in allContentReceived
        req.requestReceived(command, path, version)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 946, in requestReceived
        self.process()
    --- <exception caught here> ---
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 235, in process
        self.render(resrc)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 302, in render
        body = resrc.render(self)
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/webservice.py", line 27, in render
        return self.render_object(r, txrequest).encode('utf-8')
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 29, in render_object
        txrequest.setHeader('Content-Length', len(r))
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1314, in setHeader
        self.responseHeaders.setRawHeaders(name, [value])
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in setRawHeaders
        for v in self._encodeValues(values)]
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in <listcomp>
        for v in self._encodeValues(values)]
      File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
        return b' '.join(headerComponent.splitlines())

adriancast commented 4 years ago

From what I can see, the error is:

  File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
    return b' '.join(headerComponent.splitlines())
builtins.AttributeError: 'int' object has no attribute 'splitlines'

To be honest, I have no clue what is going on. All the dependencies are pinned, so it should not be a package version problem.

This repository is based on this article: https://medium.com/@ali_oguzhan/how-to-use-scrapy-with-django-application-c16fabd0e62e

I recommend asking there whether anyone else has had this problem. I am sorry, but I am not able to reproduce the error.
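For anyone who lands here with the same traceback: the last frames show `setHeader('Content-Length', len(r))` being called with an integer, and the header code then calling `.splitlines()` on that value. The sketch below reproduces only that mechanism in plain Python. `sanitize_header_value` is a made-up stand-in, not Twisted's or ScrapyD's actual code, and converting the length to a string first is shown only as an illustration; whether that is the right fix for this setup was not confirmed in this thread.

```python
def sanitize_header_value(value):
    """Hypothetical stand-in for the step named in the last traceback frame."""
    # Works for str/bytes, fails for int because int has no .splitlines().
    return b" ".join(value.splitlines())


body = b'{"status": "ok"}'  # example payload, not taken from the thread

# Passing the length as an int reproduces the error message from the log.
try:
    sanitize_header_value(len(body))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Passing the same length as bytes goes through without an error.
print(sanitize_header_value(str(len(body)).encode("ascii")))
```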