baabaaox / ScrapyDouban

A Scrapy crawler for Douban Movie / Douban Book
729 stars 201 forks

Question about the runtime environment #23

Open Luobeia opened 2 years ago

Luobeia commented 2 years ago

Hi, how is this project supposed to be run? I followed the usage instructions on both Windows 10 and CentOS 7 in a VM, and after two days of configuring the environment it still won't run. Besides the packages listed in requirements.txt, is there anything else that needs to be installed? Many thanks.

baabaaox commented 2 years ago

@Luobeia Which step are you stuck on, and what error do you get?

$ git clone https://github.com/baabaaox/ScrapyDouban.git
# Build and start the containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# Enter the douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# Enter the scrapy directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
Luobeia commented 2 years ago

The sudo docker-compose up --build -d step already behaves differently from the demo video. At first the docker and docker-compose commands weren't recognized, which I fixed by installing them. I'm running on CentOS, and now I get the error below. This is my first Scrapy-related project and I'm a beginner, so please bear with me.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 237, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/bin/docker-compose", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 81, in main
    command_func()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 200, in perform_command
    project = project_from_options('.', options)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/command.py", line 70, in project_from_options
    enabled_profiles=get_profiles_from_options(options, environment)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/command.py", line 153, in get_project
    verbose=verbose, version=api_version, context=context, environment=environment
  File "/usr/local/lib/python3.6/site-packages/compose/cli/docker_client.py", line 43, in get_client
    environment=environment, tls_version=get_tls_version(environment)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/docker_client.py", line 170, in docker_client
    client = APIClient(use_ssh_client=not use_paramiko_ssh, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 197, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 222, in _retrieve_server_version
    f'Error while fetching server API version: {e}'
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

baabaaox commented 2 years ago

@Luobeia I looked up this error (https://github.com/docker/compose/issues/7896). It looks like the docker service on your CentOS machine is not running.

1. Check the docker service status:

sudo systemctl status docker

2. If it is not running, start it:

sudo systemctl start docker

3. Then run the earlier steps again:

$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# Enter the douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# Enter the scrapy directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
Luobeia commented 2 years ago

Thanks, I'll give it a try and come back if I run into problems.

Luobeia commented 2 years ago

Sorry, I've hit another problem. ScrapyDouban/docker/Dockerfile uses the apt-get command, but my CentOS uses yum. If I change everything to yum it won't run; if I don't change it, the apt-get command isn't recognized.
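For reference: the apt-get calls in a Dockerfile run inside the image being built, not on the host, so they do not need to be rewritten to yum just because the host runs CentOS. The package manager follows the base image named in the FROM line. A hedged illustration (the base image tag and package list below are placeholders, not ScrapyDouban's actual Dockerfile):

```dockerfile
# FROM determines which distro -- and which package manager -- is available
# inside the build. This tag is a placeholder, not the project's real base.
FROM python:3.6-slim

# These RUN steps execute inside the Debian-based container being built,
# never on the host, so apt-get is correct even on a CentOS host.
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc \
    && rm -rf /var/lib/apt/lists/*
```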

Luobeia commented 2 years ago

The problem above is solved, sorry for the noise. I still can't crawl any data, though; maybe it's a proxy issue? I'll look into proxies later. Thanks!!

baabaaox commented 2 years ago

@Luobeia If you are getting a lot of 403 responses, you need to use proxy IPs to get around them.
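For illustration, a minimal sketch of the usual Scrapy approach to rotating proxies: a downloader middleware that sets request.meta["proxy"], which is Scrapy's documented hook for routing a request through a proxy. The class name and the PROXY_POOL addresses below are hypothetical, and ScrapyDouban's own proxy handling may look different.

```python
import random

# Hypothetical proxy pool -- replace these with working proxy addresses.
PROXY_POOL = [
    "http://127.0.0.1:8118",
    "http://127.0.0.1:8119",
]

class RandomProxyMiddleware:
    """Downloader-middleware sketch: route each request through a random proxy.

    To enable it in a Scrapy project, register the class in settings.py under
    DOWNLOADER_MIDDLEWARES (the exact module path depends on your project).
    """

    def process_request(self, request, spider):
        # Scrapy honours the "proxy" key in request.meta when downloading.
        request.meta["proxy"] = random.choice(PROXY_POOL)
        return None  # None tells Scrapy to continue processing the request
```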

Luobeia commented 2 years ago

How do I check for 403s? The error I see in the logs is:

ERROR: Gave up retrying <GET https://m.douban.com/movie/subject/1292052/> (failed 3 times): DNS lookup failed: no results for hostname lookup: m.douban.com.
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: m.douban.com.

baabaaox commented 2 years ago

@Luobeia The DNS lookup failed. Is something wrong with your VM's network? Try pinging the host yourself:

ping m.douban.com 
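If ping isn't available inside the douban_scrapyd container, resolution can also be checked with a few lines of stdlib Python; the helper name below is my own:

```python
import socket

def resolve(hostname):
    """Return the first IPv4 address for hostname, or None if DNS fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Inside the container, resolve("m.douban.com") should return an IP string;
# None means the container's DNS is broken, matching the DNSLookupError above.
```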
Luobeia commented 2 years ago

OK, I'll go check whether my network can actually ping it. Thanks!

Luobeia commented 2 years ago

One more question: does the crawler have to collect movie ids first before it can crawl the movie data, i.e. it crawls exactly as many movies as there are ids?

Luobeia commented 2 years ago

I crawled over 1000 records this afternoon, but then my IP seems to have been banned and now nothing gets through at all. Also, I can't find the data anywhere on my CentOS system; why is that?

baabaaox commented 2 years ago

@Luobeia

  1. This line of code in the movie_subject spider defines the array of seed pages from which douban ids are collected. The spider fetches every link in the array, checks each page for movie links, extracts any douban id it finds and hands it to the pipeline, then recursively follows those movie links. In principle the seed links need to be numerous and spread out enough for the spider to reach far; it stops once it only encounters links it has already crawled, so add more valid links to the array yourself.
  2. Once your IP is banned, you obviously can't fetch the data you want. The data is stored in the MySQL container that docker runs. You can reach the database admin UI at your CentOS host's IP on port 8080; log in with server: mysql, username: root, password: public.
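The id-collection logic described in point 1 boils down to matching subject links and pulling out the numeric id. A rough sketch (the regex and function are my own illustration, not the project's actual movie_subject code):

```python
import re

# Subject pages look like https://movie.douban.com/subject/1292052/ or
# https://m.douban.com/movie/subject/1292052/ -- the numeric part is the id.
SUBJECT_RE = re.compile(r"douban\.com/(?:movie/)?subject/(\d+)")

def extract_douban_ids(links):
    """Return the unique douban ids found in an iterable of URLs, in order."""
    ids = []
    for link in links:
        match = SUBJECT_RE.search(link)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```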
Luobeia commented 2 years ago

I'm using phpMyAdmin: I typed 192.168.122.1:8080/phmyadmin into the browser, but it won't load. Why?

Luobeia commented 2 years ago

I looked at the demo video again; it uses Adminer. I'll go try that myself, I hadn't noticed it before.

Luobeia commented 2 years ago

One more question: I want to crawl the trailer information on douban pages. I located it with XPath and verified the selector with a browser extension, but when I modify the original movie_meta.py so that the official_site field scrapes what I want, it doesn't work. Why?
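One common reason a browser-built XPath fails in Scrapy: devtools query the rendered DOM, which can differ from the raw HTML Scrapy downloads (JavaScript-inserted content, browser-added tags such as tbody), so the selector should be verified against the actual response in scrapy shell. As a toy illustration of extracting a trailer link, here is a stdlib-only sketch; the HTML fragment and helper are made up, and real douban markup differs and calls for lxml or Scrapy selectors rather than ElementTree:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a fragment of a douban movie page;
# real pages are messier and need lxml or Scrapy's response.xpath().
FRAGMENT = """
<div class="related-pic">
  <a href="https://movie.douban.com/trailer/257631/">Trailer</a>
  <a href="https://movie.douban.com/photos/photo/1/">Photo</a>
</div>
"""

def extract_trailer(fragment):
    """Return the first link whose href points at a trailer page, or None."""
    root = ET.fromstring(fragment)
    for anchor in root.findall(".//a"):  # XPath-style descendant search
        href = anchor.get("href", "")
        if "/trailer/" in href:
            return href
    return None
```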