baabaaox / ScrapyDouban

A Scrapy crawler for Douban Movie / Douban Book
729 stars 201 forks

Question about the runtime environment #23

Open Luobeia opened 2 years ago

Luobeia commented 2 years ago

Hi, how is this project supposed to be run? I followed the usage instructions on both Windows 10 and CentOS 7 in a VM, and after two days of configuring the environment it still won't run. Besides the packages listed in requirements.txt, is there anything else that needs to be installed? Many thanks.

baabaaox commented 2 years ago

@Luobeia Which step are you stuck on, and what error do you get?

$ git clone https://github.com/baabaaox/ScrapyDouban.git
# Build and start the containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# Enter the douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# Enter the scrapy directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
Luobeia commented 2 years ago

The sudo docker-compose up --build -d step already behaves differently from the demo video. At first the docker and docker-compose commands weren't recognized, which I fixed by installing them. I'm running on CentOS, and now I get the error below. This is my first Scrapy-related project and I'm a beginner, so please bear with me.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 237, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/bin/docker-compose", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 81, in main
    command_func()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 200, in perform_command
    project = project_from_options('.', options)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/command.py", line 70, in project_from_options
    enabled_profiles=get_profiles_from_options(options, environment)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/command.py", line 153, in get_project
    verbose=verbose, version=api_version, context=context, environment=environment
  File "/usr/local/lib/python3.6/site-packages/compose/cli/docker_client.py", line 43, in get_client
    environment=environment, tls_version=get_tls_version(environment)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/docker_client.py", line 170, in docker_client
    client = APIClient(use_ssh_client=not use_paramiko_ssh, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 197, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 222, in _retrieve_server_version
    f'Error while fetching server API version: {e}'
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

baabaaox commented 2 years ago

@Luobeia I looked up this error (https://github.com/docker/compose/issues/7896). It looks like the docker service on your CentOS machine is not running.

1. Check the docker service status:

sudo systemctl status docker

2. If it is not running, start it:

sudo systemctl start docker

3. Then run the earlier steps again:

$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# Enter the douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# Enter the scrapy directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
Luobeia commented 2 years ago

Thanks, I'll give it a try and come back if I run into problems.

Luobeia commented 2 years ago

Sorry, I've hit another problem. ScrapyDouban/docker/Dockerfile uses the apt-get command, but my CentOS uses yum. If I change everything to yum it won't run; if I don't change it, the apt-get command isn't recognized.
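For reference: the apt-get calls in a Dockerfile run inside the image being built, not on the host, so they do not need to be rewritten to yum just because the host runs CentOS. The package manager follows the base image named in the FROM line. A hedged illustration (the base image tag and package list below are placeholders, not ScrapyDouban's actual Dockerfile):

```dockerfile
# FROM determines which distro -- and which package manager -- is available
# inside the build. This tag is a placeholder, not the project's real base.
FROM python:3.6-slim

# These RUN steps execute inside the Debian-based container being built,
# never on the host, so apt-get is correct even on a CentOS host.
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc \
    && rm -rf /var/lib/apt/lists/*
```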

Luobeia commented 2 years ago

The problem above is solved, sorry for the noise. I still can't crawl any data, though; maybe it's a proxy issue? I'll look into proxies later. Thanks!!

baabaaox commented 2 years ago

@Luobeia If you are getting a lot of 403 responses, you need to use proxy IPs to get around them.
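For illustration, a minimal sketch of the usual Scrapy approach to rotating proxies: a downloader middleware that sets request.meta["proxy"], which is Scrapy's documented hook for routing a request through a proxy. The class name and the PROXY_POOL addresses below are hypothetical, and ScrapyDouban's own proxy handling may look different.

```python
import random

# Hypothetical proxy pool -- replace these with working proxy addresses.
PROXY_POOL = [
    "http://127.0.0.1:8118",
    "http://127.0.0.1:8119",
]

class RandomProxyMiddleware:
    """Downloader-middleware sketch: route each request through a random proxy.

    To enable it in a Scrapy project, register the class in settings.py under
    DOWNLOADER_MIDDLEWARES (the exact module path depends on your project).
    """

    def process_request(self, request, spider):
        # Scrapy honours the "proxy" key in request.meta when downloading.
        request.meta["proxy"] = random.choice(PROXY_POOL)
        return None  # None tells Scrapy to continue processing the request
```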

Luobeia commented 2 years ago

How do I check for 403s? The error I see in the logs is:

ERROR: Gave up retrying <GET https://m.douban.com/movie/subject/1292052/> (failed 3 times): DNS lookup failed: no results for hostname lookup: m.douban.com.
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: m.douban.com.

baabaaox commented 2 years ago

@Luobeia The DNS lookup failed. Is something wrong with your VM's network? Try pinging the host yourself:

ping m.douban.com 
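If ping isn't available inside the douban_scrapyd container, resolution can also be checked with a few lines of stdlib Python; the helper name below is my own:

```python
import socket

def resolve(hostname):
    """Return the first IPv4 address for hostname, or None if DNS fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Inside the container, resolve("m.douban.com") should return an IP string;
# None means the container's DNS is broken, matching the DNSLookupError above.
```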
Luobeia commented 2 years ago

OK, I'll go check whether my network can actually ping it. Thanks!

Luobeia commented 2 years ago

One more question: does the crawler have to collect movie ids first before it can crawl the movie data, i.e. it crawls exactly as many movies as there are ids?

Luobeia commented 2 years ago

I crawled over 1000 records this afternoon, but then my IP seems to have been banned and now nothing gets through at all. Also, I can't find the data anywhere on my CentOS system; why is that?

baabaaox commented 2 years ago

@Luobeia

  1. This line of code in the movie_subject spider defines the array of seed pages from which douban ids are collected. The spider fetches every link in the array, checks each page for movie links, extracts any douban id it finds and hands it to the pipeline, then recursively follows those movie links. In principle the seed links need to be numerous and spread out enough for the spider to reach far; it stops once it only encounters links it has already crawled, so add more valid links to the array yourself.
  2. Once your IP is banned, you obviously can't fetch the data you want. The data is stored in the MySQL container that docker runs. You can reach the database admin UI at your CentOS host's IP on port 8080; log in with server: mysql, username: root, password: public.
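The id-collection logic described in point 1 boils down to matching subject links and pulling out the numeric id. A rough sketch (the regex and function are my own illustration, not the project's actual movie_subject code):

```python
import re

# Subject pages look like https://movie.douban.com/subject/1292052/ or
# https://m.douban.com/movie/subject/1292052/ -- the numeric part is the id.
SUBJECT_RE = re.compile(r"douban\.com/(?:movie/)?subject/(\d+)")

def extract_douban_ids(links):
    """Return the unique douban ids found in an iterable of URLs, in order."""
    ids = []
    for link in links:
        match = SUBJECT_RE.search(link)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```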
Luobeia commented 2 years ago

I'm using phpMyAdmin: I typed 192.168.122.1:8080/phmyadmin into the browser, but it won't load. Why?

Luobeia commented 2 years ago

I looked at the demo video again; it uses Adminer. I'll go try that myself, I hadn't noticed it before.

Luobeia commented 2 years ago

One more question: I want to crawl the trailer information on douban pages. I located it with XPath and verified the selector with a browser extension, but when I modify the original movie_meta.py so that the official_site field scrapes what I want, it doesn't work. Why?
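One common reason a browser-built XPath fails in Scrapy: devtools query the rendered DOM, which can differ from the raw HTML Scrapy downloads (JavaScript-inserted content, browser-added tags such as tbody), so the selector should be verified against the actual response in scrapy shell. As a toy illustration of extracting a trailer link, here is a stdlib-only sketch; the HTML fragment and helper are made up, and real douban markup differs and calls for lxml or Scrapy selectors rather than ElementTree:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a fragment of a douban movie page;
# real pages are messier and need lxml or Scrapy's response.xpath().
FRAGMENT = """
<div class="related-pic">
  <a href="https://movie.douban.com/trailer/257631/">Trailer</a>
  <a href="https://movie.douban.com/photos/photo/1/">Photo</a>
</div>
"""

def extract_trailer(fragment):
    """Return the first link whose href points at a trailer page, or None."""
    root = ET.fromstring(fragment)
    for anchor in root.findall(".//a"):  # XPath-style descendant search
        href = anchor.get("href", "")
        if "/trailer/" in href:
            return href
    return None
```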