JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

10-20 times lower throughput when using snscrape in a container compared to a non-containerized environment #946

Open alinazemian opened 1 year ago

alinazemian commented 1 year ago

Describe the bug

I am not sure what exactly is wrong here, but I thought I'd share my experience in case anybody else has faced the same issue.

Using snscrape locally in a non-containerized environment takes about 0.2-0.3 s on average to fetch tweets, which is pretty good for our use case. However, when the same code runs in a container, the latency is significantly higher (e.g. 3-5 s). We thought CPU throttling might be the cause, or that it was some sort of container warm-up issue since we had been using snscrape via the command line, so we tried running it as a service, and the same issue persists. We have also tested it on Minikube and on Kubernetes with the same result. Below is the outcome of the profiling we did to understand what could be wrong here.

Profiling on a container: https://justpaste.it/alqiq

Profiling on a non-containerized env: https://justpaste.it/3frkx

Interestingly, even running snscrape --version in a container is very slow!

My guess is that some libraries snscrape uses are much faster in a non-containerized environment than when they run in a container. We haven't tested different base Docker images yet.

How to reproduce

Run it in a container, open a shell (or SSH) into the container, and try snscrape --version.
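For example, something along these lines (the image and container names are placeholders):

    # start a container from an image with snscrape installed
    docker run -d --name snscrape-test my-snscrape-image sleep infinity
    docker exec -it snscrape-test sh

    # inside the container
    time snscrape --version

    # compare against the same command on the host
    time snscrape --version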

Expected behaviour

Expect relatively similar throughput when running snscrape in a container vs. a non-containerized environment.

Screenshots and recordings

No response

Operating system

Alpine Linux v3.12

Python version: output of python3 --version

3.8.10

snscrape version: output of snscrape --version

0.6.2.20230320

Scraper

twitter-user

How are you using snscrape?

CLI (snscrape ... as a command, e.g. in a terminal)

Backtrace

No response

Log output

No response

Dump of locals

No response

Additional context

No response

JustAnotherArchivist commented 1 year ago

Your profiles show that the big difference comes from importing modules, e.g. exec_module at 4.5 seconds in the container version vs 0.2 sec without. This sounds like you aren't precompiling modules when building the image, so they all get compiled on the first import, whereas the non-containerised system has the bytecode cached. Given what some of the modules import (e.g. bs4), that could easily add up to a few seconds. If you run snscrape multiple times inside the same container, you should see better runtimes from the second command onwards due to the bytecode caching.
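One way to check this is to time two consecutive invocations inside the same container and to look at per-module import times with CPython's -X importtime flag (a rough sketch; the imported module is just an example):

    # inside the container: the first run compiles bytecode, the second reuses the cache
    time snscrape --version
    time snscrape --version

    # per-module import timings (CPython 3.7+), written to stderr
    python3 -X importtime -c "import snscrape.modules.twitter" 2> import-times.txt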

pip install snscrape should compile by default, unless you're using some modified version of pip (e.g. a distro-patched one). But you could try to explicitly pass the --compile option. Another option would be to execute snscrape --version or similar during the image build. I can't guarantee that this will stay true in the future though; lazy module loading has been on my extended wishlist for a while.
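As RUN steps in the Dockerfile, any one of these should avoid the first-run compile cost (a sketch; the site-packages path is illustrative for an Alpine/Python 3.8 setup):

    # explicitly ask pip to byte-compile on install (this is pip's default behaviour anyway)
    pip3 install --compile snscrape==0.6.2.20230320

    # or byte-compile everything that is already installed
    python3 -m compileall -q /usr/lib/python3.8/site-packages

    # or simply warm up the CLI once during the build
    snscrape --version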

alinazemian commented 1 year ago

Thanks @JustAnotherArchivist. Below is a copy of the Dockerfile RUN section for your info.

    RUN set -x \
        && apk add gcc musl-dev \
        && apk add --no-cache python3-dev="3.8.10-r0" \
        && apk add --no-cache py3-pip \
        && apk add libxml2-dev libxslt-dev \
        && pip3 install --ignore-installed pipenv \
        && pip3 install --force-reinstall bs4==0.0.1 snscrape==0.6.2.20230320

You are right that I am not precompiling, and hence it gets faster on subsequent runs, but 3-5 s is what I see after the container has been up for some time; at the start it is usually about 15-20 s. Moreover, what about when it's run as a module in a service? I'm facing the same extra latency when it's built as a service via Flask.

I will try it in precompiled mode to see how much it improves, but I wouldn't expect a significant difference given I'm facing almost the same latency in service mode.

JustAnotherArchivist commented 1 year ago

The alternative is that the issue lies in the containerisation itself. At least I can't think of anything in snscrape that could cause something like this, and I know that past versions of seccomp have had issues in this area. Those should largely have been resolved over the years though, so assuming you're using reasonably recent versions of libseccomp and Docker, that shouldn't be the problem.

JustAnotherArchivist commented 1 year ago

I just remembered one other relevant difference: Alpine Linux uses musl, which is known to have lower performance than glibc for some things. You might want to play with images employing glibc for comparison, e.g. Debian.
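A quick way to compare would be to build two otherwise identical images and time the CLI in each, e.g. (the Dockerfile names and image tags here are just placeholders):

    # one image based on e.g. python:3.8-slim-buster (glibc), one on python:3.8-alpine (musl)
    docker build -t snscrape-debian -f Dockerfile.debian .
    docker build -t snscrape-alpine -f Dockerfile.alpine .

    # compare CLI latency in each (this also includes a bit of container start-up overhead)
    time docker run --rm snscrape-debian snscrape --version
    time docker run --rm snscrape-alpine snscrape --version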

alinazemian commented 1 year ago

@JustAnotherArchivist Nice! I'll check a Debian-based image and will get back to you if that helps.

alinazemian commented 1 year ago

@JustAnotherArchivist Just wanted to let you know that we have tested various images, and it looks like using non-Alpine-based images significantly improves the latency. The best image so far according to our tests has been python:3.8.10-slim-buster (~5x faster). It's still slower than running it locally, but it's now acceptable. Thanks for your help.

JustAnotherArchivist commented 1 year ago

Good to hear there's an improvement. I don't know what the remaining difference could be apart from seccomp. I suppose you could try to disable that with --security-opt seccomp=unconfined to see if it makes any difference. If it does, maybe your Docker and/or libseccomp is outdated, or (which seems highly unlikely) the pattern of syscalls snscrape triggers is particularly bad for the isolation.
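For example (the image name is a placeholder):

    # same command with the default seccomp profile and with seccomp disabled
    time docker run --rm my-snscrape-image snscrape --version
    time docker run --rm --security-opt seccomp=unconfined my-snscrape-image snscrape --version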