alinazemian opened this issue 1 year ago
Your profiles show that the big difference comes from importing modules, e.g. `exec_module` at 4.5 seconds in the container version vs 0.2 seconds without. This sounds like you aren't precompiling modules when building the image, so they all get compiled on the first import, whereas the non-containerised system has the bytecode cached. Given what some of the modules import (e.g. bs4), that could easily add up to a few seconds. If you run snscrape multiple times inside the same container, you should see better runtimes from the second command onwards due to the bytecode caching.
`pip install snscrape` should compile by default, unless you're using some modified version of pip (e.g. a distro-patched one). But you could try to explicitly pass the `--compile` option. Another option would be to execute `snscrape --version` or similar during the image build. I can't guarantee that this will stay true in the future though; lazy module loading has been on my extended wishlist for a while.
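For reference, a rough sketch of what that could look like at build time (the base image, version pin, and site-packages path here are just placeholders for your setup):

```dockerfile
# Rough sketch: warm the bytecode cache while building the image so the first
# run inside the container doesn't have to compile everything on import.
FROM python:3.8-slim

# --compile is normally the default, but passing it explicitly rules out a
# distro-patched pip that skips byte-compilation.
RUN pip3 install --compile snscrape==0.6.2.20230320

# Alternatively (or additionally), trigger the imports once during the build,
# e.g. via a throwaway invocation and/or compileall over site-packages.
RUN snscrape --version \
 && python3 -m compileall -q /usr/local/lib/python3.8/site-packages
```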
Thanks @JustAnotherArchivist. Below is a copy of the Dockerfile RUN section for your info.
```dockerfile
RUN set -x \
 && apk add gcc musl-dev \
 && apk add --no-cache python3-dev="3.8.10-r0" \
 && apk add --no-cache py3-pip \
 && apk add libxml2-dev libxslt-dev \
 && pip3 install --ignore-installed pipenv \
 && pip3 install --force-reinstall bs4==0.0.1 snscrape==0.6.2.20230320
```
You are right, I am not precompiling, and hence it gets faster on subsequent runs. However, the 3-5 s figure is from running the command after the container has been up for some time; at the start it is usually about 15-20 s. Moreover, what about when snscrape is used as a module in a service? I'm seeing the same extra latency when it's deployed as a service via Flask.
I will try it in precompiled mode to see how much it improves, but I wouldn't expect a significant gain given that I'm seeing almost the same latency in service mode.
The alternative is that the issue lies in the containerisation itself. At least I can't think of anything in snscrape that could cause something like this, and I know that past versions of seccomp have had issues in this area. Those should largely have been resolved over the years though, so assuming you're using reasonably recent versions of libseccomp and Docker, that shouldn't be the problem.
I just remembered one other relevant difference: Alpine Linux uses musl, which is known to have lower performance than glibc for some things. You might want to play with images employing glibc for comparison, e.g. Debian.
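As a rough illustration of what a glibc-based comparison image could look like (not a drop-in replacement for your Dockerfile; the Debian build-dependency package names differ from the apk ones, and may not even be needed if prebuilt wheels are used):

```dockerfile
# Illustrative glibc-based (Debian) counterpart to the Alpine image above,
# for comparing import/startup latency against musl.
FROM python:3.8-slim-buster

RUN apt-get update \
 && apt-get install -y --no-install-recommends gcc libxml2-dev libxslt1-dev \
 && rm -rf /var/lib/apt/lists/* \
 && pip3 install --compile bs4==0.0.1 snscrape==0.6.2.20230320
```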
@JustAnotherArchivist Nice! I'll check a Debian-based image and will get back to you if that helps.
@JustAnotherArchivist Just wanted to let you know that we have tested various images, and it looks like using non-Alpine-based images significantly improves the latency. The best image so far according to our tests has been python:3.8.10-slim-buster (~5x faster). It's still slower than running it locally, but it's now acceptable. Thanks for your help.
Good to hear there's an improvement. I don't know what the remaining difference could be apart from seccomp. I suppose you could try to disable that with `--security-opt seccomp=unconfined` to see if it makes any difference. If it does, maybe your Docker and/or libseccomp is outdated, or (which seems highly unlikely) the pattern of syscalls snscrape triggers is particularly bad for the isolation.
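Something along these lines would show whether the default seccomp profile is involved (the image tag is a placeholder, and unconfined is only meant for this benchmark, not for production):

```sh
# Time the same command with and without Docker's default seccomp profile.
# "your-snscrape-image" stands in for whatever tag you built.
docker run --rm your-snscrape-image sh -c 'time snscrape --version'
docker run --rm --security-opt seccomp=unconfined your-snscrape-image \
    sh -c 'time snscrape --version'
```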
Describe the bug
I am not sure what exactly could be wrong here, but I thought I'd share my experience in case anybody else faces the same issue.
Using snscrape locally in a non-containerized environment takes about 0.2-0.3 s to fetch tweets on average, which is good enough for our use case. However, when the same code runs in a container, the latency is significantly higher (e.g. 3-5 s). We thought CPU throttling might be causing issues here, or that it was some sort of container warm-up problem since we had been using snscrape via the command line, so we tested it as a service and the same issue persisted. We have also tested it on Minikube and Kubernetes, with the same result. Below is the outcome of the profiling we did to understand what could be wrong.
Profiling on a container: https://justpaste.it/alqiq
Profiling on a non-containerized env: https://justpaste.it/3frkx
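For anyone trying to reproduce these numbers, this is one way to capture a comparable profile (a sketch, not necessarily how the linked profiles were produced; even `--version` shows the slowdown, so the exact scrape command doesn't matter much):

```sh
# Profile a single snscrape invocation and print the heaviest cumulative entries.
# cProfile is pointed at the snscrape entry-point script found on PATH.
python3 -m cProfile -o snscrape.prof "$(command -v snscrape)" --version
python3 -c "import pstats; pstats.Stats('snscrape.prof').sort_stats('cumulative').print_stats(20)"
```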
Interestingly, even running `snscrape --version` in a container is super slow! My guess is that some of the libraries snscrape uses are much faster in a non-containerized environment than in a container. We haven't tested different base Docker images yet.
How to reproduce
Run it in a container. SSH into the container and try `snscrape --version`.
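Concretely, something along these lines reproduces the difference (the image tag and build context are placeholders):

```sh
# Build the Alpine-based image, then compare a containerized run to a local one.
docker build -t snscrape-alpine .
docker run --rm snscrape-alpine sh -c 'time snscrape --version'
time snscrape --version   # same command on the host, for comparison
```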
Expected behaviour
Expect relatively similar throughput when running snscrape in a container vs. a non-containerized environment.
Screenshots and recordings
No response
Operating system
Alpine Linux v3.12
Python version: output of
python3 --version
3.8.10
snscrape version: output of
snscrape --version
0.6.2.20230320
Scraper
twitter-user
How are you using snscrape?
CLI (`snscrape ...` as a command, e.g. in a terminal)
Backtrace
No response
Log output
No response
Dump of locals
No response
Additional context
No response