Weblate scanning has no signs of progress

pontaoski commented 2 years ago

Describe the issue

When importing large repositories into Weblate, the scanning phase takes a very long time, with no indication of what it is doing or how long it will take.

I already tried

[X] I've read and searched the documentation.
[X] I've searched for similar issues in this repository.

Steps to reproduce the behavior

Go to create component
Import from VCS
Input the URL of a large repository (the one I'm importing is about 150k files)
Press continue
It hangs; no signs of life to the user

Expected behavior

Go to create component
Import from VCS
Input the URL of a large repository
Press continue
It tells me what it's doing and how long it thinks it will take

Screenshots

No response

Exception traceback

No response

How do you run Weblate?

PyPI module

Weblate versions

Weblate: 4.10.1
Django: 4.0.2
siphashc: 2.1
translate-toolkit: 3.5.3
lxml: 4.7.0
Pillow: 8.4.0
bleach: 4.1.0
python-dateutil: 2.8.2
social-auth-core: 4.1.0
social-auth-app-django: 5.0.0
django-crispy-forms: 1.13.0
oauthlib: 3.2.0
django-compressor: 3.1
djangorestframework: 3.13.1
django-filter: 21.1
django-appconf: 1.0.5
user-agents: 2.2.0
filelock: 3.4.2
setuptools: 59.6.0
jellyfish: 0.8.9
openpyxl: 3.0.9
celery: 5.2.3
kombu: 5.2.3
translation-finder: 2.11
weblate-language-data: 2022.1
html2text: 2020.1.16
pycairo: 1.20.1
pygobject: 3.42.0
diff-match-patch: 20200713
requests: 2.26.0
django-redis: 5.2.0
hiredis: 2.0.0
sentry_sdk: 1.5.4
Cython: 0.29.27
misaka: 2.1.1
GitPython: 3.1.26
borgbackup: 1.1.17
pyparsing: 3.0.7
pyahocorasick: 1.4.2
python-redis-lock: 3.7.0
Python: 3.10.2
Git: 2.35.1
psycopg2-binary: 2.9.3
phply: 1.2.5
chardet: 4.0.0
ruamel.yaml: 0.17.20
tesserocr: 2.5.2
boto3: 1.20.49
zeep: 4.1.0
aeidon: 1.10.1
iniparse: 0.5
Mercurial: 6.0.2
git-svn: 2.35.1
git-review: 2.2.0
Redis server: 6.2.6
PostgreSQL server: 14.1
Database backends: django.db.backends.postgresql
Cache backends: default:RedisCache, avatar:FileBasedCache
Email setup: django.core.mail.backends.smtp.EmailBackend: smtp.sendgrid.net
OS encoding: filesystem=utf-8, default=utf-8
Celery: redis://localhost:6379, redis://localhost:6379, regular
Platform: Linux 5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64 (x86_64)

Weblate deploy checks

No response

Additional context

No response

tomkolp commented 2 years ago

I always watch the console and Docker logs during import; they show the progress of file processing. Unfortunately, I do not always have remote access to that console.

nijel commented 2 years ago

Component creation has a log that is visible in the application. Repository scanning merely consists of git clone...

nijel commented 2 years ago

To figure out what is really the expensive operation, you can try it without Weblate:

Get default branch (unless you specify it): git ls-remote --symref repo:url HEAD
Clone the repository: git clone --depth 1 --branch repo:branch repo:url repo:destination
Find translation files using translation-finder: translation-finder repo:destination
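
The three steps above can be timed individually with a small script. This is only a sketch: the helper names build_steps and time_steps are made up for illustration, and the finder command name is taken as written above (depending on how translation-finder is installed, the entry point may have a different name, so it is left as a parameter).

```python
import subprocess
import time


def build_steps(url, branch, dest, finder_cmd="translation-finder"):
    """Return (label, argv) pairs for the three manual steps.

    finder_cmd is an assumption: pass whatever command name your
    translation-finder installation actually provides.
    """
    return [
        ("default branch", ["git", "ls-remote", "--symref", url, "HEAD"]),
        ("shallow clone",
         ["git", "clone", "--depth", "1", "--branch", branch, url, dest]),
        ("translation-finder", [finder_cmd, dest]),
    ]


def time_steps(steps, runner=subprocess.run):
    """Run each step in order and return (label, elapsed_seconds) pairs."""
    timings = []
    for label, argv in steps:
        start = time.perf_counter()
        # check=True stops at the first failing step instead of timing garbage.
        runner(argv, check=True)
        timings.append((label, time.perf_counter() - start))
    return timings


# Example (placeholders, replace with your repository):
# for label, seconds in time_steps(build_steps(URL, BRANCH, DEST)):
#     print(f"{label}: {seconds:.1f}s")
```

Whichever step dominates the total is the one worth reporting back here.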

But with ~150k files, my guess is also that translation-finder is the bottleneck here, and https://github.com/WeblateOrg/weblate/issues/7251 could address this.

nijel commented 2 years ago

I've looked at translation-finder and there is a lot of room to improve its performance. https://github.com/WeblateOrg/translation-finder/commit/510ef7a2664d400b3f650a089cfbb3d6a051fdc2 should remove ~300k syscalls in your case.
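
For a general idea of how a tree scan can avoid per-file syscalls, here is a generic sketch. It is not a description of what the linked commit changes, and list_files is a made-up name. It relies on os.scandir, whose entries usually carry the file type from the directory listing itself, so is_dir() and is_file() mostly do not need an extra stat call per file.

```python
import os


def list_files(root):
    """Yield paths (relative to root) of all regular files under root."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # The file type usually comes with the directory entry,
                # so these checks typically avoid a separate stat call.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield os.path.relpath(entry.path, root)
```

With around 150k files, saving even one or two calls per file adds up to a number in the range quoted above.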

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because there wasn’t any recent activity.

It will be closed soon if no further action occurs.

Thank you for your contributions!
