danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

No ETA for indexing, too slow, can't parallel index #1546

Open pseudotensor opened 4 months ago

pseudotensor commented 4 months ago

1) Indexing a few channels in slack, but it's at 80k "documents" by now and no end in sight. I have no idea how long it'll take, but it's been over a day so far. Since there is no progress percentage, I have no concept of when it'll be done.

2) Indexing seems relatively slow, at about 1 Slack "document" per second. It's not clear why it is so slow on a fast internet connection + very fast system.

3) Indexing of Slack blocks all other indexing, since I set up the other connectors after the Slack one. I don't see why indexing can't be done in parallel; given that Slack is so slow, it must be possible. Even being able to manually start the others would be nice.

A concern I have is that there is no GPU usage. I followed the normal installation and startup in the docs.

https://docs.danswer.dev/quickstart

Is indexing also doing the embedding step? If not using GPUs that might explain why slow.
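One quick way to check whether the embedding model server actually sees a GPU. These commands are illustrative: the container name below is a guess based on the `danswer-stack` compose project name, and assumes `torch` is installed in the model server image; adjust both to your setup.

```shell
# Can the indexing model server container see the GPU? (container name is a guess)
docker exec danswer-stack-indexing_model_server-1 \
  python -c "import torch; print(torch.cuda.is_available())"

# Watch GPU utilization on the host while indexing runs
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
```

If `torch.cuda.is_available()` prints `False` inside the container, the embedding step is running on CPU regardless of what the host GPU shows.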

pseudotensor commented 4 months ago

Related for slack issue perhaps: https://github.com/danswer-ai/danswer/issues/1371

Will try this: https://github.com/danswer-ai/danswer/pull/1515/files

(base) jon@gpu:~/danswer/deployment/docker_compose$ docker compose -f docker-compose.gpu-dev.yml -p danswer-stack up -d --pull always --force-recreate
pseudotensor commented 4 months ago

Still no GPU usage during indexing, even with that docker-compose.gpu-dev.yml, as far as I can tell. I now at least see 2 processes on the GPU, but it sits at 0% utilization most of the time.

I see two of these on the same GPU:

root      504303  7.8  0.4 44042692 3276776 ?    Ssl  18:45   0:31  \_ /usr/local/bin/python /usr/local/bin/uvicorn model_server.main:app --host 0.0.0.0 --port 9000

What is limiting the speed of connectors?

pseudotensor commented 4 months ago

In docker logs, I see this:

INFO:     172.19.0.7:46914 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:05:52 AM             utils.py  35 : embed_text took 0.013809442520141602 seconds
INFO:     172.19.0.7:46924 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:05:52 AM             utils.py  35 : embed_text took 0.013899564743041992 seconds
INFO:     172.19.0.7:46928 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:05:57 AM             utils.py  35 : embed_text took 0.04751729965209961 seconds
INFO:     172.19.0.7:46932 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:05:57 AM             utils.py  35 : embed_text took 0.014165401458740234 seconds
INFO:     172.19.0.7:46942 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:10 AM             utils.py  35 : embed_text took 0.06580519676208496 seconds
INFO:     172.19.0.7:47986 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:10 AM             utils.py  35 : embed_text took 0.0330958366394043 seconds
INFO:     172.19.0.7:48000 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:10 AM             utils.py  35 : embed_text took 0.012110471725463867 seconds
INFO:     172.19.0.7:48006 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:10 AM             utils.py  35 : embed_text took 0.01286935806274414 seconds
INFO:     172.19.0.7:48012 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:17 AM             utils.py  35 : embed_text took 0.0673372745513916 seconds
INFO:     172.19.0.7:48018 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:17 AM             utils.py  35 : embed_text took 0.014461994171142578 seconds
INFO:     172.19.0.7:48024 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:17 AM             utils.py  35 : embed_text took 0.01154470443725586 seconds
INFO:     172.19.0.7:48034 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK
05/31/2024 02:06:17 AM             utils.py  35 : embed_text took 0.013528823852539062 seconds
INFO:     172.19.0.7:48036 - "POST /encoder/bi-encoder-embed HTTP/1.1" 200 OK

This is consistent with a capacity of roughly 80 embed calls per second, so if I'm only getting ~50 documents/minute (~1 document/second), the bottleneck must be transfer or something else, but it's not clear what the limit is. It's certainly not my internet speed.
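A back-of-envelope check using the `embed_text` timings copied from the log above supports this. The fastest calls (~13 ms) suggest ~75 calls/s; even the mean, dragged up by the slow outliers, gives well over 30 calls/s, so embedding is a small fraction of the observed ~1 doc/s:

```python
# embed_text durations (seconds) copied from the model server log above
timings = [0.0138, 0.0139, 0.0475, 0.0142, 0.0658, 0.0331,
           0.0121, 0.0129, 0.0673, 0.0145, 0.0115, 0.0135]

avg = sum(timings) / len(timings)   # mean time per embed call
embed_capacity = 1 / avg            # calls/s the model server could sustain back to back
observed = 1.0                      # docs/s the connector actually indexes

print(f"avg embed_text: {avg * 1000:.1f} ms -> ~{embed_capacity:.0f} calls/s")
print(f"embedding accounts for only ~{observed / embed_capacity:.0%} of wall time")
```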

pseudotensor commented 4 months ago

In other logs, it seems consistent with 10 documents per 10 seconds, i.e. about 1 doc per second. But I can't tell where the time goes:

05/31/2024 02:10:53 AM            timing.py  39 : [Attempt ID: 3] index_doc_batch took 9.987848043441772 seconds
05/31/2024 02:10:53 AM          document.py 244 : [Attempt ID: 3] Upserted 12 document store entries into DB
[2024-05-31 02:10:53,278: INFO/MainProcess] Task check_for_document_sets_sync_task[effc4962-3ea9-4419-873b-b57832096f8e] received
[2024-05-31 02:10:53,322: INFO/MainProcess] Task check_for_document_sets_sync_task[effc4962-3ea9-4419-873b-b57832096f8e] succeeded in 0.04216578043997288s: None
[2024-05-31 02:10:53,199: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
[2024-05-31 02:10:58,318: INFO/MainProcess] Task check_for_document_sets_sync_task[8ee9e4c6-57ab-4e98-8269-e1fcc212836d] received
[2024-05-31 02:10:58,362: INFO/MainProcess] Task check_for_document_sets_sync_task[8ee9e4c6-57ab-4e98-8269-e1fcc212836d] succeeded in 0.043132079765200615s: None
[2024-05-31 02:10:58,199: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
05/31/2024 02:11:02 AM            timing.py  39 : [Attempt ID: 3] index_doc_batch took 9.184157371520996 seconds
[2024-05-31 02:11:03,359: INFO/MainProcess] Task check_for_document_sets_sync_task[6bb2d879-99b0-4e4c-9f1e-b04146ff894f] received
[2024-05-31 02:11:03,406: INFO/MainProcess] Task check_for_document_sets_sync_task[6bb2d879-99b0-4e4c-9f1e-b04146ff894f] succeeded in 0.04587769601494074s: None
[2024-05-31 02:11:03,199: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
05/31/2024 02:11:08 AM          document.py 244 : [Attempt ID: 3] Upserted 8 document store entries into DB
[2024-05-31 02:11:08,411: INFO/MainProcess] Task check_for_document_sets_sync_task[dc083618-6f2b-471c-af78-d8838edcf1b9] received
[2024-05-31 02:11:08,457: INFO/MainProcess] Task check_for_document_sets_sync_task[dc083618-6f2b-471c-af78-d8838edcf1b9] succeeded in 0.04476207122206688s: None
[2024-05-31 02:11:08,199: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
[2024-05-31 02:11:13,460: INFO/MainProcess] Task check_for_document_sets_sync_task[27d07dc6-cb3f-4101-8f20-ef0cbcbe2e52] received
[2024-05-31 02:11:13,507: INFO/MainProcess] Task check_for_document_sets_sync_task[27d07dc6-cb3f-4101-8f20-ef0cbcbe2e52] succeeded in 0.045545514672994614s: None
[2024-05-31 02:11:13,199: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
05/31/2024 02:11:14 AM            timing.py  39 : [Attempt ID: 3] index_doc_batch took 6.108148574829102 seconds
05/31/2024 02:11:14 AM          document.py 244 : [Attempt ID: 3] Upserted 10 document store entries into DB
pseudotensor commented 4 months ago

Sometimes I see this:

05/31/2024 02:30:35 AM             utils.py  91 : [Attempt ID: 8] Slack call rate limited, retrying after 10 seconds. Exception: The request to the Slack API failed. (url: https://www.slack.com/api/conversations.replies)
The server responded with: {'ok': False, 'error': 'ratelimited'}

which I guess I can understand. Even GitHub was slow, though. I'll have to see how local files do.
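The log above shows a fixed 10-second sleep on rate limits. A more adaptive pattern is to honor Slack's `Retry-After` header and fall back to exponential backoff when it is absent. This is a sketch, not Danswer's actual retry code; `RateLimitedError` stands in for `slack_sdk.errors.SlackApiError`:

```python
import time

class RateLimitedError(Exception):
    """Stand-in for slack_sdk.errors.SlackApiError on a 429 response."""
    def __init__(self, headers):
        super().__init__("ratelimited")
        self.headers = headers

def call_with_backoff(call, max_retries=5):
    # Honor Slack's Retry-After header instead of a fixed 10 s sleep;
    # fall back to exponential backoff when the header is missing.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitedError as e:
            wait = float(e.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
    raise RuntimeError("still rate limited after retries")

# Demo with a fake API call that is rate limited twice, then succeeds.
calls = {"n": 0}
def fake_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitedError({"Retry-After": "0"})
    return {"ok": True}

print(call_with_backoff(fake_api))  # -> {'ok': True}
```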

pseudotensor commented 4 months ago

A new problem: the Slack sync failed, and I only chose to "update" it. But it's going as slowly as the original run, so it will take another day just to discover that it already has all those documents. It seems like the metadata should be used to skip them, or something.

[screenshot attached]

pseudotensor commented 4 months ago

Actually it eventually sped up for Slack; I guess some kind of caching or metadata check kicked in:

[screenshot attached]

onimsha commented 4 months ago

Be mindful that with Slack indexing, the indexer will hit rate limiting from the Slack API very often. It gets worse when the job is stopped and needs to restart: it then has to re-check ALL the previously indexed docs before fetching new ones, to account for new or deleted messages.

I had to give up on Slack indexing. The intention is good, but the Slack connector currently can't handle a big workspace. Maybe the maintainers will need a different approach.

Not to mention, the quality of Danswer's answers after ingesting Slack data is much worse: it occasionally throws out citation links to very old Slack threads (I got an answer that pointed to a 6-year-old thread). I checked the code; the Slack connector tries to fetch everything from Slack, not just recent messages. The decay algorithm needs an overhaul to improve the quality of the Slack integration. For now I would not recommend using it, IMHO.
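One way to bound the problem described above is to restrict `conversations.history` to a recent window via its `oldest` timestamp parameter, rather than paging the full channel history. This is an illustrative sketch of building such request params, not what Danswer's connector currently does:

```python
import time

def recent_history_params(channel_id: str, days: int = 90) -> dict:
    # Slack's conversations.history accepts `oldest` as a Unix timestamp
    # string; messages older than this are never fetched, so a restart
    # re-checks at most `days` worth of history instead of everything.
    oldest = time.time() - days * 86400
    return {"channel": channel_id, "oldest": f"{oldest:.6f}", "limit": 200}

print(recent_history_params("C0123456789", days=90))
```

A window like this would also sidestep the stale-citation problem, at the cost of never indexing older threads.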

pseudotensor commented 4 months ago

Understood. I've been getting more familiar with Danswer. It has a lot of positives, and with role-based auth it could be useful for enterprises, but yes, a lot more work is needed to make things reliable, fast, and understandable -- and to make them work well.

kziovas commented 3 months ago

Hello, I'm also trying out Danswer now with a very small zip file of 20 MB containing about 14k small files (basically user comments). Indexing is super slow, about 29 files/minute. Are there any options to speed it up?

Note: I have also manually set the ENABLE_MINI_CHUNK variable to false, to be sure that is not causing the issue.

Drewster87 commented 2 months ago

It looks like you fixed the indexer not using your GPU, but there's no indication of how. I'm having the same two issues: 1-4 docs indexed per second, and it fails after about 64 docs.

rajivml commented 1 month ago

Facing the same issue. There is no problem indexing small channels, but when I have to index chatty channels, scraping always fails partway through. It should ideally resume from where it left off, but on a retry it tries to scrape the whole thing again.

[screenshot attached]

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 299, in _urllib_api_call
    response_body_data = json.loads(response["body"])
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 168, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/slack/connector.py", line 362, in poll_source
    for document in get_all_docs(
  File "/app/danswer/connectors/slack/connector.py", line 299, in get_all_docs
    for message_batch in channel_message_batches:
  File "/app/danswer/connectors/slack/connector.py", line 121, in get_channel_messages
    for result in _make_paginated_slack_api_call(
  File "/app/danswer/connectors/slack/utils.py", line 60, in paginated_call
    response = call(cursor=cursor, limit=_SLACK_LIMIT, **kwargs)
  File "/app/danswer/connectors/cross_connector_utils/retry_wrapper.py", line 38, in wrapped_func
    return func(*args, **kwargs)
  File "/app/danswer/connectors/slack/utils.py", line 81, in rate_limited_call
    response = call(**kwargs)
  File "/app/danswer/connectors/slack/utils.py", line 43, in logged_call
    result = call(**kwargs)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/client.py", line 2380, in conversations_history
    return self.api_call("conversations.history", http_verb="GET", params=kwargs)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 156, in api_call
    return self._sync_send(api_url=api_url, req_args=req_args)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 187, in _sync_send
    return self._urllib_api_call(
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 302, in _urllib_api_call
    raise err.SlackApiError(message, response)
slack_sdk.errors.SlackApiError: Received a response in a non-JSON format: The server responded with: {'status': 500, 'headers': {'date': 'Fri, 23 Aug 2024 20:58:42 GMT', 'server': 'Apache', 'vary': 'Accept-Encoding', 'x-accepted-oauth-scopes': 'channels:history,groups:history,mpim:history,im:history,read', 'x-oauth-scopes': 'channels:history,channels:read,groups:history,groups:read,channels:join,im:history,users:read', 'access-control-expose-headers': 'x-slack-req-id, retry-after', 'access-control-allow-headers': 'slack-route, x-slack-version-ts, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'referrer-policy': 'no-referrer', 'x-slack-unique-id': 'Zsj4AMbHwSvGHvXQBkmo3gAAEAY', 'x-slack-backend': 'r', 'access-control-allow-origin': '*', 'content-type': 'text/html', 'content-length': '0', 'via': '1.1 slack-prod.tinyspeck.com, envoy-www-iad-ibxziztm, envoy-edge-fra-tisqvfzx', 'x-envoy-attempt-count': '1', 'x-envoy-upstream-service-time': '1616', 'x-backend': 'main_normal main_canary_with_overflow main_control_with_overflow', 'x-server': 'slack-www-hhvm-main-iad-kxze', 'x-slack-shared-secret-outcome': 'no-match', 'x-edge-backend': 'envoy-www', 'x-slack-edge-shared-secret-outcome': 'no-match', 'connection': 'close'}, 'body': ''}
```

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 299, in _urllib_api_call
    response_body_data = json.loads(response["body"])
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/danswer/connectors/slack/utils.py", line 80, in rate_limited_call
    response = call(**kwargs)
  File "/app/danswer/connectors/slack/utils.py", line 43, in logged_call
    result = call(**kwargs)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/client.py", line 2380, in conversations_history
    return self.api_call("conversations.history", http_verb="GET", params=kwargs)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 156, in api_call
    return self._sync_send(api_url=api_url, req_args=req_args)
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 187, in _sync_send
    return self._urllib_api_call(
  File "/usr/local/lib/python3.11/site-packages/slack_sdk/web/base_client.py", line 302, in _urllib_api_call
    raise err.SlackApiError(message, response)
slack_sdk.errors.SlackApiError: Received a response in a non-JSON format: The server responded with: {'status': 500, 'headers': {'date': 'Fri, 23 Aug 2024 11:08:23 GMT', 'server': 'Apache', 'vary': 'Accept-Encoding', 'x-accepted-oauth-scopes': 'channels:history,groups:history,mpim:history,im:history,read', 'x-oauth-scopes': 'channels:history,channels:read,groups:history,groups:read,channels:join,im:history,users:read', 'access-control-expose-headers': 'x-slack-req-id, retry-after', 'access-control-allow-headers': 'slack-route, x-slack-version-ts, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'referrer-policy': 'no-referrer', 'x-slack-unique-id': 'ZshtpSBVDKInu5clOoO8iQAAEAo', 'x-slack-backend': 'r', 'access-control-allow-origin': '*', 'content-type': 'text/html', 'content-length': '0', 'via': '1.1 slack-prod.tinyspeck.com, envoy-www-iad-tsjulols, envoy-edge-fra-prbqmelf', 'x-envoy-attempt-count': '1', 'x-envoy-upstream-service-time': '2314', 'x-backend': 'main_normal main_canary_with_overflow main_control_with_overflow', 'x-server': 'slack-www-hhvm-main-iad-lnyc', 'x-slack-shared-secret-outcome': 'no-match', 'x-edge-backend': 'envoy-www', 'x-slack-edge-shared-secret-outcome': 'no-match', 'connection': 'close'}, 'body': ''}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 168, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/slack/connector.py", line 362, in poll_source
    for document in get_all_docs(
  File "/app/danswer/connectors/slack/connector.py", line 299, in get_all_docs
    for message_batch in channel_message_batches:
  File "/app/danswer/connectors/slack/connector.py", line 121, in get_channel_messages
    for result in _make_paginated_slack_api_call(
  File "/app/danswer/connectors/slack/utils.py", line 60, in paginated_call
    response = call(cursor=cursor, limit=_SLACK_LIMIT, **kwargs)
  File "/app/danswer/connectors/cross_connector_utils/retry_wrapper.py", line 38, in wrapped_func
    return func(*args, **kwargs)
  File "/app/danswer/connectors/slack/utils.py", line 88, in rate_limited_call
    if e.response["error"] == "ratelimited":
KeyError: 'error'
```
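The final `KeyError: 'error'` happens because the rate-limit check indexes `e.response["error"]` directly, but Slack's bare HTTP 500 came back with an empty body, so the parsed response has no `error` key at all. A defensive version of that check might look like this (a sketch, not the actual Danswer fix):

```python
def is_ratelimited(response: dict) -> bool:
    # Slack's HTTP 500 responses can have an empty body, so the parsed
    # response dict may have no 'error' key; .get() avoids the KeyError
    # seen in the traceback above.
    return response.get("error") == "ratelimited"

print(is_ratelimited({"ok": False, "error": "ratelimited"}))  # -> True
print(is_ratelimited({"status": 500, "body": ""}))            # -> False
```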
emerzon commented 1 week ago

Note that a new feature was recently introduced that lets you increase the number of indexing workers for parallel connector processing via the env var NUM_INDEXING_WORKER. It will not multithread indexing of a single source, but it will at least prevent a large connector from blocking other sources from being indexed.
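For reference, the variable can be set in the compose file's environment section; the service name below is a guess and the exact variable name should be verified against your Danswer version:

```yaml
# deployment/docker_compose/docker-compose.dev.yml (excerpt; service name
# may differ in your setup)
services:
  background:
    environment:
      - NUM_INDEXING_WORKER=2  # name as given in the comment above; verify against your version
```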