[Issue] List index out of range

MuratDoganer commented 1 year ago

Hey,

After the latest update I started to get this issue from Notion, and the answers do not reference Notion pages at all

Weves commented 1 year ago

@MuratDoganer would you be able to pull up the logs for the background job and post them here? The command from the instance running Danswer would be something like:

docker logs danswer-stack-background-1

You may need to try deleting the connector, then re-running it so that the error repeats so you can find it in the logs.

fan-wen commented 1 year ago

I had the same issue:

10/08/2023 12:00:26 PM update.py 389 : [Attempt ID: 4] Indexing job with ID '4' failed due to list index out of range
Traceback (most recent call last):
  File "/app/danswer/background/update.py", line 378, in _run_indexing_entrypoint
    _run_indexing(
  File "/app/danswer/background/update.py", line 345, in _run_indexing
    _index(db_session, index_attempt, doc_batch_generator, run_time)
  File "/app/danswer/background/update.py", line 343, in _index
    raise e
  File "/app/danswer/background/update.py", line 290, in _index
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/indexing_pipeline.py", line 123, in _indexing_pipeline
    insertion_records = document_index.index(
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 404, in index
    return _index_vespa_chunks(chunks=chunks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 232, in _index_vespa_chunks
    chunk_already_existed = future.result()
                            ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 139, in _index_vespa_chunk
    deletion_success = _delete_vespa_doc_chunks(document.id)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 116, in _delete_vespa_doc_chunks
    doc_chunk_ids = _get_vespa_chunk_ids_by_document_id(document_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 107, in _get_vespa_chunk_ids_by_document_id
    doc_chunk_ids.extend([hit["id"].split("::")[1] for hit in hits])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/datastores/vespa/store.py", line 107, in <listcomp>
    doc_chunk_ids.extend([hit["id"].split("::")[1] for hit in hits])
IndexError: list index out of range
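
The last frame of this traceback takes element `[1]` of `hit["id"].split("::")`, which raises `IndexError` whenever an ID contains no `::` separator. Below is a minimal sketch of that failure and one defensive alternative. The sample IDs and the helper name `extract_chunk_id` are made up for illustration; this is not Danswer's code and not necessarily how the maintainers addressed it.

```python
from typing import Optional


def extract_chunk_id(hit_id: str) -> Optional[str]:
    """Return the segment after the first '::', or None if there is no separator.

    Hypothetical helper: the ID layout is assumed from the traceback only.
    """
    parts = hit_id.split("::")
    return parts[1] if len(parts) > 1 else None


# Made-up IDs: one with the separator, one without.
hits = [{"id": "id:danswer:chunk::abc-123"}, {"id": "malformed-id"}]

# The pattern from the traceback fails on the second hit.
try:
    [hit["id"].split("::")[1] for hit in hits]
except IndexError as exc:
    print(f"unguarded parse failed: {exc}")

# Guarded version: skip hits whose ID cannot be parsed.
chunk_ids = [cid for hit in hits if (cid := extract_chunk_id(hit["id"])) is not None]
print(chunk_ids)
```

Skipping unparseable IDs silently may hide a real data problem, so logging the offending ID is probably worth doing in practice.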

Weves commented 1 year ago

We've added some additional logging in https://github.com/danswer-ai/danswer/pull/541. I've also run into this issue, but I haven't been able to reproduce it consistently.

Will update here again once we run into it again and are able to check the new logs. Alternatively, if you pull the latest and run into this issue again, your logs would be really useful 🙇‍♂️

Pipboyguy commented 1 year ago

I'm getting the same issue with the File Connector: IndexError: list index out of range

Weves commented 1 year ago

@Pipboyguy did the Danswer instance you're running have https://github.com/danswer-ai/danswer/pull/541 pulled in?

Pipboyguy commented 1 year ago

@Weves Yes, tested yesterday with latest commit

Weves commented 1 year ago

Hmm, @Pipboyguy any chance you could find / post the logs from the background container where you ran into this error? Would be really helpful for debugging 🙏

MuratDoganer commented 1 year ago

Hey, apologies, I ended up getting Covid (apparently it's still a thing these days) and had to take some time off work.

I pulled the latest version today, removed the integration and added it back in and the error stopped popping up :)

Logs look clear as well

Not sure what it was, but something since the last update resolved it (at least for me). If others are still getting this issue, I would recommend the following:

1. Disable the integration, then delete it (but don't remove the key).
2. Head into Docker and update as normal. I used:
   docker compose -f docker-compose.prod.yml -p danswer-stack up -d --build --force-recreate
3. Once it is back up, re-connect and enable the Notion integration and let it do its thing.

Weves commented 1 year ago

Great! Glad to hear it's working now :D

Closing for now, but of course open this issue back up if you run into this problem again.

vidigalp commented 1 year ago

It seems I'm getting the same error in the Slack connector.

MuratDoganer commented 1 year ago

It has returned. Notion will now not complete at all. This is all I can see, and the logs in the backend don't really say much either.

[Image: Screenshot 2023-11-01 at 15 48 33]

MuratDoganer commented 1 year ago

@Weves Might be worth reopening? I am on the latest build

Weves commented 1 year ago

Hmm, yea makes sense to re-open.

@MuratDoganer do you think you could pull the logs for one of these errors?

gius commented 1 year ago

Hi, I am getting a similar error when using the web connector:

2023-11-06 14:01:26 danswer-stack-background-1     | 11/06/2023 01:01:26 PM          document.py 209 : [Attempt ID: 1] Upserted 18 document store entries into DB
2023-11-06 14:01:27 danswer-stack-background-1     | 11/06/2023 01:01:27 PM            update.py 504 : Running update, current UTC time: 2023-11-06 13:01:27
2023-11-06 14:01:27 danswer-stack-background-1     | 11/06/2023 01:01:27 PM            update.py 442 : Found 0 new indexing tasks.
2023-11-06 14:01:28 danswer-stack-background-1     | 11/06/2023 01:01:28 PM            update.py 360 : [Attempt ID: 1] Failed connector elapsed time: 268.6359906196594 seconds
2023-11-06 14:01:28 danswer-stack-background-1     | 11/06/2023 01:01:28 PM            update.py 424 : [Attempt ID: 1] Indexing job with ID '1' failed due to list index out of range
2023-11-06 14:01:28 danswer-stack-background-1     | Traceback (most recent call last):
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/background/update.py", line 413, in _run_indexing_entrypoint
2023-11-06 14:01:28 danswer-stack-background-1     |     _run_indexing(
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/background/update.py", line 376, in _run_indexing
2023-11-06 14:01:28 danswer-stack-background-1     |     _index(db_session, index_attempt, doc_batch_generator, run_time)
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/background/update.py", line 374, in _index
2023-11-06 14:01:28 danswer-stack-background-1     |     raise e
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/background/update.py", line 311, in _index
2023-11-06 14:01:28 danswer-stack-background-1     |     new_docs, total_batch_chunks = indexing_pipeline(
2023-11-06 14:01:28 danswer-stack-background-1     |                                    ^^^^^^^^^^^^^^^^^^
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/indexing/indexing_pipeline.py", line 87, in _indexing_pipeline
2023-11-06 14:01:28 danswer-stack-background-1     |     chain(*[chunker.chunk(document=document) for document in documents])
2023-11-06 14:01:28 danswer-stack-background-1     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/indexing/indexing_pipeline.py", line 87, in <listcomp>
2023-11-06 14:01:28 danswer-stack-background-1     |     chain(*[chunker.chunk(document=document) for document in documents])
2023-11-06 14:01:28 danswer-stack-background-1     |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/indexing/chunker.py", line 167, in chunk
2023-11-06 14:01:28 danswer-stack-background-1     |     return chunk_document(document)
2023-11-06 14:01:28 danswer-stack-background-1     |            ^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/indexing/chunker.py", line 139, in chunk_document
2023-11-06 14:01:28 danswer-stack-background-1     |     blurb=extract_blurb(chunk_text, blurb_size),
2023-11-06 14:01:28 danswer-stack-background-1     |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-06 14:01:28 danswer-stack-background-1     |   File "/app/danswer/indexing/chunker.py", line 28, in extract_blurb
2023-11-06 14:01:28 danswer-stack-background-1     |     return blurb_splitter.split_text(text)[0]
2023-11-06 14:01:28 danswer-stack-background-1     |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
2023-11-06 14:01:28 danswer-stack-background-1     | IndexError: list index out of range

You should be able to reproduce the problem trying to recursively scrape https://www.eman.cz/
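
The failing frame here is `blurb_splitter.split_text(text)[0]`. Taking element 0 raises `IndexError` if the splitter returns an empty list, which could plausibly happen for empty or whitespace-only chunk text (that cause is an assumption; the thread does not confirm it). A minimal sketch follows, using a stand-in `split_sentences` function rather than the splitter Danswer actually uses; `extract_blurb_safe` is a hypothetical name.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Stand-in splitter (hypothetical): returns non-empty sentences only."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_blurb_unsafe(text: str) -> str:
    # Same shape as the failing line: assumes at least one split result.
    return split_sentences(text)[0]


def extract_blurb_safe(text: str) -> str:
    # Fall back to an empty blurb when the splitter yields nothing.
    pieces = split_sentences(text)
    return pieces[0] if pieces else ""


print(extract_blurb_safe("First sentence. Second sentence."))
print(repr(extract_blurb_safe("   ")))

try:
    extract_blurb_unsafe("   ")
except IndexError as exc:
    print(f"unguarded blurb failed: {exc}")
```

Whether an empty blurb is acceptable, or whether such chunks should be dropped before indexing, is a design choice this sketch leaves open.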

kyleboddy commented 10 months ago

I am also running into this when ingesting a large number of Slack documents/messages. I posted to the Danswer Slack support channel as well.

kyleboddy commented 10 months ago

Logs when using grep to check and getting the 5 lines before/after those errors:

kyle@danswer:~$ sudo docker logs danswer-stack-background-1 | grep range -C 5
/usr/local/lib/python3.11/site-packages/supervisor/options.py:474: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  self.warnings.warn(
[2024-01-04 03:18:57,388: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
01/04/2024 03:18:58 AM          document.py 240 : [Attempt ID: 11] Upserted 16 document store entries into DB
01/04/2024 03:18:59 AM            timing.py  30 : [Attempt ID: 11] index_doc_batch took 0.564619779586792 seconds
01/04/2024 03:19:00 AM          document.py 240 : [Attempt ID: 11] Upserted 16 document store entries into DB
01/04/2024 03:19:00 AM      run_indexing.py 179 : [Attempt ID: 11] Connector run ran into exception after elapsed time: 7197.4683492183685 seconds
01/04/2024 03:19:00 AM      run_indexing.py 260 : [Attempt ID: 11] Indexing job with ID '11' failed due to list index out of range
Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 249, in run_indexing_entrypoint
    _run_indexing(
  File "/app/danswer/background/indexing/run_indexing.py", line 198, in _run_indexing
    raise e
--
    blurb=extract_blurb(chunk_text, blurb_size),
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/chunker.py", line 28, in extract_blurb
    return blurb_splitter.split_text(text)[0]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
01/04/2024 03:19:02 AM            update.py 317 : Running update, current UTC time: 2024-01-04 03:19:02
01/04/2024 03:19:02 AM            update.py 321 : Found existing indexing jobs: [(11, 'running')]
01/04/2024 03:19:02 AM            update.py 249 : Found 2 new indexing tasks.
[2024-01-04 03:19:02,804: INFO/MainProcess] Task check_for_document_sets_sync_task[d5143aef-95f9-4863-9f61-3586cf92a757] received
[2024-01-04 03:19:02,828: INFO/MainProcess] Task check_for_document_sets_sync_task[d5143aef-95f9-4863-9f61-3586cf92a757] succeeded in 0.023659816999497707s: None
--
01/04/2024 05:44:10 AM          document.py 197 : [Attempt ID: 14] `document_metadata_batch` is empty. Skipping.
01/04/2024 05:44:10 AM          document.py 240 : [Attempt ID: 14] Upserted 0 document store entries into DB
01/04/2024 05:44:10 AM            timing.py  30 : [Attempt ID: 14] index_doc_batch took 0.05938720703125 seconds
01/04/2024 05:44:11 AM          document.py 240 : [Attempt ID: 14] Upserted 16 document store entries into DB
01/04/2024 05:44:11 AM      run_indexing.py 179 : [Attempt ID: 14] Connector run ran into exception after elapsed time: 5938.845871925354 seconds
01/04/2024 05:44:11 AM      run_indexing.py 260 : [Attempt ID: 14] Indexing job with ID '14' failed due to list index out of range
Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 249, in run_indexing_entrypoint
    _run_indexing(
  File "/app/danswer/background/indexing/run_indexing.py", line 198, in _run_indexing
    raise e
--
    blurb=extract_blurb(chunk_text, blurb_size),
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/chunker.py", line 28, in extract_blurb
    return blurb_splitter.split_text(text)[0]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
[2024-01-04 05:44:12,682: INFO/MainProcess] Task check_for_document_sets_sync_task[8d1d74b8-d8fa-4d9c-bfb5-23b67947d261] received
[2024-01-04 05:44:12,706: INFO/MainProcess] Task check_for_document_sets_sync_task[8d1d74b8-d8fa-4d9c-bfb5-23b67947d261] succeeded in 0.022472063999884995s: None
[2024-01-04 05:44:12,484: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
01/04/2024 05:44:13 AM            update.py 317 : Running update, current UTC time: 2024-01-04 05:44:13
01/04/2024 05:44:13 AM            update.py 321 : Found existing indexing jobs: [(14, 'running')]
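
Because `grep -C 5` only keeps five lines either side of each match, the middle frames of the tracebacks above are cut off at the `--` separators. One way to capture whole tracebacks is a small filter like the sketch below. The function name `extract_tracebacks` and the sample lines are illustrative, and it assumes plain `docker logs` output where frame lines are indented and the exception line is not.

```python
from typing import Iterable, List


def extract_tracebacks(lines: Iterable[str]) -> List[str]:
    """Collect each complete traceback, from its header to the exception line."""
    blocks: List[str] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Traceback (most recent call last):"):
            current = [line]
        elif current:
            current.append(line)
            # Frame and source lines are indented; the first unindented
            # line after the header is the exception message.
            if line and not line.startswith(" "):
                blocks.append("\n".join(current))
                current = []
    return blocks


sample = [
    "01/04/2024 03:19:00 AM run_indexing.py 260 : Indexing job failed\n",
    "Traceback (most recent call last):\n",
    '  File "/app/danswer/indexing/chunker.py", line 28, in extract_blurb\n',
    "    return blurb_splitter.split_text(text)[0]\n",
    "IndexError: list index out of range\n",
    "01/04/2024 03:19:02 AM update.py 317 : Running update\n",
]

for block in extract_tracebacks(sample):
    print(block)
    print("--")
```

To run it on real logs, pass `sys.stdin` as `lines` and pipe the container logs into the script. Chained exceptions would be split at the first exception line, so this is only a starting point.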

kyleboddy commented 10 months ago

Adding more logs here that also have the duration of the timeout. When I just try to ingest the "general" channel I run into the same problem.

[2024-01-04 07:32:57,554: INFO/MainProcess] Scheduler: Sending due task check-for-document-set-sync (check_for_document_sets_sync_task)
01/04/2024 07:32:58 AM          document.py 164 : [Attempt ID: 16] No documents to upsert. Skipping.
01/04/2024 07:32:58 AM          document.py 197 : [Attempt ID: 16] `document_metadata_batch` is empty. Skipping.
01/04/2024 07:32:58 AM          document.py 240 : [Attempt ID: 16] Upserted 0 document store entries into DB
01/04/2024 07:32:58 AM            timing.py  30 : [Attempt ID: 16] index_doc_batch took 0.05666708946228027 seconds
[2024-01-04 07:32:58,315: INFO/MainProcess] Task check_for_document_sets_sync_task[d416608c-d01d-4cc1-a95f-b075a7157654] received
[2024-01-04 07:32:58,338: INFO/MainProcess] Task check_for_document_sets_sync_task[d416608c-d01d-4cc1-a95f-b075a7157654] succeeded in 0.02200529600304435s: None
01/04/2024 07:32:59 AM          document.py 240 : [Attempt ID: 16] Upserted 16 document store entries into DB
01/04/2024 07:32:59 AM      run_indexing.py 179 : [Attempt ID: 16] Connector run ran into exception after elapsed time: 5916.849744796753 seconds
01/04/2024 07:32:59 AM      run_indexing.py 260 : [Attempt ID: 16] Indexing job with ID '16' failed due to list index out of range
Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 249, in run_indexing_entrypoint
    _run_indexing(
  File "/app/danswer/background/indexing/run_indexing.py", line 198, in _run_indexing
    raise e
  File "/app/danswer/background/indexing/run_indexing.py", line 143, in _run_indexing
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/utils/timing.py", line 27, in wrapped_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 150, in index_doc_batch
    chain(*[chunker.chunk(document=document) for document in updatable_docs])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 150, in <listcomp>
    chain(*[chunker.chunk(document=document) for document in updatable_docs])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/chunker.py", line 173, in chunk
    return chunk_document(document)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/chunker.py", line 145, in chunk_document
    blurb=extract_blurb(chunk_text, blurb_size),
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/chunker.py", line 28, in extract_blurb
    return blurb_splitter.split_text(text)[0]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

kyleboddy commented 10 months ago

This was possibly addressed by this code change:

https://github.com/danswer-ai/danswer/pull/910

[Issue] List index out of range #517