LinWeizheDragon / FLMR

The huggingface implementation of Fine-grained Late-interaction Multi-modal Retriever.

Confused about doc_idx: when I split passage_ds and ds, doc_idx is very large #13

Closed zzk2021 closed 1 month ago

zzk2021 commented 2 months ago

[image]

I looked into the code, but I do not understand doc_idx. Is it related to the query tokenizer? [image]

[image] When I run it, it raises an error.

LinWeizheDragon commented 1 month ago

retrieved_docs is returned by the ColBERT engine. doc[0] is the index of the document that the engine retrieved, and it refers to a position in passage_contents. For example, if the doc_id is 0, then the retrieved document is the first document in passage_contents. You are probably getting an out-of-range index because you changed the corpus size, len(passage_contents), without regenerating the index, so the old index still points to positions in the previous corpus. You should delete the previously generated index files and rerun the script with --run_indexing.
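
To illustrate the mapping described above, here is a minimal sketch with toy data. It is not the repository's code: lookup_passages is a hypothetical helper, and the (doc_idx, rank, score) tuple layout is an assumption. The only thing taken from the explanation is that doc[0] is a position in passage_contents. The bounds check shows how a stale index would surface as an out-of-range doc_idx.

```python
def lookup_passages(retrieved_docs, passage_contents):
    """Map each retrieved entry to its passage text via doc[0].

    Hypothetical helper for illustration only. Each entry in
    retrieved_docs is assumed to be a tuple whose first element
    is an index into passage_contents.
    """
    corpus_size = len(passage_contents)
    passages = []
    for doc in retrieved_docs:
        doc_idx = doc[0]
        # An index outside the corpus suggests the index files were
        # built from a different (larger) corpus than the one loaded now.
        if not 0 <= doc_idx < corpus_size:
            raise IndexError(
                f"doc_idx {doc_idx} is out of range for a corpus of "
                f"{corpus_size} passages; the index may be stale"
            )
        passages.append(passage_contents[doc_idx])
    return passages


# Toy corpus and toy retrieval results (assumed tuple layout).
passage_contents = ["first passage", "second passage", "third passage"]
retrieved_docs = [(0, 1, 25.3), (2, 2, 19.8)]

print(lookup_passages(retrieved_docs, passage_contents))
```

If the check raises, the remedy is the one given above: delete the old index files and rebuild with --run_indexing so that the index and passage_contents describe the same corpus.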