castorini / rank_llm

Repository for prompt-decoding using LLMs (GPT3.5, GPT4, Vicuna, and Zephyr)
http://rankllm.ai
Apache License 2.0

What about MS-Marco documents (not passages) #106

Closed: marioeljuga closed this issue 2 months ago

marioeljuga commented 3 months ago

Hello,

First of all, thank you for the numerous great papers and models.

My question is, why do most models train/test on the MS MARCO passage dataset? Why not also use the MS MARCO document corpus? As far as I know, only RankLLaMA has delved into document datasets, while RankZephyr, monoT5, etc. focus only on passages. Documents provide lengthier texts, so a model trained on them could also work well in cases where the documents to be ranked contain many tokens.

ronakice commented 2 months ago

Yup, this is definitely a possibility. But with longer documents (especially for something relatively simple like the MS MARCO document task), what we usually end up doing is segmenting/chunking the document, retrieving/ranking the most relevant segments, and then running inference over the best representative segment with the passage-trained models. This way you can run these models as is, on the representative segment. This goes right back to the Expando-Mono-Duo-T5 days (https://arxiv.org/abs/2101.05667). I don't know that the MS MARCO document task informs these models very well about document ranking.
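
For concreteness, here is a minimal, library-agnostic sketch of that segment-then-rank pipeline (MaxP-style aggregation: a document is only as relevant as its best passage). The chunking parameters and `score_passage` are hypothetical placeholders, not part of the rank_llm API; `score_passage` stands in for any passage-trained reranker.

```python
# Sketch of segment-then-rank document scoring with a passage-trained model.
# `score_passage` is a hypothetical stand-in for a passage reranker and is
# NOT a rank_llm function; window/stride values are illustrative only.

from typing import Callable, List, Tuple


def chunk_document(text: str, window: int = 200, stride: int = 100) -> List[str]:
    """Split a long document into overlapping word-level passages."""
    words = text.split()
    if len(words) <= window:
        return [text]
    passages = []
    for start in range(0, len(words) - window + stride, stride):
        passages.append(" ".join(words[start:start + window]))
    return passages


def rank_documents(
    query: str,
    documents: List[Tuple[str, str]],            # (doc_id, full document text)
    score_passage: Callable[[str, str], float],  # hypothetical passage scorer
) -> List[Tuple[str, float, str]]:
    """Rank documents by the score of their best (representative) passage."""
    results = []
    for doc_id, text in documents:
        passages = chunk_document(text)
        scored = [(score_passage(query, p), p) for p in passages]
        best_score, best_passage = max(scored, key=lambda x: x[0])
        results.append((doc_id, best_score, best_passage))
    # Sort documents by their representative-passage score, best first.
    return sorted(results, key=lambda x: x[1], reverse=True)
```

Note that a listwise reranker such as RankZephyr orders candidates rather than emitting pointwise scores, so in practice you would feed it the segments and keep each document's best-ranked segment as the representative; the sketch above just makes the max-over-segments aggregation explicit.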