AnswerDotAI / rerankers

A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
Apache License 2.0

Split document based on `input_ids` length and `max_position_embeddings` #32

Closed OllBroDer closed 1 month ago

OllBroDer commented 1 month ago

Hi there!

Fantastic library 😺

I was wondering if we could add the ability to split documents by `max_position_embeddings` instead of silently truncating them? Or, failing that, warn the user about the truncation?
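
For concreteness, the kind of splitting I have in mind looks roughly like this, using plain transformers calls (the model name, chunk size, and headroom are placeholders, not anything rerankers exposes today):

```python
from transformers import AutoConfig, AutoTokenizer

# Placeholder cross-encoder; any HF reranking model works the same way.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The model's positional limit, e.g. 512 for most BERT-style cross-encoders.
max_len = config.max_position_embeddings


def split_document(text: str, chunk_tokens: int) -> list[str]:
    """Split `text` into chunks of at most `chunk_tokens` tokens each."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [
        tokenizer.decode(ids[i : i + chunk_tokens])
        for i in range(0, len(ids), chunk_tokens)
    ]


# Leave headroom for the query and special tokens when building
# (query, chunk) pairs for the cross-encoder.
long_document = "..."  # some document longer than the model's context
chunks = split_document(long_document, chunk_tokens=max_len - 32)
```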

On that note, maybe we could also allow some transformers `**kwargs` in the model initializations, just to accommodate quality-of-life things such as `cache_dir` for the model or `truncation` for the tokenizer.
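
On the `**kwargs` side, what I mean is simply being able to forward arguments like these to the underlying transformers objects (a sketch of plain transformers usage, not of the rerankers API):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # placeholder model

# `cache_dir` is a standard `from_pretrained` argument in transformers.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, cache_dir="/path/to/hf-cache"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="/path/to/hf-cache")

# `truncation` / `max_length` are standard tokenizer call arguments.
inputs = tokenizer(
    "the query", "the document text",
    truncation=True, max_length=512, return_tensors="pt",
)
score = model(**inputs).logits
```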

Obviously, this is just related to the rankers that use Hugging Face.

EDIT: Apologies, in hindsight this should probably be 3-4 separate issues.

OllBroDer commented 1 month ago

Found this, so closing.