Daria-Oni / EcoHack-Babassu-bots


tokenization of data #14

Open Daria-Oni opened 1 month ago

Daria-Oni commented 1 month ago

Splitting books or large PDFs into smaller pieces

We assume that a larger document makes it harder for an LLM to extract every valuable piece of data, both because of context-window limits and because relevant details can get lost in long inputs. Therefore, we include a step that splits the data into smaller pieces. Question: what is a reasonable chunk size, in pages or characters?
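For reference, many retrieval pipelines default to chunks of roughly 500–1,000 tokens (about 2,000–4,000 characters of English text), but the right size for our documents is something we'd need to test. As a first step, here is a minimal sketch of pulling plain text out of a PDF, assuming the pypdf library; the file path is illustrative:

```python
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Return the concatenated text of every page in a PDF."""
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, hence the "or ''".
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Illustrative usage:
# text = extract_pdf_text("book.pdf")
```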

How do we make sure we don't cut chunks mid-sentence, so that important sentences are not interrupted?
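One possible approach: split the extracted text into sentences first, then pack whole sentences into chunks up to a size budget, so no sentence is ever cut in half. Below is a minimal, untested sketch. The regex sentence splitter is naive (it will mis-handle abbreviations like "Dr.") and could be swapped for nltk or spaCy; `max_chars=4000` (roughly 1,000 tokens at ~4 characters per token) is an assumed default, not a recommendation from this repo.

```python
import re

def split_into_chunks(text: str, max_chars: int = 4000,
                      overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking only at sentence boundaries."""
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        # If adding this sentence would exceed the budget, close the chunk.
        if current and current_len + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            # Carry over the last few sentences as overlap between chunks.
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_len = sum(len(s) + 1 for s in current)
        current.append(sentence)
        current_len += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The small overlap keeps a bit of shared context between neighbouring chunks, so a fact that sits near a boundary still appears intact in at least one chunk.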