FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google

Optimise chunk size #114

Closed. dcorney closed this issue 3 months ago.

dcorney commented 4 months ago

Overview

In youtube_api.py, we split the transcript into a series of overlapping chunks. Smaller chunks (e.g. 1500 characters, or about 200-300 words) make links from the transcript back to the video more fine-grained, while larger chunks (e.g. 5000 characters, c. 1000 words) give the model more context, which might improve quality.
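For reference, the chunking step amounts to something like the sketch below. This is a minimal illustration, assuming character-based chunks with a fixed overlap; the function name and the `chunk_size`/`overlap` parameters are assumptions, not the actual youtube_api.py code.

```python
def split_transcript(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split a transcript into overlapping chunks of roughly chunk_size characters.

    Consecutive chunks share `overlap` characters so that a claim falling on a
    chunk boundary is still seen in full by at least one prompt.
    """
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```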

Also, the prompts currently say things like "Find up to 5 claims that ...", i.e. up to 5 claims per prompt (and hence per chunk). So without further changes, longer chunks mean fewer chunks and therefore a lower cap on the total number of claims, while shorter chunks mean a higher cap. That's not good! The number of genuine checkworthy claims in a video is obviously independent of the chunk size.
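To make the interaction concrete, here is a rough back-of-the-envelope illustration of how the claim cap scales with chunk size when the per-prompt limit stays fixed. The transcript length and the "5 claims per prompt" figure are illustrative assumptions only.

```python
transcript_length = 30_000  # characters in one transcript (illustrative)
claims_per_prompt = 5       # fixed "up to 5 claims" instruction per chunk

for chunk_size in (500, 1500, 5000):
    n_chunks = -(-transcript_length // chunk_size)  # ceiling division, ignoring overlap
    cap = n_chunks * claims_per_prompt
    print(f"chunk_size={chunk_size}: ~{n_chunks} chunks, total cap of ~{cap} claims")
```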

There is no perfect answer, especially in an MVP. The goal here is to do some quick exploration and hopefully learn something.

Requirements

c-j-johnston commented 3 months ago

Conclusions:

I tried chunks of 1500 (the current value), 500, and 3000.

I also changed the number of claims per chunk - I tried "up to 5 claims" (current value), "up to 10 claims" and simply "the claims".

On that basis, I believe the best route is to stick with chunks of 1500 characters but not to specify the number of claims in the prompt; the model does a good job of picking out an appropriate number of claims on its own.
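In practice the change amounts to dropping the numeric cap from the prompt phrasing, roughly along these lines. The exact wording of the project's prompts is not reproduced here; these template strings are assumptions for illustration.

```python
# Capped phrasing (the variant being dropped): limits each chunk to 5 claims.
CAPPED_PROMPT = (
    "Find up to 5 claims in the following transcript chunk that are "
    "checkworthy health claims:\n\n{chunk}"
)

# Uncapped phrasing (the preferred variant): lets the model decide how many
# claims a chunk actually contains.
UNCAPPED_PROMPT = (
    "Find the checkworthy health claims in the following transcript chunk:"
    "\n\n{chunk}"
)

# Usage: prompt = UNCAPPED_PROMPT.format(chunk=chunk)
```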