FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google

Optimise chunk size #114

Closed. dcorney closed this issue 3 months ago.

dcorney commented 4 months ago

Overview

In youtube_api.py, we split the transcript into a series of overlapping chunks. Smaller chunks (e.g. 1500 characters, or about 200-300 words) make links from the transcript back to the video more fine-grained, while larger chunks (e.g. 5000 characters, c. 1000 words) give the model more context, which might improve quality.
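For reference, the chunking step amounts to something like the sketch below. This is a minimal illustration, assuming character-based chunks with a fixed overlap; the function name and the `chunk_size`/`overlap` parameters are assumptions, not the actual youtube_api.py code.

```python
def split_transcript(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split a transcript into overlapping chunks of roughly chunk_size characters.

    Consecutive chunks share `overlap` characters so that a claim falling on a
    chunk boundary is still seen in full by at least one prompt.
    """
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```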

Also, the prompts currently say things like "Find up to 5 claims that ...", i.e. up to 5 claims per prompt (and hence per chunk). So without further changes, longer chunks mean fewer chunks and therefore a lower cap on the total number of claims, while shorter chunks mean a higher cap. That's not good! The number of genuine checkworthy claims in a video is obviously independent of the chunk size.
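To make the interaction concrete, here is a rough back-of-the-envelope illustration of how the claim cap scales with chunk size when the per-prompt limit stays fixed. The transcript length and the "5 claims per prompt" figure are illustrative assumptions only.

```python
transcript_length = 30_000  # characters in one transcript (illustrative)
claims_per_prompt = 5       # fixed "up to 5 claims" instruction per chunk

for chunk_size in (500, 1500, 5000):
    n_chunks = -(-transcript_length // chunk_size)  # ceiling division, ignoring overlap
    cap = n_chunks * claims_per_prompt
    print(f"chunk_size={chunk_size}: ~{n_chunks} chunks, total cap of ~{cap} claims")
```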

There is no perfect answer, especially in an MVP. The goal here is to do some quick exploration and hopefully learn something.

Requirements

c-j-johnston commented 3 months ago

Conclusions:

I tried chunks of 1500 (the current value), 500, and 3000.

I also changed the number of claims per chunk - I tried "up to 5 claims" (current value), "up to 10 claims" and simply "the claims".

On that basis, I believe the best route is to stick with chunks of 1500 characters but not to specify the number of claims in the prompt; the model does a good job of picking out an appropriate number of claims on its own.
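In practice the change amounts to dropping the numeric cap from the prompt phrasing, roughly along these lines. The exact wording of the project's prompts is not reproduced here; these template strings are assumptions for illustration.

```python
# Capped phrasing (the variant being dropped): limits each chunk to 5 claims.
CAPPED_PROMPT = (
    "Find up to 5 claims in the following transcript chunk that are "
    "checkworthy health claims:\n\n{chunk}"
)

# Uncapped phrasing (the preferred variant): lets the model decide how many
# claims a chunk actually contains.
UNCAPPED_PROMPT = (
    "Find the checkworthy health claims in the following transcript chunk:"
    "\n\n{chunk}"
)

# Usage: prompt = UNCAPPED_PROMPT.format(chunk=chunk)
```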