FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

Pass plaintext chunk to the LLM (instead of a stringified dict) #107

Closed andylolz closed 4 months ago

andylolz commented 4 months ago

Just here: https://github.com/FullFact/health-misinfo-shared/blob/3a900f95285e5263a0610ce6f247df7fd39bf473/src/health_misinfo_shared/fine_tuning.py#L281

^^ chunk here is this sort of thing:

{'text': "foreign so a lot of people like to use baking soda in many ways some people use it as an exfoliant and some people use it as a spot treatment for acne now if you're using it and it's working well for you I have no problem if you're continuing to use it however I have a couple of issues with baking soda first of all it irritates a lot of people's skin and when you irritate the skin you damage the Skin Barrier and that can actually lead to more inflammation and more acne secondly our skin is meant to be at an acidic pH and baking soda is sodium bicarbonate which is alkaline so when you put baking soda on your face you're actually changing the pH of your skin and I don't think that's the best thing to do again our skin was meant to be an acidic pH and we should do everything we can to keep that equilibrium oh", 'start_offset': 0.179, 'end_offset': None}

^^ This isn’t valid JSON: it’s the Python repr of the dict (single quotes, None rather than null), because we’re not using json.dumps. But also: the offset timestamps are probably not relevant, and don’t need to be passed.
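
A quick way to see the JSON point, as a rough sketch. It uses a shortened copy of the chunk above, and is_valid_json is a throwaway helper written for this illustration, not something from the repo:

```python
import json

# Shortened copy of the chunk shown above (same keys, trimmed text).
chunk = {
    "text": "baking soda is sodium bicarbonate which is alkaline",
    "start_offset": 0.179,
    "end_offset": None,
}

# Interpolating a dict into a string gives its Python repr.
as_repr = str(chunk)

# json.dumps gives real JSON (double quotes, null instead of None).
as_json = json.dumps(chunk)


def is_valid_json(s: str) -> bool:
    """Throwaway helper for this illustration: does s parse as JSON?"""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))
print(is_valid_json(as_json))
```

Either way, the offsets would still end up in the prompt, so serialising properly only fixes half of the problem.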

Should this instead be:

prompt = f"{infer_prompt}\n```{chunk['text']}```" 
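
To make the difference concrete, here is a minimal sketch. The infer_prompt value and the build_prompt helper are placeholders invented for this illustration (the real prompt lives in fine_tuning.py), and the "current" version is my reading of what the linked line does, not a copy of it:

```python
# Placeholder instruction text; an assumption, not the repo's actual prompt.
infer_prompt = "Find the health claims in the following transcript:"

# Shortened copy of the chunk shown above.
chunk = {
    "text": "baking soda is sodium bicarbonate which is alkaline",
    "start_offset": 0.179,
    "end_offset": None,
}

# Triple-backtick delimiter, built up so it doesn't clash with this code fence.
FENCE = "`" * 3


def build_prompt(chunk: dict) -> str:
    """Hypothetical helper: pass only the transcript text to the LLM."""
    return f"{infer_prompt}\n{FENCE}{chunk['text']}{FENCE}"


# Assumed current behaviour: the whole dict is interpolated, so the model
# sees the dict repr, including the keys and the offsets.
current_prompt = f"{infer_prompt}\n{FENCE}{chunk}{FENCE}"

# Proposed behaviour: plaintext only.
proposed_prompt = build_prompt(chunk)

print(current_prompt)
print(proposed_prompt)
```

With the proposed version the model only sees the transcript text between the delimiters, with no dict syntax and no timestamps.
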
andylolz commented 4 months ago

Okay – looking back through the commit history, I am more confident that this is a bug. I think it was introduced in 39d756a (where chunk became a tuple) and got a bit worse in f4218d1b.