
blog/carbonfact-nlp-open-problem/ #23


utterances-bot commented 1 year ago

NLP at Carbonfact: how would you do it? • Max Halford

The task. I work at a company called Carbonfact. Our core value proposition is computing the carbon footprint of clothing items, expressed in carbon dioxide equivalent, $kgCO_2e$ for short. For instance, we started by measuring the footprint of shoes, no pun intended. We do these measurements with life cycle analysis (LCA) software we built ourselves. We use these analyses to fuel higher-level tasks for our clients, such as carbon accounting and sustainable procurement.

https://maxhalford.github.io/blog/carbonfact-nlp-open-problem/

raphaelsty commented 1 year ago

Here is my attempt at treating it as a slot-filling problem (extractive question answering + entity linking). It's nice that you shared a sample of the data (with annotations). ✌️
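Roughly, the extractive question answering step could look like this (a minimal sketch assuming the Hugging Face transformers library; the model name, composition string, and questions are illustrative, not my actual setup):

```python
# Minimal sketch of the extractive question answering step, assuming the
# Hugging Face transformers library. Model name and questions are illustrative.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

composition = "Upper: 60% recycled polyester 40% cotton / Sole: 100% rubber"

# One question per slot to fill; a separate entity linking step would then
# map the extracted spans to canonical material names.
for slot, question in {
    "upper": "What is the upper made of?",
    "sole": "What is the sole made of?",
}.items():
    result = qa(question=question, context=composition)
    print(slot, "->", result["answer"], f"(score={result['score']:.2f})")
```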

jasmeet0817 commented 9 months ago

Hi Max,

This is Jasmeet. I currently work at Google but recently quit to move into the LCA space. I've been researching Carbonfact, as I find your solution interesting and much needed in the industry right now; I even spoke to some people there. Anyway, my research led me here.

I think this problem might be ideal to solve with an LLM prompt. It should be able to handle typos, different formats, different separators, etc. If not, you can explicitly instruct it to clean the text before restructuring the data. If the prompt is too long and exceeds the prompt length limit, you can try chain-of-thought prompting.
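As a rough illustration, here is a sketch assuming the OpenAI Python client; the model name and the output schema are placeholders I made up, not a tested setup:

```python
# Sketch of the prompting idea, assuming the OpenAI Python client.
# The model name and the output schema are placeholders, not a tested setup.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_composition(composition: str) -> list[dict]:
    prompt = (
        "Normalize the following clothing composition. Fix typos, unify "
        "separators, and map materials to canonical names. Respond with a "
        'JSON list of {"component": str, "material": str, "share": float} '
        "objects and nothing else.\n\n"
        f"Composition: {composition}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # In practice the output should be validated, e.g. shares summing to 100.
    return json.loads(response.choices[0].message.content)

print(parse_composition("uper: 60% polyestr, 40 % coton"))
```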

If the prompt yields good results but accuracy can still be improved, you can try fine-tuning (although I wouldn't recommend it). Happy to connect more if you're curious about this solution; I think it's fairly low effort, high reward.

MaxHalford commented 9 months ago

Hey there Jasmeet, glad to e-meet you.

> I think this problem might be ideal to solve with an LLM prompt. It should be able to handle typos, different formats, different separators, etc. If not, you can explicitly instruct it to clean the text before restructuring the data. If the prompt is too long and exceeds the prompt length limit, you can try chain-of-thought prompting.

I completely agree. In fact, I use GitHub Copilot in my Visual Studio Code editor, and it's really good at normalizing and parsing compositions. I have no doubt an LLM is a powerful solution to this problem. My issue, however, is that I have thousands of compositions to parse in under a few minutes. I don't want to spend time setting up something complicated to make this work. Moreover, I want to be "in the loop", so that I have a way to fix whatever mistakes the LLM makes. Admittedly, a lot of this comes down to me spending time on the problem. I simply don't have the time to give LLMs a serious shot at it. #startuplife

> If the prompt yields good results but accuracy can still be improved, you can try fine-tuning (although I wouldn't recommend it).

I agree that fine-tuning doesn't seem viable in general.

> Happy to connect more if you're curious about this solution; I think it's fairly low effort, high reward.

Again, I wouldn't be so sure. But I'd be glad to be wrong.

jasmeet0817 commented 9 months ago

Hello. Thanks for the context.

Speed: in my experience, LLMs compute quite fast; the only bottleneck is the prompt length, which I understand is your concern here, since you have thousands of compositions. But AFAICT, you could split these into X compositions per prompt (or just one per prompt) and send parallel RPCs to evaluate them. You should ideally get results back in seconds (if the server is close enough, which you can control).
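Something along these lines (a sketch reusing the hypothetical parse_composition from my earlier comment; the worker count is arbitrary):

```python
# Sketch of the fan-out idea: one composition per prompt, requests sent
# concurrently. Reuses the hypothetical parse_composition defined above;
# the worker count is arbitrary and rate limits would need handling.
from concurrent.futures import ThreadPoolExecutor

def parse_all(compositions: list[str]) -> list[list[dict]]:
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(parse_composition, compositions))
```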

Feedback loop: I think this is about accuracy. I understand that you want 100% accuracy here, but ML is never 100%. There are some LLM prompting tricks (which don't always work) to make the model skip a composition when it's unsure, which would tell you which compositions need human review.
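For example, the hypothetical prompt above could be extended with something like this (again, just a sketch):

```python
# Sketch of the abstention trick: extend the hypothetical prompt so the model
# flags compositions it cannot parse confidently instead of guessing.
ABSTAIN = (
    "If you are not confident about any part of the composition, respond "
    'with {"unsure": true} and nothing else, so a human can review it.'
)

# Downstream, unsure results can be routed to manual review.
results = [parse_composition(c) for c in compositions]
needs_review = [
    c for c, r in zip(compositions, results)
    if isinstance(r, dict) and r.get("unsure")
]
```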

But yeah, in the end any ML solution comes down to trying it out. Good luck :)

MaxHalford commented 9 months ago

> Speed: in my experience, LLMs compute quite fast; the only bottleneck is the prompt length, which I understand is your concern here, since you have thousands of compositions. But AFAICT, you could split these into X compositions per prompt (or just one per prompt) and send parallel RPCs to evaluate them. You should ideally get results back in seconds (if the server is close enough, which you can control).

That's super smart, good thinking outside the box! It's been a while since I worked on this, so I didn't consider this batching approach at the time. I'll keep it in mind.