Open philipperolet opened 10 months ago
Factuality: RAG vs memory via Finetuning (or use both)?
What do you mean by memory? Finetuning requires a dataset. What would it be? Certainly more than a week's work?
Factuality: survey techniques for detecting hallucinations
Looks a bit shallow. It's just an added step, but there is something about model calibration that could be interesting (see below).
Chunking: optimal length
Not sure I see what the experiment would be here. It relies on the existence of a benchmark, right? Would you use a public benchmark for that (which ones are there?)
Chunking: optimal strategy (overlap? link handling? )
Same remark here?
Reasoning: using CoT or ToT in assistants
If it's in the context of assistants, then it also relies on the existence of a benchmark that we trust, no?
Reasoning: how to properly chain assistants
I think this is more product than "research"
Evaluation: how to assess answer quality
Looks like a requirement for many other ideas? Probably worth framing this a bit more, because there are a lot of questions around that.
Internal benchmark: looks like many other ideas rely on the existence of a benchmark we trust. We could look into creating an internal benchmark for RAG setups. Many questions there. Do we freeze the RAG? Do we have a ground-truth RAG as part of it? What would the benchmark look like? I presume some constitution about the content of the answer?
Explore model calibration: Models are supposed to be calibrated. Meaning that if, given an answer, you ask a yes/no question about that answer (is this correct (Y/N), is this factual (Y/N), whatever), the probability that the answer actually has the attribute is supposed to equal the probability the model assigns to the token Y. This is also related to evals but could also have interesting product implications (giving a confidence estimate for the answer; does not feel like a burning product need, but putting it out there).
The two above kind of work together in a sense as any benchmark we build will likely rely on the calibration of models...
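To make the calibration idea above concrete, here is a minimal sketch of how it could be checked, assuming we already have, for each answer, the probability the model puts on the token Y and a human ground-truth label. The function names, the binning scheme, and the sample numbers are all hypothetical illustrations, not an existing tool or real results.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    p_yes: Sequence[float], labels: Sequence[bool], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Bin answers by P("Y") and compare mean confidence to the observed rate.

    Returns one (mean_confidence, observed_rate, count) tuple per non-empty bin.
    """
    if len(p_yes) != len(labels):
        raise ValueError("p_yes and labels must have the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, ok in zip(p_yes, labels):
        # Clamp the index so that p == 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, ok))
    table = []
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        rate = sum(1 for _, ok in items if ok) / len(items)
        table.append((mean_conf, rate, len(items)))
    return table


def expected_calibration_error(
    p_yes: Sequence[float], labels: Sequence[bool], n_bins: int = 5
) -> float:
    """Count-weighted average gap between confidence and observed rate."""
    table = calibration_table(p_yes, labels, n_bins)
    total = sum(count for _, _, count in table)
    if total == 0:
        return 0.0
    return sum(count * abs(conf - rate) for conf, rate, count in table) / total


# Hypothetical data: P("Y") for "is this answer correct?" and human labels.
sample_p_yes = [0.75, 0.75, 0.75, 0.75]
sample_labels = [True, True, True, False]
print(expected_calibration_error(sample_p_yes, sample_labels))
```

A perfectly calibrated model would have a gap of zero in every bin. In practice the P("Y") values would have to come from token log-probabilities, which assumes the model API exposes them.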
Thanks a lot for the input. The topics I dropped were leads ("pistes"), admittedly not framed yet, but I will give more detail on the ones I think are worth pursuing when I get back to it. Your ideas are much more framed :) and would of course be a direct good fit; the internal benchmark would certainly bring direct value.
That said, pausing on this topic for a bit to focus on delivering Ahuna for now. Will get back to it soon.
:scientist: R&D burst
Hey team! Dust will soon run an R&D burst, a 1-week exploration (that I'd do, with assistance of course welcome) on a given topic (e.g. "How to improve model factuality" or "What's the best chunking strategy").
Why
:information_source: Forging informed convictions about key technical elements of our business. E.g. for model factuality, there is a debate on RAG vs using finetuning for memory => we currently favor RAG; we want to be able to support this with research, so when a competitor comes along doing FT, we can back our claim (and if it turns out FT can do good things, it's quite important that we know about it too).
:european_castle: Opportunity for a moat: if the burst outcome is conclusive--e.g. we find out that we can greatly improve factuality using one key research idea + smart engineering--we make a bet on 3 months of R&D. Product gets a golden boost, a strong differentiator vs potential competitors.
:speaker: Research marketing: just like speculative sampling when I joined, this allows us to raise awareness about Dust in one of the best possible ways (being in people's minds as "the experts"). In general, and related to the point about informed convictions, we want to show the world we are on top of the topics that are key to our business (on top of = expert + key opinion leader).
Topic ideas
Brain dump, to be collectively completed:
At this time we have intuitions for most of those => we can turn them into experiment-backed convictions and ideally turn those convictions into features.
Input welcome
We will frame the burst more rigorously soon (topic decision issue, framing issue, etc.)--for now, ideation phase, gathering general ideas and feedback
Thanks :bow: