Ideation : R&D Burst

philipperolet commented 10 months ago

:scientist: R&D burst

Hey team! Dust will soon run an R&D burst: a 1-week exploration (that I'd do, with assistance of course welcome) on a given topic (e.g. "How to improve model factuality" or "What's the best chunking strategy").

Why

:information_source: Forging informed convictions about key technical elements of our business. E.g. for model factuality, there is a debate: RAG vs using finetuning for memory => we currently favor RAG; we want to be able to support this with research, so when a competitor comes along doing FT, we can back our claim (and if it turns out FT can do good things, it's quite important that we know about it too)

:european_castle: Opportunity for a moat. If the burst outcome is conclusive--e.g. we find out that we can greatly improve factuality using one key research idea + smart engineering--we make a bet on 3 months of R&D. Product gets a golden boost, a strong differentiator vs potential competitors

:speaker: Research marketing. Just like speculative sampling when I joined, this allows us to raise awareness about Dust in one of the best possible ways (be in people's minds as "the experts"). More generally, and related to the point about informed convictions, we want to show the world we are on top of topics that are key to our business (on top of = expert + key opinion leader)

Topic ideas

Brain dump, to be collectively completed:

Factuality: RAG vs memory via Finetuning (or use both)?
Factuality: survey techniques for detecting hallucinations
Chunking: optimal length
Chunking: optimal strategy (overlap? link handling?)
Reasoning: using CoT or ToT in assistants
Reasoning: how to properly chain assistants
Evaluation: how to assess answer quality

At the moment we have intuitions for most of those => we can turn them into experiment-backed convictions and ideally turn those convictions into features.
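To make the two chunking topics concrete, here is a minimal sketch of the knobs involved, assuming a document already split into tokens. The function name `chunk_tokens` and the parameters `chunk_len` and `overlap` are hypothetical, chosen for illustration; this is not how Dust chunks documents today.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_len tokens.

    Consecutive windows share `overlap` tokens. Both parameters are
    illustrative knobs that an experiment could sweep over.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")

    step = chunk_len - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_len])
        # Stop once a window has reached the end of the document,
        # so we do not emit a trailing chunk made only of overlap.
        if start + chunk_len >= len(tokens):
            break
    return chunks
```

An experiment on "optimal length" or "overlap" would then amount to sweeping `chunk_len` and `overlap` and measuring retrieval or answer quality for each setting, which presupposes some way to score answers.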

Input welcome

We will frame the burst more rigorously soon (topic decision issue, framing issue, etc.)--for now, ideation phase, gathering general ideas and feedback

your opinion if any on above topics
topic suggestions
how to best do this for Dust
any relevant idea or comment, really

Thanks :bow:

spolu commented 10 months ago

Comments

Factuality: RAG vs memory via Finetuning (or use both)?

What do you mean by memory? Finetuning requires a dataset. What would it be? Certainly more than a week's work?

Factuality: survey techniques for detecting hallucinations

Looks a bit shallow. It's just an added step, but there's something about model calibration that could be interesting (see below)

Chunking: optimal length

Not sure I see what the experiment would be here. It relies on the existence of a benchmark, right? Would you use a public benchmark for that (what are they?)

Chunking: optimal strategy (overlap? link handling?)

Same remark here?

Reasoning: using CoT or ToT in assistants

If it's in the context of assistants then it also relies on the existence of a benchmark that we trust, no?

Reasoning: how to properly chain assistants

I think this is more product than "research"

Evaluation: how to assess answer quality

Looks like a requirement for many other ideas? Probably worth framing this a bit more because there are a lot of questions around that?

Ideas

Internal benchmark: looks like many other ideas rely on the existence of a benchmark we trust. We could look into creating an internal benchmark for RAG setups. Many questions there. Do we freeze the RAG? Do we have ground truth RAG part of it? What would the benchmark look like? I presume some constitution about the content of the answer?
Explore model calibration: models are supposed to be calibrated, meaning that if, given an answer, you ask a yes/no question about it (is this correct (Y/N), is this factual (Y/N), whatever), the probability that the answer actually has the attribute is supposed to equal the probability the model assigns to the token Y. This is also related to evals but could also have interesting product implications (e.g. giving a confidence estimate for the answer; does not feel like a burning product need but putting it out there)

The two above kind of work together in a sense as any benchmark we build will likely rely on the calibration of models...
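A rough way to check the calibration claim, as a sketch only: collect, for a set of answers, the probability the model gave to the token Y and a human label saying whether the answer really had the attribute, then compare the two per confidence bin. This is the standard expected calibration error; the function name, the input format and the 10-bin default are assumptions for illustration, and getting P(Y) out of a given model API is not covered here.

```python
from typing import List, Tuple


def expected_calibration_error(
    preds: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """Expected calibration error over (p_yes, label) pairs.

    p_yes is the probability the model assigned to the token Y,
    label is True when the answer really had the attribute.
    The input format is an assumption made for this sketch.
    """
    if not preds:
        raise ValueError("preds must not be empty")

    # Group predictions into equal-width confidence bins on [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p_yes, label in preds:
        idx = min(int(p_yes * n_bins), n_bins - 1)  # p_yes == 1.0 goes in the last bin
        bins[idx].append((p_yes, label))

    total = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, y in bucket if y) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A value close to 0 means P(Y) can be read as a confidence estimate; a large value means the Y/N trick should not be trusted as a grader without recalibration.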

Continuing what I started here https://github.com/dust-tt/dust/compare/spolu-x is also a possibility. The roadmap here is clear: eval mistral/llama-2/gpt-* using sampling, CoT, ToT on Game of 24 and MATH (which I managed to turn into a multi-step reasoning format quite easily using GPT-4 (still WIP)). This could be a great blogpost because these are interesting benchmarks (even if contaminated) that the scientific community cares about. I also have a crazy idea for a new "strategy" to get better results, but that's probably possible only once we've done that. All of this aligns very well with the communs numerique project + ToT on MATH is an uncovered variable that could have meaningful impact on the community.

philipperolet commented 10 months ago

Thanks a lot for the input. The topics I dropped were leads, admittedly not framed yet, but I will give more detail about the ones I think are worthwhile when I get back to it. Your ideas are much more framed :) they would of course be a direct good fit; the internal benchmark would certainly bring direct value

That said, pausing on this topic for a bit to focus on delivering Ahuna for now. Will get back to it soon

dust-tt / dust

Ideation : R&D Burst #2572
