etalab-ia / albert-tchap

Bot for Tchap (the messaging app of the French State) using Albert, the French administration Artificial Intelligence agent
MIT License

Sources can appear duplicated on the !sources command when different chunks come from similar sources #39

Open bolinocroustibat opened 3 months ago

bolinocroustibat commented 3 months ago

We should de-duplicate similar sources.

[Screenshot attached: 2024-05-26 at 18:50:27]

dtrckd commented 3 months ago

Hey, I don't see any duplicated sources. The sheets are duplicated (e.g. F[0-9]{6}), but sources are derived from chunks, and a sheet can contain multiple chunks. So it is not surprising to find the same sheet referenced several times. But if you look at the context (the text in parentheses), they are all different. The context is like a breadcrumb of the chunk inside the sheet. And finally, there are no direct links to chunks: since they come from the same sheet, they share the same link.

dtrckd commented 3 months ago

Ah, I missed the ones that are actually real duplicates.

EDIT: my bad, I don't see duplicates in fact. I confused the sources inside the answer with the actual sources from !sources.

bolinocroustibat commented 3 months ago

@pedevineau Can you confirm you're OK with the current !sources behaviour? @dtrckd This was opened after some user feedback; it might be better to reopen it while we make sure we all agree on the current behaviour.

dtrckd commented 3 months ago

What can be done is to add an anchor to the link of each chunk. For example:

https://www.service-public.fr/particuliers/vosdroits/F59#chunk1

Even if the anchor does not exist on the page, it can give the user a hint as to why several sources share the same URL.

Let me know if you have a better idea.
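
A minimal sketch of the anchor idea, using only the Python standard library. The helper name `with_chunk_anchor` and the `chunk<N>` naming are assumptions for illustration, not the bot's actual code:

```python
from urllib.parse import urlsplit, urlunsplit


def with_chunk_anchor(url: str, chunk_index: int) -> str:
    """Return `url` with a `#chunk<N>` fragment appended (hypothetical helper).

    The fragment is purely a visual hint: the target page is not expected
    to define a matching anchor.
    """
    parts = urlsplit(url)
    # Overwrite any existing fragment with the chunk hint.
    return urlunsplit(parts._replace(fragment=f"chunk{chunk_index}"))


sheet_url = "https://www.service-public.fr/particuliers/vosdroits/F59"
links = [with_chunk_anchor(sheet_url, i) for i in (1, 2)]
```

Both links still open the same page; only the fragment differs.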

pedevineau commented 3 months ago

How do we choose the titles related to chunks? My suggestion would be: let us return the title of the sheet once, with the URL. That way it will be easy to deduplicate.
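
A rough sketch of that suggestion, assuming each source is a dict with `title` and `url` keys (this shape, the function name `dedupe_by_sheet`, and the sample titles are hypothetical, not the real API response):

```python
def dedupe_by_sheet(sources: list[dict]) -> list[dict]:
    """Keep one entry per sheet URL, preserving first-seen order."""
    first_title_by_url: dict[str, str] = {}
    for source in sources:
        # setdefault keeps the title of the first chunk seen for this URL.
        first_title_by_url.setdefault(source["url"], source["title"])
    return [{"title": t, "url": u} for u, t in first_title_by_url.items()]


# Made-up sample data: two chunks from one sheet, one chunk from another.
sources = [
    {"title": "Sheet A", "url": "https://www.service-public.fr/particuliers/vosdroits/F59"},
    {"title": "Sheet A", "url": "https://www.service-public.fr/particuliers/vosdroits/F59"},
    {"title": "Sheet B", "url": "https://example.org/other-sheet"},
]
unique_sheets = dedupe_by_sheet(sources)
```

The trade-off is that the per-chunk context is dropped, so the user no longer sees which part of the sheet was used.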

dtrckd commented 3 months ago

The title of a chunk is the title of the sheet it comes from. The subtitle (context) is the path to that chunk in the sheet, composed of the successive subtitles met before reaching the chunk. The subtitle is the string that enables us to deduplicate (we also use a hash of the chunk as a unique identifier internally). But again, there are no duplicated chunks; they are already deduplicated in the backend.
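
To illustrate, here is a sketch of how a breadcrumb context and a hash identifier could work together. The `" > "` separator, the SHA-256 choice, and the `text`/`subtitles` field names are all assumptions; the backend may do this differently:

```python
import hashlib


def chunk_context(subtitles: list[str]) -> str:
    """Join the subtitles met on the way to the chunk into a breadcrumb."""
    return " > ".join(subtitles)


def chunk_id(text: str) -> str:
    """Hash the chunk text to get a stable identifier (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose text hash was already seen, keeping first-seen order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        cid = chunk_id(chunk["text"])
        if cid not in seen:
            seen.add(cid)
            unique.append(chunk)
    return unique


# Made-up sample data: the first two chunks have identical text.
chunks = [
    {"text": "First passage", "subtitles": ["Section 1", "Subsection a"]},
    {"text": "First passage", "subtitles": ["Section 1", "Subsection a"]},
    {"text": "Second passage", "subtitles": ["Section 2"]},
]
unique_chunks = dedupe_chunks(chunks)
contexts = [chunk_context(c["subtitles"]) for c in unique_chunks]
```

Two chunks from the same sheet keep distinct contexts even though they share a title and URL.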

pedevineau commented 3 months ago

Yes, I know there are no duplicate chunks. I was considering deduplicating sheets, because in the end every URL targets the same page. The anchor system doesn't work in general, because our chunks are not always related to the DILA webpage anchors, are they?

dtrckd commented 3 months ago

Yes, you're right, the anchor idea was just to give a visual hint.