Open bolinocroustibat opened 3 months ago
Hey, I don't see any duplicated sources.
They are duplicated sheets (e.g. F[0-9]{6}), but sources are derived from chunks, and a sheet can contain multiple chunks. So it is not surprising to find the same sheet referenced several times. But if you look at the context (the text in parentheses), they are all different. The context is like a breadcrumb of the chunk inside the sheet. And finally, there are no direct links to chunks: chunks from the same sheet all point to that sheet's URL, which explains why the same links appear several times.
Ah, I missed the ones that are actually real duplicates.
EDIT: my bad, I don't see any duplicates in fact. I confused the sources cited inside the answer with the actual sources returned by !sources.
@pedevineau Can you confirm you're OK with the current !sources behaviour?
@dtrckd This was opened after some user feedback; it might be better to reopen it while we make sure we all agree on the current behaviour.
What we can do is add an anchor to the link for each chunk. For example:
https://www.service-public.fr/particuliers/vosdroits/F59#chunk1
Even if the anchor does not exist, it can give the user a hint of why this is the actual same URL.
Let me know if you have a better idea.
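A minimal sketch of the anchor idea, assuming each chunk has an integer index within its sheet. The function name `with_chunk_anchor` and the `chunkN` fragment format are hypothetical, not existing project code:

```python
from urllib.parse import urlsplit, urlunsplit


def with_chunk_anchor(sheet_url: str, chunk_index: int) -> str:
    """Return sheet_url with a '#chunkN' fragment appended.

    The fragment is only a visual hint (assumption): the target page
    is not expected to define a matching anchor.
    """
    parts = urlsplit(sheet_url)
    # Replace any existing fragment with the chunk marker.
    return urlunsplit(parts._replace(fragment=f"chunk{chunk_index}"))


sheet = "https://www.service-public.fr/particuliers/vosdroits/F59"
links = [with_chunk_anchor(sheet, i) for i in (1, 2)]
```

Both links still open the same page; only the fragment differs, which is what makes them look distinct in the sources list.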
How do we choose the titles related to chunks? My suggestion would be: let us return the title of the sheet once, with the URL. That way it will be easy to deduplicate.
The title of a chunk is the title of the sheet it comes from. The subtitle (context) is the path to that chunk in the sheet, composed of the successive subtitles met before reaching the chunk. The subtitle is the string that enables us to deduplicate (we also use a hash of the chunk as a unique identifier internally). But again, there are no duplicated chunks; they are already deduplicated in the backend.
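To make the title/context/hash relationship concrete, here is a rough sketch. The `Chunk` class, the `" > "` breadcrumb separator, and the choice of SHA-256 over the chunk text are all assumptions for illustration; the actual backend may use different fields and a different hash:

```python
import hashlib
from dataclasses import dataclass


def build_context(subtitles: list[str]) -> str:
    """Join the subtitles met on the way to the chunk into a breadcrumb."""
    return " > ".join(subtitles)  # separator is an assumption


@dataclass(frozen=True)
class Chunk:
    sheet_title: str  # title of the sheet the chunk comes from
    context: str      # breadcrumb of subtitles inside the sheet
    url: str          # URL of the sheet (shared by all its chunks)
    text: str

    @property
    def chunk_hash(self) -> str:
        # Hypothetical identifier: SHA-256 of the chunk text.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each chunk, identified by its hash."""
    seen: set[str] = set()
    unique: list[Chunk] = []
    for chunk in chunks:
        if chunk.chunk_hash not in seen:
            seen.add(chunk.chunk_hash)
            unique.append(chunk)
    return unique
```

Two chunks from the same sheet share `sheet_title` and `url` but differ in `context` and `chunk_hash`, which is why they show up as separate sources with the same link.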
Yes, I know there are no duplicate chunks. I was considering deduplicating sheets, because in the end every URL targets the same page. The anchor system doesn't work in general, because our chunks are not always related to the DILA webpage anchors, are they?
Yes, you're right, the anchor idea was just to give a visual hint.
We should de-duplicate similar sources.
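One possible shape for that, as a sketch only: group the sources by sheet URL so that each title and URL appears once, and keep the distinct contexts alongside. The dict keys `title`, `url`, and `context` and the function name are assumptions about the payload, not the current API:

```python
def group_sources_by_sheet(sources: list[dict]) -> list[dict]:
    """Collapse sources sharing a URL into one entry per sheet.

    Each input item is assumed to have 'title', 'url' and 'context' keys.
    Order of first appearance is preserved.
    """
    grouped: dict[str, dict] = {}
    for source in sources:
        entry = grouped.setdefault(
            source["url"],
            {"title": source["title"], "url": source["url"], "contexts": []},
        )
        # Keep each context (breadcrumb) once per sheet.
        if source["context"] not in entry["contexts"]:
            entry["contexts"].append(source["context"])
    return list(grouped.values())
```

With this grouping the user sees each sheet once, and the contexts remain available if we still want to show which parts of the sheet were used.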