Open shariboodts opened 4 months ago
With regard to this issue, I could do the following:
1) Using AI, generate embeddings for all the shelfmarks.
2) For each shelfmark embedding (anchor), find the N closest embeddings (targets) and decide whether any of the targets is a duplicate of the anchor. The decision can be based either on a similarity threshold (e.g., cosine similarity) or on prompting an LLM itself: "Imagine that you are an experienced medievalist. Your task is to decide whether any of the prompted shelfmarks are duplicates..." Based on CPPM experience, this approach would work quite reliably, although I have not measured it formally.
3) The result can subsequently be given to human annotators for manual validation.
Please let me know if you need my help in this process.
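Step 2 with the threshold variant can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the toy three-dimensional vectors, and the 0.95 threshold are all made up here, and real embeddings would come from an embedding model rather than being typed in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_duplicates(embeddings, n=3, threshold=0.95):
    """For each anchor shelfmark, keep those of its n nearest targets
    whose similarity reaches the (assumed) threshold."""
    result = {}
    for anchor, anchor_vec in embeddings.items():
        scored = [
            (target, cosine_similarity(anchor_vec, target_vec))
            for target, target_vec in embeddings.items()
            if target != anchor
        ]
        # Highest similarity first
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[anchor] = [(t, s) for t, s in scored[:n] if s >= threshold]
    return result


# Toy vectors standing in for real shelfmark embeddings (hypothetical data)
toy = {
    "Paris, BnF, lat. 1234": [1.0, 0.0, 0.1],
    "Paris BNF Lat 1234": [0.99, 0.01, 0.1],
    "Oxford, Bodleian, Laud Misc. 5": [0.0, 1.0, 0.0],
}
found = candidate_duplicates(toy, n=2, threshold=0.95)
```

The brute-force comparison is quadratic in the number of shelfmarks, which is fine for a sketch. A larger dataset would probably call for a vector index instead.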
This issue consists of different parts:
- Make room for the similarities: add a model for this, ManuscriptSimilar, to link a manuscript with a manuscript and give it a particular simval. That value is set to 1.0 by default (assuming a cosine similarity value).
- Manuscript with a field for convenience: similars (pointing to the ManuscriptSimilar)
- Showing doubles in listview: Suspected doubles, get_similars_markdown() (part of Manuscript)
- Showing the (suspected) doubles in details view: Label of the field: "Potential doubles for this record in the dataset" Contextual help: "Because data was important from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."
- manu_dedups.json inside media/passim. See the code in adaptations, adapt_similars()
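The data structure described in these parts can be pictured with a plain-Python stand-in. This is not the actual Django model code: the dataclasses, the src/dst field names, and the output format of get_similars_markdown() are assumptions made for illustration. Only the names ManuscriptSimilar, simval, similars, and get_similars_markdown come from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manuscript:
    shelfmark: str
    # Convenience field: the similarity links that start at this manuscript
    similars: List["ManuscriptSimilar"] = field(default_factory=list)

    def get_similars_markdown(self, limit=3):
        """Render the most similar manuscripts as a markdown list (assumed format)."""
        ranked = sorted(self.similars, key=lambda link: link.simval, reverse=True)
        return "\n".join(
            f"- {link.dst.shelfmark} ({link.simval:.2f})" for link in ranked[:limit]
        )


@dataclass
class ManuscriptSimilar:
    src: Manuscript       # hypothetical name: the manuscript being inspected
    dst: Manuscript       # hypothetical name: the suspected double
    simval: float = 1.0   # default of 1.0, read as a cosine similarity value


a = Manuscript("Paris, BnF, lat. 1234")
b = Manuscript("Paris BNF Lat 1234")
a.similars.append(ManuscriptSimilar(src=a, dst=b, simval=0.97))
markdown = a.get_similars_markdown()
```

In a Django project the two links would presumably be ForeignKey fields and similars a many-to-many relation through ManuscriptSimilar, but that is a guess about the implementation.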
CORRECTION Contextual help: "Because data was imported from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."
(My typo, sorry!)
If the question regarding the tab policy is what should happen when clicking one of the suggested doubles: open in a new tab.
There is no question about our tab policy, which is clearly defined in issue #148 (i.e. never open something in a new tab upon clicking, since users can do that themselves with Ctrl+Click or a right mouse button click).
What is done here is an exception to that policy, and the user may not expect this. So what I did to signal it is use a slightly different cursor (the so-called alias cursor), something like:
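One way to picture this, as an assumption rather than the code actually used in PASSIM: a small helper that renders a suggested double as a link opening in a new tab, with the CSS `cursor: alias` value signalling the exception. The function name, the URL, and the inline style are hypothetical.

```python
from html import escape


def render_double_link(url, label):
    """Build an anchor that opens in a new tab and shows the alias cursor."""
    return (
        f'<a href="{escape(url, quote=True)}" target="_blank" '
        f'rel="noopener" style="cursor: alias;">{escape(label)}</a>'
    )


# Hypothetical details URL, for illustration only
link = render_double_link("/manuscript/details/42", "Paris, BnF, lat. 1234")
```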
CORRECTION Contextual help: "Because data was imported from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."
(My typo, sorry!)
Okay, corrected!
However, is the final phrase true: "based on similarity of the shelfmark"?
The list of similarities that GS sends, is that based on similarity of the 'shelfmark' of one manuscript compared with others? Or is it based on similarity between the list of sermons of one manuscript and the list of sermons of a different manuscript?
Received the duplicates JSON file. Imported it via adaptations. Made sure only the top three are shown. Works.
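The "only the top three" step could look roughly like the sketch below. The layout of the JSON file is not described in this thread, so the structure assumed here (anchor shelfmark mapped to a list of target/simval entries) and the function name are purely hypothetical.

```python
import json


def top_similars(raw_json, limit=3):
    """Keep only the `limit` most similar candidates per anchor."""
    data = json.loads(raw_json)
    trimmed = {}
    for anchor, candidates in data.items():
        # Sort by similarity value, highest first, then cut off
        ranked = sorted(candidates, key=lambda c: c["simval"], reverse=True)
        trimmed[anchor] = ranked[:limit]
    return trimmed


# Made-up sample in the assumed layout
sample = json.dumps({
    "MS A": [
        {"target": "MS B", "simval": 0.91},
        {"target": "MS C", "simval": 0.99},
        {"target": "MS D", "simval": 0.80},
        {"target": "MS E", "simval": 0.95},
    ]
})
top = top_similars(sample)
```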
GS notes:
I would write something like this:
“[… with no changes]. This field shows 3 nearest neighbors of the given shelfmark in terms of similarity of their semantic representations. Semantic representation aims to grasp the actual meaning of the shelfmark despite the possible typographic and orthographic discrepancies. Semantic representations were produced using “text-embedding-ada-002” embedding model by OpenAI. The lists of possible candidates were not verified manually and should be used with extreme caution”.
I would suggest a slightly different wording: “[… with no changes]. This field shows the 3 nearest neighbors of the given shelfmark in terms of similarity of their semantic representations. A semantic representation aims to grasp the actual meaning of the shelfmark, despite possible typographic and orthographic discrepancies. The semantic representations were produced using the “text-embedding-ada-002” embedding model by OpenAI. The lists of possible candidates were not verified manually and should be used with extreme caution”.
We want to alert users to the existence of doubles in the dataset:
Generating these lists of doubles can be done using AI. Gleb will communicate about this.