ErwinKomen / RU-passim


Signaling doubles #726

Open shariboodts opened 4 months ago

shariboodts commented 4 months ago

We want to alert users to the existence of doubles in the dataset:

Generating these lists of doubles can be done using AI. Gleb will communicate about this.

glsch commented 2 months ago

With regard to this issue, I could do the following:

  1. Using AI, generate embeddings for all the shelfmarks.
  2. For each shelfmark embedding (anchor), find the N closest embeddings (targets) and decide whether any of the targets is a duplicate of the anchor. The decision can be based either on a similarity threshold (e.g. cosine similarity) or on prompting an LLM itself: "Imagine that you are an experienced medievalist. Your task is to decide whether any of the prompted shelfmarks are duplicates.... " Based on CPPM experience, this approach would work quite reliably, although I have not measured it formally.
  3. The result can subsequently be given to human annotators for manual validation.

Please, let me know if you need my help in this process.
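
A minimal sketch of step 2 with a similarity threshold, assuming the embeddings are already available as plain lists of floats. The function names, the threshold of 0.95, the toy vectors and the shelfmark strings are all made up for illustration; they are not taken from the actual pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_doubles(embeddings, n_closest=3, threshold=0.95):
    """For each anchor, keep those of its n_closest targets that reach the threshold."""
    result = {}
    for anchor, anchor_vec in embeddings.items():
        scored = [
            (target, cosine_similarity(anchor_vec, target_vec))
            for target, target_vec in embeddings.items()
            if target != anchor
        ]
        # Most similar targets first
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[anchor] = [(t, s) for t, s in scored[:n_closest] if s >= threshold]
    return result


# Toy 2-dimensional vectors standing in for real embeddings (hypothetical data)
toy = {
    "Paris, BnF, lat. 1234": [1.0, 0.0],
    "Paris BNF Lat 1234": [0.99, 0.05],
    "Oxford, Bodleian, Laud Misc. 5": [0.0, 1.0],
}
doubles = candidate_doubles(toy)
```

The pairwise loop is quadratic in the number of shelfmarks; for a large dataset a vector index would probably be preferable, but the decision logic stays the same.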

ErwinKomen commented 1 month ago

This issue consists of different parts:

  1. Make room in the database to host manuscript-manuscript similarities (which manuscript appears to be equal to which other one)
  2. Signal the doubles in the Manuscript's listview and details view
  3. Background A.I. based calculation of manuscript-manuscript similarities
  4. The logic to perform the calculations at regular times and update the manuscript-manuscript similarities inside the DB

ErwinKomen commented 1 month ago

Implementation: room

Make room for the similarities: add a model for this

  1. Added model ManuscriptSimilar to link a manuscript with a manuscript and give it a particular simval. That value is set to 1.0 by default (assuming a cosine similarity value).
  2. Extended Manuscript with a field for convenience: similars (pointing to the ManuscriptSimilar)
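
The relation described above can be sketched without any framework, using plain dataclasses as a stand-in for the database models. Only the names ManuscriptSimilar, similars and simval come from this thread; the field names src and dst and the helper add_similar() are assumptions for illustration, and the real model may be laid out differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manuscript:
    shelfmark: str
    # Convenience list of links to suspected doubles; excluded from repr and
    # comparison because the links point back to manuscripts.
    similars: List["ManuscriptSimilar"] = field(
        default_factory=list, repr=False, compare=False
    )


@dataclass
class ManuscriptSimilar:
    src: Manuscript        # hypothetical name: the manuscript being described
    dst: Manuscript        # hypothetical name: the suspected double
    simval: float = 1.0    # similarity value, 1.0 by default


def add_similar(src, dst, simval=1.0):
    """Create a link from src to dst and register it on src.similars."""
    link = ManuscriptSimilar(src=src, dst=dst, simval=simval)
    src.similars.append(link)
    return link
```

Note that a link is directional in this sketch: registering B as a double of A does not automatically register A as a double of B.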
ErwinKomen commented 1 month ago

Implementation: signal

  1. Signal doubles in the listview
    1. Added a column for the 'doubles' (i.e. the similars)
    2. If there are similars a checkmark is shown (otherwise nothing)

Showing doubles in listview: image

  2. List doubles in the details view
    1. Add field Suspected doubles
    2. Create function to fill this field: method get_similars_markdown() (part of Manuscript)
    3. Clickability: it is not our policy to open in a new tab. We did away with that in issue #148.

Showing the (suspected) doubles in details view: image
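
As a rough sketch of what filling the Suspected doubles field could look like: the standalone function below is hypothetical and is not the actual get_similars_markdown() method. It assumes each suspected double is available as a (shelfmark, url, simval) tuple, and the URLs in the usage example are invented.

```python
def similars_to_markdown(similars, max_items=3):
    """Render the most similar manuscripts as comma-separated markdown links."""
    # Highest similarity value first
    ranked = sorted(similars, key=lambda item: item[2], reverse=True)
    links = [
        "[{}]({})".format(shelfmark, url)
        for shelfmark, url, _simval in ranked[:max_items]
    ]
    return ", ".join(links)


# Hypothetical input: shelfmark, details URL, similarity value
example = similars_to_markdown([
    ("A", "/manuscript/1", 0.90),
    ("B", "/manuscript/2", 0.99),
])
```

An empty list of similars yields an empty string, so the field simply stays blank when there are no suspected doubles.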

shariboodts commented 1 month ago

Label of the field: "Potential doubles for this record in the dataset" Contextual help: "Because data was important from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."

ErwinKomen commented 1 month ago

Actions

  1. Changed label to 'Potential doubles' (the remainder is too much; it appears when hovering)
  2. Added a contextual question mark and text as indicated above
  3. (June:) added an adaptations function to import and process a file manu_dedups.json inside media/passim. See the code in adaptations, adapt_similars()

shariboodts commented 1 month ago

CORRECTION Contextual help: "Because data was imported from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."

(My typo, sorry!)

shariboodts commented 1 month ago

If question with regard to Tab policy is what should happen when clicking one of the suggested doubles: Open in New Tab.

ErwinKomen commented 4 weeks ago

> If question with regard to Tab policy is what should happen when clicking one of the suggested doubles: Open in New Tab.

There is no question about our tab policy, which is clearly defined in issue #148 (i.e. never open something in a new tab upon clicking, since the user can do it him/herself by using Ctrl+Click or Right mouse button click).

What is done here is an exception to that policy, and the user may not expect this. To signal it, I use a slightly different cursor (the so-called alias cursor).

ErwinKomen commented 4 weeks ago

> CORRECTION Contextual help: "Because data was imported from different sources, it is possible there are multiple records in PASSIM for the same manuscript. This field shows an AI-generated, unverified list of 3 most likely doubles for this record, based on similarity of the shelfmark."
>
> (My typo, sorry!)

Okay, corrected!

However, is the final phrase true: based on similarity of the shelfmark? The list of similarities that GS sends, is that based on similarity of the 'shelfmark' of one manuscript compared with others? Or is it based on similarity between the list of sermons of one manuscript and the list of sermons of a different manuscript?

ErwinKomen commented 3 weeks ago

Received the duplicates JSON file. Imported it via adaptations. Made sure only the top three are shown. Works.
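
A minimal sketch of the "top three" selection, assuming a JSON layout in which each manuscript id maps to a list of candidates with an id and a simval. That layout, the function name top_similars() and the sample values are assumptions for illustration; the real manu_dedups.json and adapt_similars() may look different.

```python
import json


def top_similars(raw_json, max_items=3):
    """Parse the duplicates JSON and keep the max_items best candidates per manuscript."""
    data = json.loads(raw_json)
    result = {}
    for manu_id, candidates in data.items():
        # Highest similarity value first
        ranked = sorted(candidates, key=lambda c: c["simval"], reverse=True)
        result[manu_id] = ranked[:max_items]
    return result


# Hypothetical sample in the assumed layout
sample = json.dumps({
    "10": [
        {"id": 11, "simval": 0.91},
        {"id": 12, "simval": 0.99},
        {"id": 13, "simval": 0.95},
        {"id": 14, "simval": 0.80},
    ]
})
top = top_similars(sample)
```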

ErwinKomen commented 3 weeks ago

GS notes:

I would write something like this:

“[… with no changes]. This field shows 3 nearest neighbors of the given shelfmark in terms of similarity of their semantic representations. Semantic representation aims to grasp the actual meaning of the shelfmark despite the possible typographic and orthographic discrepancies. Semantic representations were produced using “text-embedding-ada-002” embedding model by OpenAI. The lists of possible candidates were not verified manually and should be used with extreme caution”.

I would suggest a slightly different spelling: “[… with no changes]. This field shows 3 nearest neighbors of the given shelfmark in terms of similarity of their semantic representations. A semantic representation aims to grasp the actual meaning of the shelfmark, despite the possible typographic and orthographic discrepancies. The semantic representations were produced using the “text-embedding-ada-002” embedding model by OpenAI. The lists of possible candidates were not verified manually and should be used with extreme caution”.