ErwinKomen / RU-passim


DCT: clustering plugin background #721

Open ErwinKomen opened 6 months ago

ErwinKomen commented 6 months ago

Create a method to prepare data for the plugin dashboard.

This is the second part of issue #711

ErwinKomen commented 6 months ago

Implementation: part 2, data preparation method

glsch commented 3 months ago

Hi Erwin,

What follows is not urgent. Shari asked me to describe here the options I see for the future development of this plugin or, rather, of the functionality it currently provides within the PASSIM web application.

I see two options.

In this case, it would be necessary to develop a method comprising several phases:

  1. Fetching the data in which each manuscript will be represented as a series of sets of AF (since each manifestation can be ultimately linked to multiple AFs) with all the necessary metadata. This metadata can be later used for filtering, colouring, etc. This can be a fetch of the entire dataset or a part of it, based on the user’s query.
  2. Computing pairwise distances between the extracted manuscripts. The algorithm would be the one underlying Levenshtein distance, but in our case we'll have a special logic for determining whether two elements match. Since we represent manuscripts as lists of sets, I suggest treating a non-empty intersection between the compared sets as a match. E.g., ms_1 = [{1, 2}, {7, 8, 9}] and ms_2 = [{1, 12}, {4, 5}] would have one match, because the sets at their first position share one element: 1. However, it might be beneficial to be able to define a different, say stricter, logic for the match. The cost of each operation could also be adjustable, so that we can bias the algorithm if needed. Being able to compute distances between lists of sermons of any kind would improve the heuristics in PASSIM: for DCT or any other purpose, it will be possible to rely on a measure that takes into account not only content overlap but also the organisation.
  3. Once a distance matrix is obtained, the next step would be to apply dimension reduction with UMAP to project the dataset onto a two-dimensional surface and visualize it. This is where the metadata comes into play: colouring options abound.
  4. Finally, one could also implement a visualisation of the best possible alignment of any two manuscripts, which is currently missing. For this, I would suggest a heatmap in which each row represents a manuscript. The different traceback actions (deletion, insertion, replacement) would be shown in different colours. That would give the user an idea of the shared subsequences of sermons and help target any further qualitative analysis.
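
To make phase 2 concrete, here is a minimal sketch of a Levenshtein-style distance over lists of sets, in plain Python. The names `sets_overlap` and `set_edit_distance`, the default costs of 1.0, and the use of integers as AF identifiers are all assumptions for illustration; nothing here is taken from the PASSIM code base.

```python
from typing import Callable, Sequence, Set

# A manuscript is assumed to be an ordered sequence of sets of AF ids.
Manuscript = Sequence[Set[int]]


def sets_overlap(a: Set[int], b: Set[int]) -> bool:
    """Default match rule: the two sets share at least one element."""
    return not a.isdisjoint(b)


def set_edit_distance(
    ms_a: Manuscript,
    ms_b: Manuscript,
    match: Callable[[Set[int], Set[int]], bool] = sets_overlap,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: float = 1.0,
) -> float:
    """Edit distance between two lists of sets (hypothetical helper).

    Uses the standard dynamic-programming recurrence, keeping only
    two rows of the cost table in memory.
    """
    n, m = len(ms_a), len(ms_b)
    # Row 0: turning an empty list into ms_b[:j] takes j insertions.
    prev = [j * ins_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: turning ms_a[:i] into an empty list takes i deletions.
        curr = [i * del_cost] + [0.0] * m
        for j in range(1, m + 1):
            step = 0.0 if match(ms_a[i - 1], ms_b[j - 1]) else sub_cost
            curr[j] = min(
                prev[j - 1] + step,      # match or replacement
                prev[j] + del_cost,      # deletion from ms_a
                curr[j - 1] + ins_cost,  # insertion into ms_a
            )
        prev = curr
    return prev[m]


ms_1 = [{1, 2}, {7, 8, 9}]
ms_2 = [{1, 12}, {4, 5}]
overlap_distance = set_edit_distance(ms_1, ms_2)
strict_distance = set_edit_distance(ms_1, ms_2, match=lambda a, b: a == b)
```

With the overlap rule, the example pair has one match and one replacement; passing a stricter `match` (here, set equality) turns the first position into a replacement as well. The three cost parameters are where a bias could be introduced.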

I've never worked with Django myself, but I would be interested to try. So if, within the timeframe of PASSIM or for Sven's project, the decision is made to build this tool anew, I'd be happy to get involved more deeply. At any rate, I am always available to discuss these and any other options.

ErwinKomen commented 2 months ago

2.1 Redevelopment: fetching the data

The idea here is:

each manuscript will be represented as a series of sets of AF (since each manifestation can be ultimately linked to multiple AFs) with all the necessary metadata.
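
One possible shape for this representation, sketched in plain Python rather than Django. The class `ManuscriptRecord`, the function `build_records`, and the `(ms_id, position, af_id)` link rows are hypothetical; the actual PASSIM models and field names are not assumed here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass
class ManuscriptRecord:
    """Hypothetical container: one manuscript as an ordered list of AF sets."""
    ms_id: int
    items: List[FrozenSet[int]]
    metadata: Dict[str, str] = field(default_factory=dict)


def build_records(
    links: Iterable[Tuple[int, int, int]],
    metadata: Optional[Dict[int, Dict[str, str]]] = None,
) -> List[ManuscriptRecord]:
    """Group (ms_id, position, af_id) rows into one record per manuscript.

    Rows sharing the same manuscript and position end up in the same set,
    which covers manifestations linked to more than one AF.
    """
    grouped: Dict[int, Dict[int, set]] = defaultdict(lambda: defaultdict(set))
    for ms_id, position, af_id in links:
        grouped[ms_id][position].add(af_id)

    metadata = metadata or {}
    records = []
    for ms_id in sorted(grouped):
        positions = grouped[ms_id]
        # Keep the order of the manifestations within the manuscript.
        items = [frozenset(positions[p]) for p in sorted(positions)]
        records.append(
            ManuscriptRecord(ms_id, items, dict(metadata.get(ms_id, {})))
        )
    return records


# Illustrative rows only; ids and metadata values are made up.
rows = [
    (1, 0, 1), (1, 0, 2), (1, 1, 7), (1, 1, 8), (1, 1, 9),
    (2, 0, 1), (2, 0, 12), (2, 1, 4), (2, 1, 5),
]
records = build_records(rows, metadata={1: {"library": "A"}})
```

The rows could come from the entire dataset or from a user's query; the metadata dictionary is kept alongside the sets so it stays available for filtering and colouring later.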

ErwinKomen commented 2 months ago

2.2 Redevelopment: computing distances

ErwinKomen commented 2 months ago

2.3 Redevelopment: dimension reduction

ErwinKomen commented 2 months ago

2.4 Redevelopment: visualization[s]
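
For the alignment view, a rough sketch of how the traceback actions could be recovered, assuming unit costs and the "non-empty intersection" match rule proposed above. The function name `align` and the operation labels are hypothetical; the resulting list is what a heatmap could map to colours.

```python
from typing import List, Optional, Sequence, Set, Tuple

# (action, index in ms_a or None, index in ms_b or None)
Operation = Tuple[str, Optional[int], Optional[int]]


def align(ms_a: Sequence[Set[int]], ms_b: Sequence[Set[int]]) -> List[Operation]:
    """Return one optimal alignment as a list of edit operations.

    Assumes unit costs; two sets match when they share an element.
    """
    n, m = len(ms_a), len(ms_b)
    # Full cost table is needed here so that the path can be traced back.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if not ms_a[i - 1].isdisjoint(ms_b[j - 1]) else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + step,  # match or replacement
                cost[i - 1][j] + 1,         # deletion
                cost[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner, preferring the diagonal.
    ops: List[Operation] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if not ms_a[i - 1].isdisjoint(ms_b[j - 1]) else 1
            if cost[i][j] == cost[i - 1][j - 1] + step:
                ops.append(("match" if step == 0 else "replace", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return ops


example_ops = align([{1, 2}, {7, 8, 9}], [{1, 12}, {4, 5}])
gap_ops = align([{1}, {2}, {3}], [{1}, {3}])
```

When several alignments have the same cost, this sketch returns only one of them (the diagonal move is tried first); whether that tie-breaking is appropriate for the visualisation is an open design choice.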