DCT: clustering plugin background

ErwinKomen commented 6 months ago

Create a method to prepare data for the plugin dashboard.

This is the second part of issue #711

ErwinKomen commented 6 months ago

Implementation: part 2, data preparation method

glsch commented 3 months ago

Hi Erwin,

Nothing here is urgent. Shari asked me to describe the options I see for the future development of this plugin or, rather, of the functionality it currently provides within the PASSIM web application.

I see two options.

The first option would be to develop only a method for fetching the data from the database. In this case, the plugin itself will receive only the cosmetic improvements described in this issue and related issues. Although this is entirely viable, it will leave the plugin largely alien to the main body of the application, which is probably not very good in the mid-term and might even involve some sustainability risks if the out-of-the-stack libraries used in the plugin become obsolete. Besides, the plugin in its current form was developed for experimental and research purposes, which means that its functionality is excessive where it is not needed (the heatmap, I would say) and, at the same time, insufficient where more options would be helpful (e.g., colouring options, sorting, types of clustering and interactions with the datapoints).
The second option would be to retain only the very concept of the plugin and redevelop it from scratch, with the aim of making it an integral part of the application. From a technical point of view, that would mean switching to the tools currently used elsewhere in PASSIM for querying and visualising the data. In this way, it would be possible not only to keep the plugin working reliably but also to rely on its functionality in many other processes in PASSIM, notably data retrieval, e.g., finding the K nearest (most similar) manuscripts.

In this case, it would be necessary to develop a method comprising several phases:

Fetching the data in which each manuscript will be represented as a series of sets of AF (since each manifestation can be ultimately linked to multiple AFs) with all the necessary metadata. This metadata can be later used for filtering, colouring, etc. This can be a fetch of the entire dataset or a part of it, based on the user’s query.
Computing pairwise distances between the extracted manuscripts. The algorithm would be the one underlying Levenshtein distance, but with our own rule for deciding whether two elements match. Since manuscripts are represented as lists of sets, I suggest treating a non-empty intersection between the compared sets as a match. E.g., ms_1 = [{1, 2}, {7, 8, 9}] and ms_2 = [{1, 12}, {4, 5}] would have one match, because the sets at their first index share one element, 1. However, it might be useful to be able to define a different, say stricter, rule for the match. The cost of each operation could also be adjustable, so that we can bias the algorithm if needed. Being able to compute distances between lists of sermons of any kind would improve the heuristics in PASSIM: for DCT or any other purpose, it would be possible to rely on a measure that takes into account not only content overlap but also the organisation.
Once a distance matrix is obtained, the next step would be to apply dimension reduction with UMAP to project the dataset onto a plane and visualise it. This is where the metadata will come into play: colouring options abound.
Finally, one could also implement a visualisation of the best possible alignment of any two manuscripts, which is currently missing. For this, I would suggest a heatmap in which each row represents a manuscript and the different traceback actions (deletion, insertion, replacement) are shown in different colours. That would give the user an idea of the shared subsequences of sermons and help target any further qualitative analysis.
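
To make the distance and alignment phases more concrete, here is a minimal Python sketch of how they could look. It is only an illustration under assumptions, not an existing PASSIM function: the names align and overlap, the default costs of 1.0 and the operation labels are all hypothetical. The default match rule is the non-empty intersection suggested above, and both the match rule and the costs can be swapped.

```python
from typing import Callable, List, Set, Tuple

# A manuscript as an ordered list of positions, each a set of AF ids.
Manuscript = List[Set[int]]


def overlap(a: Set[int], b: Set[int]) -> bool:
    """Default match rule (assumption): the sets share at least one element."""
    return not a.isdisjoint(b)


def align(
    ms_a: Manuscript,
    ms_b: Manuscript,
    match: Callable[[Set[int], Set[int]], bool] = overlap,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: float = 1.0,
) -> Tuple[float, List[str]]:
    """Return (distance, operations) for editing ms_a into ms_b."""
    n, m = len(ms_a), len(ms_b)
    # dp[i][j] = cheapest cost of editing ms_a[:i] into ms_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if match(ms_a[i - 1], ms_b[j - 1]) else sub_cost
            dp[i][j] = min(
                dp[i - 1][j - 1] + step,  # match or replacement
                dp[i - 1][j] + del_cost,  # deletion from ms_a
                dp[i][j - 1] + ins_cost,  # insertion from ms_b
            )

    # Traceback from the bottom-right corner to recover one best alignment.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            hit = match(ms_a[i - 1], ms_b[j - 1])
            step = 0.0 if hit else sub_cost
            if dp[i][j] == dp[i - 1][j - 1] + step:
                ops.append("match" if hit else "replace")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()
    return dp[n][m], ops


ms_1 = [{1, 2}, {7, 8, 9}]
ms_2 = [{1, 12}, {4, 5}]
distance, operations = align(ms_1, ms_2)
```

For the two example manuscripts this gives a distance of 1.0 with the operations match, replace. Passing a stricter rule such as match=lambda a, b: a == b raises the distance to 2.0, since no pair of sets is then identical. The list of operations is what the alignment heatmap would colour.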

I’ve never worked with Django myself, but I would be interested to try. So, if the decision is made, within the lifetime of PASSIM or for Sven’s project, to try building this tool anew, I’d be happy to get involved more deeply. At any rate, I am always available to discuss these and any other options.

ErwinKomen commented 2 months ago

2.1 Redevelopment: fetching the data

The idea here is:

each manuscript will be represented as a series of sets of AF (since each manifestation can be ultimately linked to multiple AFs) with all the necessary metadata.
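
As a rough sketch of that representation (not the actual PASSIM query): assuming the fetch yields flat rows of the form (manuscript id, position in the manuscript, AF id), they could be grouped into ordered lists of sets as below. The row layout and the name build_sequences are hypothetical, and the metadata are left out here; they could live in a separate dictionary keyed by manuscript id.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical flat row: (manuscript id, position in the manuscript, AF id)
Row = Tuple[int, int, int]


def build_sequences(rows: Iterable[Row]) -> Dict[int, List[Set[int]]]:
    """Group rows into one ordered list of AF-id sets per manuscript."""
    by_ms: Dict[int, Dict[int, Set[int]]] = defaultdict(lambda: defaultdict(set))
    for ms_id, position, af_id in rows:
        # Several AFs at the same position end up in the same set.
        by_ms[ms_id][position].add(af_id)
    return {
        ms_id: [positions[p] for p in sorted(positions)]
        for ms_id, positions in by_ms.items()
    }


rows = [
    (1, 0, 1), (1, 0, 2), (1, 1, 7), (1, 1, 8), (1, 1, 9),
    (2, 1, 4), (2, 0, 1), (2, 0, 12), (2, 1, 5),
]
sequences = build_sequences(rows)
```

Rows may arrive in any order, because positions are sorted when the lists are built.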

ErwinKomen commented 2 months ago

2.2 Redevelopment: computing distances
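
A minimal sketch of the matrix step, assuming the pairwise measure is symmetric (which holds for an edit distance when insertion and deletion cost the same). The distance function is passed in as a parameter, so the set-based edit distance proposed earlier in this issue can be plugged in. The names distance_matrix and length_gap are hypothetical, and length_gap is only a stand-in to keep the example short.

```python
from typing import Callable, List, Sequence, Set

Manuscript = List[Set[int]]


def distance_matrix(
    items: Sequence[Manuscript],
    dist: Callable[[Manuscript, Manuscript], float],
) -> List[List[float]]:
    """Fill a symmetric matrix, computing each pair only once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(items[i], items[j])
    return matrix


def length_gap(a: Manuscript, b: Manuscript) -> float:
    """Placeholder distance (assumption): difference in number of positions."""
    return float(abs(len(a) - len(b)))


manuscripts = [[{1}], [{1}, {2}], []]
matrix = distance_matrix(manuscripts, length_gap)
```

The number of comparisons grows quadratically with the number of manuscripts, so for a fetch of the entire dataset the result would probably need to be cached.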

ErwinKomen commented 2 months ago

2.3 Redevelopment: dimension reduction

ErwinKomen commented 2 months ago

ErwinKomen / RU-passim