ErwinKomen / RU-passim

DCT: incorporate clustering plug-in #711

Closed ErwinKomen closed 9 months ago

ErwinKomen commented 10 months ago

Integrate the clustering plug-in that we talked about.

ErwinKomen commented 10 months ago

Explanation

The passim-plugin consists of two basic parts:

  1. The dashboard, which works on 'prepared' data - first part
    1. Outward appearance (front end)
    2. Functions interacting with the plugin behind the 'page'
    3. The plugin itself (already created by GS)
  2. A method to prepare data for the dashboard - second part

ErwinKomen commented 9 months ago

Implementation: part 1, user dashboard

  1. Added models (database tables) to facilitate a dashboard:
    1. BoardDataset: points to a location on the server that holds pre-calculated dataset data
    2. SermonsDistance: list of methods to measure distance between sermons - pre-set via admin interface
    3. SeriesDistance: list of methods to measure distance between series - pre-set via admin interface
    4. Dimension: choice between 2d and 3d (for the moment; pre-set via admin)
    5. ClMethod: list of clustering methods that can be used - pre-set via the admin interface
    6. Highlight: list of fields and other things that can be used as a 'highlight'. Two main components:
      1. A number of fields from the SermonDescr: library, idno, lcity, lcountry, date, total, sermons, content, century, age, is_emblamatic
      2. Full manuscript names, using their standard identification (city, library, shelfmark)
  2. Added Tools > Plugin, a link to the dashboard, which is in plugin/view.py sermboard
    1. This exposes to the user all the data that can be chosen from, via BoardForm + plugin/sermonboard.html
  3. Create space for the datasets on the server
    1. Once we move to a container, this location should be 'externally linked'
    2. Whether via link or not, the location will be MEDIA_DIR/plugin/preprocessed_data/...
  4. The datasets need to be 'loaded' in order to be usable. When to load them?
    1. Use a global store in calculate.py
    2. Load this store, as soon as a new object GenGraph is created
      1. If it is slow: run the loading in a separate background thread
  5. The dashboard options should depend on the tab that is chosen
    1. Umap: initial highlight should be lcountry
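
A minimal sketch of the loading strategy from item 4 (global store, loading triggered when a GenGraph is created, background thread if loading is slow). The class DatasetStore, the loader load_demo_datasets and the GenGraph body below are hypothetical stand-ins, not the code in calculate.py; the real loader would read from the preprocessed_data location instead of returning an in-memory dictionary.

```python
import threading

class DatasetStore:
    """Process-wide cache that loads its datasets at most once, in the background."""

    def __init__(self, loader):
        self._loader = loader              # callable returning {name: data}
        self._data = {}
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._thread = None

    def ensure_loading(self):
        """Start the background load on the first call; later calls do nothing."""
        with self._lock:
            if self._thread is None:
                self._thread = threading.Thread(target=self._load, daemon=True)
                self._thread.start()

    def _load(self):
        try:
            self._data = self._loader()
        finally:
            # Always release waiters, even if the loader raised
            self._ready.set()

    def get(self, name, timeout=None):
        """Block until loading has finished, then return one dataset."""
        self.ensure_loading()
        if not self._ready.wait(timeout):
            raise TimeoutError("datasets are still loading")
        return self._data[name]


load_calls = []  # records how often the loader actually ran

def load_demo_datasets():
    # Assumption: stands in for reading the pre-calculated files from disk
    load_calls.append(1)
    return {"demo": [[0.0, 1.0], [1.0, 0.0]]}

STORE = DatasetStore(load_demo_datasets)  # the 'global store'

class GenGraph:
    # Hypothetical stand-in: creating a graph object triggers the load
    def __init__(self):
        STORE.ensure_loading()


first = GenGraph()
second = GenGraph()
demo = STORE.get("demo", timeout=5)
```

With this shape, creating a second GenGraph does not reload anything, and a request that arrives early simply waits in get() until the data is ready.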

ErwinKomen commented 9 months ago

Remaining issues

  1. Implement user interface response on changing tab pages
    1. Make use of the <div> groups umap_params and clustering_params - works
  2. Implementing Clustering:
    1. Cannot find name linkage in calculations
      1. This is a function from scipy/cluster/hierarchy.py
      2. Should have been imported via from scipy.cluster.hierarchy import *
    2. Okay, is working now!
  3. Implementing Umap: is working
  4. Implementing Series Heatmap:
  5. Implementing Sermons Heatmap:
    1. No figure being produced...
    2. Error "None of Code in ..." - the correct combination of Sermons Distance and Series Distance is required
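
For the linkage name problem under item 2, one way to avoid depending on a wildcard import is to import the needed names explicitly. The sketch below assumes NumPy and SciPy are installed; cluster_items and the sample points are hypothetical and only illustrate the calls, they are not the plugin's own calculation code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage  # explicit, not 'import *'
from scipy.spatial.distance import pdist

def cluster_items(features, n_clusters=2, method="ward"):
    """Hierarchically cluster the rows of 'features' and cut into flat clusters."""
    # Condensed Euclidean distance matrix (ward assumes Euclidean distances)
    condensed = pdist(features, metric="euclidean")
    # Linkage matrix: one row per merge, so n-1 rows for n items
    tree = linkage(condensed, method=method)
    # Cut the tree so that at most n_clusters flat clusters remain
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels

# Assumption: two clearly separated groups of made-up 2D points
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
tree, labels = cluster_items(features)
```

An explicit import makes a missing name fail at import time, in the module that needs it, rather than later inside the calculation.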

| Filter | Clustering | Umap | Series Heatmap | Sermons Heatmap |
|---|---|---|---|---|
| Minimal collection length | 5 | 5 | 5 | 5 |
| Sermons | + | + | + | + |
| Anchor manuscript | + | + | + | + |
| Number of closest manuscripts | 10 | 10 | 10 | 10 |
| Target dimension | - | 2D | - | - |
| Highlight | - | century | - | - |
| Number of neighbours | - | 10 | - | - |
| Minimal distance | - | 0.1 | - | - |
| Clustering method | ward | - | - | - |

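
The per-tab filter values listed above could be kept in one lookup structure, so that the form can show only what applies to the selected tab. This is a sketch under assumptions: all key names and the function filters_for_tab are hypothetical, a '+' is read here as 'filter is available' (True), and a '-' as 'filter is absent for that tab'.

```python
# Filters shared by all four tabs (values taken from the listing above)
COMMON = {
    "min_collection_length": 5,
    "sermons": True,              # '+' read as: filter is available
    "anchor_manuscript": True,    # '+' read as: filter is available
    "closest_manuscripts": 10,
}

# Tab-specific filters; a '-' in the listing means the key is simply absent
TAB_DEFAULTS = {
    "clustering": {**COMMON, "clustering_method": "ward"},
    "umap": {
        **COMMON,
        "target_dimension": "2D",
        "highlight": "century",
        "neighbours": 10,
        "min_distance": 0.1,
    },
    "series_heatmap": dict(COMMON),
    "sermons_heatmap": dict(COMMON),
}

def filters_for_tab(tab):
    """Return a fresh copy of the default filters for one tab."""
    return dict(TAB_DEFAULTS[tab])

umap_filters = filters_for_tab("umap")
clustering_filters = filters_for_tab("clustering")
```

Returning a copy keeps a caller that adjusts one request from changing the defaults for the next one.
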

ErwinKomen commented 9 months ago

Follow-up: see issue #733