LD4P / sinopia_editor

Sinopia Linked Data Editor
https://sinopia.io/
Apache License 2.0

Intelligently auto populate fields using data mining and machine learning for ease of use and to minimize selecting entity terms in drop downs... #2766

Open jimfhahn opened 3 years ago

jimfhahn commented 3 years ago

Describe the feature you'd like

Broadly, I would like the editor to be able to intelligently auto-populate fields using data mining and machine learning. A database of association rules that I am building from Ivy+ POD data could provide, within bf:instance entities, Publisher --> Publisher Place and Author/Agent --> Publisher auto-suggestions, and possibly, for bf:work entities, Author/Agent --> Subject auto-suggestions.

There is also the possibility to semi-intelligently auto complete or auto populate related works or to suggest possible super works in the bf:work entity description.

Give an example

Work Title --> Subject auto suggest: Given a title in a work description, it may be possible to auto-populate the subject fields with suggested entities. If a user does not want these, they can simply delete them from the record. I was inspired by the Finto AI service (https://ai.finto.fi/), which could be preconfigured with id.loc.gov subjects in SKOS and then sent a title for possible subject suggestions that are auto-populated.

Agent/Author --> Subjects and Publishers: Using a database of FP-Growth association rules (perhaps packaged in SQLite for the editor), the interface can suggest a set of subjects and publishers for an entered agent or author.
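As a rough illustration of the lookup described above, here is a minimal Python sketch of association rules stored in SQLite and queried by agent. The table layout, column names, confidence threshold, and sample rows are all assumptions made up for this example; they are not taken from the actual Ivy+ POD rule database.

```python
import sqlite3

# Hypothetical sample rules: (antecedent agent, consequent type, consequent, confidence).
# These rows are invented for illustration and are not real mined rules.
EXAMPLE_RULES = [
    ("Example Author", "publisher", "Example Press", 0.82),
    ("Example Author", "subject", "Example subject heading", 0.64),
    ("Example Author", "subject", "Rarely used heading", 0.12),
]


def build_rules_db(rules):
    """Load association rules into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules ("
        " antecedent TEXT, consequent_type TEXT,"
        " consequent TEXT, confidence REAL)"
    )
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", rules)
    conn.execute("CREATE INDEX idx_rules_antecedent ON rules (antecedent)")
    conn.commit()
    return conn


def suggest(conn, agent, consequent_type, min_confidence=0.5, limit=5):
    """Return (consequent, confidence) pairs for an agent, best first."""
    return conn.execute(
        "SELECT consequent, confidence FROM rules"
        " WHERE antecedent = ? AND consequent_type = ? AND confidence >= ?"
        " ORDER BY confidence DESC LIMIT ?",
        (agent, consequent_type, min_confidence, limit),
    ).fetchall()


conn = build_rules_db(EXAMPLE_RULES)
```

Calling `suggest(conn, "Example Author", "subject")` returns only the rules at or above the confidence threshold, so low-confidence associations never reach the form. Where that threshold should sit is a tuning question this sketch does not answer.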

Describe alternatives you've considered

I have tried using the current drop-down approach to select entities from QA. While doing one or two is not a problem, there are many drop-downs across the Sinopia interface. I think entity selection is important and valuable for linked data work. I have observed catalogers making extensive use of the QA drop-downs, and I think cataloging with entity selection takes more time than catalogers are accustomed to. This is especially true when finding a geographic name and when a QA list is not what the cataloger expects. If the Sinopia interface were able to fill in terms semi-intelligently and semi-automatically, I think it would help streamline cataloging in Sinopia and, at the same time, help support cataloger acceptance of a semi-intelligent interface.

michelleif commented 3 years ago

This sounds promising, and I think it is something @jermnelson may be interested in. There are some additional ideas for intelligent auto-population in a very old ticket from when we evaluated other systems, so I'm linking to that here for reference: https://github.com/LD4P/requirements/issues/10

jimfhahn commented 3 years ago

I had some time to explore the Annif software library for the title-->subject auto-population feature...

After loading LCSH vocabulary and training, the feature seems promising/feasible from a small test set of 2 million title-subject pairs from Penn. The data-sets are available here: https://github.com/jimfhahn/Annif-tutorial/tree/master/data-sets

...and commands to run the service: https://github.com/jimfhahn/Annif-tutorial/tree/master/exercises
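For readers who want to see what a title --> subject lookup might look like from a client, here is a minimal Python sketch against Annif's REST API (`POST /v1/projects/{project_id}/suggest`, which returns a `results` list with `uri`, `label`, and `score`). The base URL, the project id `lcsh-demo`, and the limit and threshold values are assumptions for illustration, not the configuration used in the tutorial linked above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_suggest_request(base_url, project_id, title, limit=5, threshold=0.1):
    """Build a form-encoded POST request for Annif's suggest endpoint."""
    url = f"{base_url.rstrip('/')}/v1/projects/{project_id}/suggest"
    body = urlencode(
        {"text": title, "limit": limit, "threshold": threshold}
    ).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def parse_suggestions(payload):
    """Turn an Annif response into (uri, label, score) tuples, best first."""
    results = sorted(
        payload.get("results", []), key=lambda r: r["score"], reverse=True
    )
    return [(r["uri"], r["label"], r["score"]) for r in results]


def suggest_subjects(base_url, project_id, title):
    """Call a running Annif instance (requires network access to it)."""
    request = build_suggest_request(base_url, project_id, title)
    with urlopen(request, timeout=10) as response:
        return parse_suggestions(json.load(response))
```

With a local Annif server running, `suggest_subjects("http://localhost:5000", "lcsh-demo", "A history of printing")` would return ranked subject candidates that an editor could pre-fill and a cataloger could then delete or keep.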

jimfhahn commented 3 years ago

@jermnelson I started to work through an example fixture of annif subject lookups using discogs as an exemplar...

Where would you suggest placing the Annif API? From what I can tell, the Annif API is not a good fit for apidoc.json because of the expected lookup.ld4l.org/ path, and I did not find an API path there for Annif. Could I supply a link to the trained Annif Dockerfile on Docker Hub for a pilot test, or is there some other local setup you would suggest?

justinlittman commented 3 years ago

@jimfhahn -- Jeremy is on vacation, so his response may be delayed.

jermnelson commented 3 years ago

Hi @jimfhahn, catching up now. I think we'll want to create a separate configuration for the Annif API from the QA apidoc.json within the Sinopia Editor. I would be interested in seeing the Dockerfile you mentioned and I'll start investigating possible deployment to Sinopia's AWS cluster. Thanks!

jimfhahn commented 3 years ago

Hi @jermnelson, here is a rough, development-only Docker image: https://hub.docker.com/repository/docker/jimfhahn/ivyplus-annif-api

It is based on the Annif GitHub tutorial, but when prepping to get the data into Docker Hub, I saw that there is an extensive production API compose file that can be configured for your production environment. I think it likely needs to be scoped and modified by DevOps professionals who have experience with production containers; I don't have the experience to know what might be needed to build the production API. A cursory read of the compose file does make me believe it may be possible to sync the data from the image linked above.

Further into the production weeds: an ML system in production will likely need sustained data monitoring, however it is implemented. I am currently working on a Coursera specialization on MLOps (Machine Learning Engineering for Production). Here is one telling image on the iterative nature of this work:

[Images: Screen Shot 2021-05-31 at 8 33 25 AM; Screen Shot 2021-05-22 at 5 10 59 PM]

Basically, even once you have something in production, there is an iterative process of checking for data drift and vocabulary drift, retraining the model, and so on. The more traditional production concerns apply, too. :)
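One very simple way to picture a vocabulary drift check is sketched below in Python: measure how many recently assigned subject URIs fall outside the vocabulary the model was trained on, and flag retraining when that share passes a threshold. The function names and the 5% threshold are assumptions made for this illustration; this is not a method taken from the course or from Annif.

```python
def vocabulary_drift(trained_vocab, recent_subject_uris):
    """Fraction of recent subject URIs that are absent from the trained vocabulary."""
    recent = list(recent_subject_uris)
    if not recent:
        return 0.0  # nothing to compare, so report no drift
    unseen = sum(1 for uri in recent if uri not in trained_vocab)
    return unseen / len(recent)


def needs_retraining(trained_vocab, recent_subject_uris, max_drift=0.05):
    """Flag retraining when drift exceeds a (hypothetical) tolerance."""
    return vocabulary_drift(trained_vocab, recent_subject_uris) > max_drift
```

A real monitoring setup would likely also track suggestion quality over time, but even a crude ratio like this makes the "check, then retrain" loop concrete.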

[Image: Screen Shot 2021-05-31 at 8 48 18 AM]

jimfhahn commented 3 years ago

> Hi @jimfhahn, catching up now. I think we'll want to create a separate configuration for the Annif API from the QA apidoc.json within the Sinopia Editor. I would be interested in seeing the Dockerfile you mentioned and I'll start investigating possible deployment to Sinopia's AWS cluster. Thanks!

@jermnelson just to clarify: if I am adding an API source like the Annif API, are you saying it should go in a new file ('newapi.json') or be added to 'apidoc.json'?

I am also wondering whether I should pursue Annif hosting from Penn IT, or whether Annif is required to be hosted within a specific infrastructure.

jimfhahn commented 2 years ago

I have been looking into various services that exist from LC. The suggest2 service from the Library of Congress can be configured for works/hubs searching and may be of use in parts of the Sinopia editor when adding related works in the Work Description:

https://id.loc.gov/resources/works/suggest2?q=history

https://id.loc.gov/resources/hubs/suggest2?q=history
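To show how an editor component might build these lookups, here is a small Python sketch that constructs suggest2 URLs for works or hubs and fetches the raw JSON. It assumes the query parameter is named `q`; the helper names are hypothetical, and the response is returned unparsed because its field layout is not described in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://id.loc.gov/resources"
DATASETS = ("works", "hubs")


def suggest2_url(dataset, term):
    """Build a suggest2 URL for 'works' or 'hubs' (assumes a `q` parameter)."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}")
    return f"{BASE_URL}/{dataset}/suggest2?{urlencode({'q': term})}"


def fetch_suggestions(dataset, term):
    """Fetch the raw JSON response (requires network access to id.loc.gov)."""
    with urlopen(suggest2_url(dataset, term), timeout=10) as response:
        return json.load(response)
```

For example, `suggest2_url("hubs", "history")` yields the hubs lookup URL for the term "history", which could back a related-works suggestion field.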