biigle / core

Application core of BIIGLE
https://biigle.de
GNU General Public License v3.0

Selection of labels via speech recognition #272

Open mzur opened 5 years ago

mzur commented 5 years ago

During a live video stream (#7) it might be useful to quickly select labels via speech recognition. There is the Web Speech API but unfortunately it doesn't seem to be usable quite yet. This blog post describes the situation in greater detail. I wasn't able to get the speech recognition demo to work in either Chromium or Firefox. Maybe the situation will improve during MI2 and we will eventually be able to implement this feature.

mzur commented 5 years ago

It works only in Chrome (not Chromium) because it needs to access the Google server for the speech recognition. This is only possible with an internet connection and the hard-coded API key of Chrome. The demo with the speech color changer works with this setup.

The speech recognition might not work very well with the specialized species names that are often the label names. Maybe we need to put a fuzzy string matching mechanism on top of it to match the recognized speech to the most likely label at hand.
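A minimal sketch of what such a fuzzy matching step could look like, assuming the recognizer returns a plain text transcript and the label names are available as a list of strings. It uses the Levenshtein edit distance, normalized by string length. The function names, the example label names and the 0.4 cutoff are made up for illustration and are not part of BIIGLE.

```typescript
// Classic dynamic-programming Levenshtein edit distance between two strings.
function levenshtein(a: string, b: string): number {
    // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
    let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution
            );
        }
        prev = curr;
    }
    return prev[b.length];
}

// Hypothetical helper: returns the label name closest to the transcript,
// or null if even the best candidate is too far off (assumed cutoff: 0.4).
function matchLabel(
    transcript: string,
    labelNames: string[],
    maxRelativeDistance = 0.4
): string | null {
    const heard = transcript.trim().toLowerCase();
    let best: string | null = null;
    let bestScore = Infinity;
    for (const name of labelNames) {
        const candidate = name.toLowerCase();
        // Normalize by the longer string so long species names are not penalized.
        const score = levenshtein(heard, candidate) / Math.max(heard.length, candidate.length, 1);
        if (score < bestScore) {
            bestScore = score;
            best = name;
        }
    }
    return bestScore <= maxRelativeDistance ? best : null;
}
```

For example, `matchLabel('holothuria', ['Holothuroidea', 'Ophiuroidea', 'Porifera'])` would pick `'Holothuroidea'`, while a transcript that resembles none of the names yields `null` so that no label is selected by accident.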

dlangenk commented 5 years ago

What about using synonyms like colors or numbers, which should be much easier to recognize? They are also shorter.

mzur commented 5 years ago

Numbers are a good idea for the label favourites which are already associated with the 0-9 keys. Colors might be harder because you probably can't remember the names for more than ten colors very easily, let alone choose exact colors when labels are created.
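The number idea could be sketched roughly like this, assuming the favourites can be looked up by the digit of their key. The `Label` shape, the `favouritesByKey` map and both function names are hypothetical and only serve the illustration.

```typescript
// Spoken words for the digits 0-9, indexed by the digit they stand for.
const NUMBER_WORDS: string[] = [
    'zero', 'one', 'two', 'three', 'four',
    'five', 'six', 'seven', 'eight', 'nine',
];

// Returns the key digit (0-9) for a recognized word, or null if it is no digit.
function spokenKey(transcript: string): number | null {
    const word = transcript.trim().toLowerCase();
    const index = NUMBER_WORDS.indexOf(word);
    if (index !== -1) {
        return index;
    }
    // Some recognizers return "3" instead of "three".
    return /^[0-9]$/.test(word) ? Number(word) : null;
}

// Hypothetical minimal label shape, not the actual BIIGLE model.
interface Label {
    id: number;
    name: string;
}

// Looks up the favourite that belongs to the spoken key, if there is one.
function selectFavouriteBySpeech(
    transcript: string,
    favouritesByKey: Map<number, Label>
): Label | null {
    const key = spokenKey(transcript);
    if (key === null) {
        return null;
    }
    return favouritesByKey.get(key) ?? null;
}
```

Anything that is not a single digit returns `null`, so unrelated speech during a stream would not change the selected label.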

mzur commented 3 years ago

There is a TensorFlow.js demo that works very well with numbers (even in Firefox, without the speech recognition API). This looks very promising and does not send data to Google servers.

mzur commented 3 years ago

I think there should also be a very prominent popup of the selected label (by speech) so the user knows if the correct label was selected. The label could pop up at the center of the annotation tool over the image/video and disappear after one or two seconds.

mzur commented 3 years ago

With the TensorFlow.js option, this becomes very viable. My idea:

dlangenk commented 2 years ago

The tf.js samples already include a library for this purpose. It seems fairly easy to use and performs reasonably well: https://github.com/tensorflow/tfjs-models/tree/master/speech-commands However, the implementation of the actual selection in BIIGLE (feedback, buttons etc.) still remains.
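The selection part could start from something like the sketch below. It assumes the recognizer hands back one score per word of its vocabulary, in the same order as the vocabulary list, which is my reading of the speech-commands README and not verified here. The function name, the 0.75 threshold and the convention that filler classes start with an underscore are assumptions.

```typescript
// Picks the most likely word from per-word scores.
// Returns null if the best score is below the threshold or the best class
// is a filler class (assumed to start with "_", e.g. noise or unknown).
function topWord(
    scores: ArrayLike<number>,
    words: string[],
    minScore = 0.75
): string | null {
    if (scores.length !== words.length) {
        throw new Error('Expected exactly one score per word.');
    }
    let bestIndex = -1;
    let bestScore = -Infinity;
    for (let i = 0; i < scores.length; i++) {
        if (scores[i] > bestScore) {
            bestScore = scores[i];
            bestIndex = i;
        }
    }
    if (bestIndex === -1 || bestScore < minScore) {
        return null;
    }
    const word = words[bestIndex];
    return word.startsWith('_') ? null : word;
}
```

The returned word could then be passed on to whatever selects the label and shows the feedback. A fairly high threshold seems sensible here because a wrongly selected label is worse than a missed command.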

mzur commented 2 years ago

Yep, this is what the demo linked above uses as well.

mzur commented 1 year ago

The magic SAM tool uses onnxruntime-web. If we implement speech recognition, we should try to use the same libraries.