mzur opened this issue 5 years ago
It only works with Chrome (not Chromium) because it sends the audio to a Google server for recognition, which requires an internet connection and Chrome's hard-coded API key. The speech color changer demo works with this setup.
The speech recognition might not work very well with the specialized species names that often serve as label names. Maybe we need to put a fuzzy string matching mechanism on top of it to map the recognized speech to the most likely label at hand.
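A fuzzy matching step could be sketched roughly like this, using plain Levenshtein edit distance. This is only an illustration, not actual BIIGLE code; the function names, the label list, and the relative-distance threshold are assumptions.

```javascript
// Classic dynamic-programming edit distance between two strings.
function levenshtein(a, b) {
  const dp = Array.from({length: a.length + 1}, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Return the label name closest to the transcript, or null if even the
// best match is too far off (relative distance threshold is a guess).
function matchLabel(transcript, labels, maxRelativeDistance = 0.5) {
  const t = transcript.toLowerCase().trim();
  let best = null;
  let bestDist = Infinity;
  for (const label of labels) {
    const d = levenshtein(t, label.toLowerCase());
    if (d < bestDist) {
      bestDist = d;
      best = label;
    }
  }
  if (best === null || bestDist / Math.max(t.length, best.length) > maxRelativeDistance) {
    return null;
  }
  return best;
}

// A slightly garbled transcript still resolves to the intended label:
console.log(matchLabel('lofelia pertusa', ['Lophelia pertusa', 'Paragorgia arborea']));
```

The threshold keeps wildly wrong transcripts from silently selecting a label; a real implementation would probably tune it against actual recognizer output.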
What about using synonyms such as colors or numbers, which should be much easier to recognize? They are also shorter.
Numbers are a good idea for the label favourites, which are already associated with the 0-9 keys. Colors might be harder because you probably can't easily remember the names of more than ten colors, let alone choose exact colors when labels are created.
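Mapping recognized number words onto the existing favourite slots could look something like the sketch below. The favourite list and function name are hypothetical; only the idea of mirroring the 0-9 hotkeys comes from the discussion.

```javascript
// The ten number words that would mirror the 0-9 favourite hotkeys.
const NUMBER_WORDS = [
  'zero', 'one', 'two', 'three', 'four',
  'five', 'six', 'seven', 'eight', 'nine',
];

// Return the favourite label for a recognized word, or null if the word
// is not a number word or no favourite is assigned to that slot.
function favouriteForWord(word, favourites) {
  const index = NUMBER_WORDS.indexOf(word.toLowerCase().trim());
  if (index === -1) return null;
  return favourites[index] ?? null;
}

const favourites = ['Lophelia pertusa', 'Sea star', 'Shrimp'];
console.log(favouriteForWord('one', favourites));   // → 'Sea star'
console.log(favouriteForWord('seven', favourites)); // → null (slot empty)
```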
There is a TensorFlow.js demo that works very well with numbers (even in Firefox, which lacks the speech recognition API). This looks very promising and does not send any data to Google servers.
I think there should also be a very prominent popup showing the label selected by speech, so the user knows whether the correct label was chosen. The label could pop up at the center of the annotation tool, over the image/video, and disappear after one or two seconds.
With the TensorFlow.js option, this becomes very viable. My idea:
The tf.js models repository already includes a library for this purpose, which seems fairly easy to use and performs reasonably well: https://github.com/tensorflow/tfjs-models/tree/master/speech-commands However, the actual implementation of label selection, feedback, buttons etc. in BIIGLE still remains to be done.
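The wiring could be sketched as follows. The recognizer calls (`create`, `ensureModelLoaded`, `listen`, `wordLabels`) follow the speech-commands README; the `pickWord()` helper, the `selectLabel()` callback, and the confidence threshold are assumptions for illustration.

```javascript
// Pure decision step: return the word with the highest score, or null if
// the recognizer is not confident enough. Testable without a browser.
function pickWord(scores, words, threshold = 0.75) {
  let best = -1;
  let bestScore = -Infinity;
  for (let i = 0; i < scores.length; i++) {
    if (scores[i] > bestScore) {
      bestScore = scores[i];
      best = i;
    }
  }
  return bestScore >= threshold ? words[best] : null;
}

// Browser-only wiring; skipped when this file runs outside a browser.
if (typeof window !== 'undefined') {
  const speechCommands = require('@tensorflow-models/speech-commands');
  // The BROWSER_FFT vocabulary includes 'zero'..'nine' among its words.
  const recognizer = speechCommands.create('BROWSER_FFT');
  recognizer.ensureModelLoaded().then(() => {
    recognizer.listen(async (result) => {
      const word = pickWord(Array.from(result.scores), recognizer.wordLabels());
      if (word !== null) {
        selectLabel(word); // hypothetical BIIGLE callback, not real code
      }
    }, {probabilityThreshold: 0.75});
  });
}
```

Keeping the decision logic in a pure function like `pickWord()` would let the selection behavior be unit-tested without microphone access.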
Yep, this is what the demo linked above uses as well.
The magic SAM tool uses onnxruntime-web. If we implement speech recognition, we should try to use the same libraries.
During a live video stream (#7) it might be useful to quickly select labels via speech recognition. There is the Web Speech API, but unfortunately it doesn't seem to be quite usable yet. This blog post describes the situation in greater detail. I wasn't able to get the speech recognition demo to work in either Chromium or Firefox. Maybe the situation will improve during MI2 so we can eventually implement this feature.