biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Privacy violation by only offering online embeddings! #1057

Open Bardo-Konrad opened 5 months ago

Bardo-Konrad commented 5 months ago

Document Embeddings does not allow local models and therefore creates a privacy hazard.

As I don't assume that this was done due to malicious design by the Bioinformatics Lab at University of Ljubljana, Slovenia, you need to fix this and enable local open source models.

markotoplak commented 5 months ago

Thanks, we would also prefer to have a local option. Do you know of any small models that are easily pip-installable, preferably without something like a 1 GB dependency?

Bardo-Konrad commented 5 months ago

You could try small language models such as Gemini Nano or Orca-2-7b, and more generally use spaCy, as in:

# Install spacy
pip install -U spacy

# Download the small English model
python -m spacy download en_core_web_sm

import spacy

# Load the installed model
nlp = spacy.load("en_core_web_sm")

# Use the model
doc = nlp("This is a sentence.")

ajdapretnar commented 4 months ago

spaCy would be super beneficial for adding a named entity recognition option! Perhaps also a way to add Chinese tokenisation. Note that spaCy would not cover 17 of the languages that fastText does (Catalan, Croatian, Lithuanian, Macedonian, Ukrainian, Arabic, Azerbaijani, Bengali, Hindi, Tajik, Turkish, Norwegian Nynorsk, Nepali, Kazakh, Indonesian, Hungarian, Hebrew), nor another 25 languages that multilingual SBERT covers. However, as an option, it would be great to have! spaCy's smallest English model is 12 MB, plus an additional 11 MB for the spaCy dependency itself.
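
To make the "local as an option" idea concrete, here is a minimal sketch of how a local spaCy pipeline could sit next to the existing online embedders. The names `NOT_IN_SPACY`, `pick_backend` and `embed_documents` are hypothetical and not part of orange3-text. The language list is copied from the comment above and has not been checked against current spaCy releases. The sketch also assumes that any language outside that list has a usable spaCy model, which would need verifying.

```python
from typing import Any, Callable, Iterable, List

# Languages listed in this thread as covered by fastText but not by spaCy.
# Copied from the comment above; not verified against current spaCy releases.
NOT_IN_SPACY = frozenset({
    "Catalan", "Croatian", "Lithuanian", "Macedonian", "Ukrainian",
    "Arabic", "Azerbaijani", "Bengali", "Hindi", "Tajik", "Turkish",
    "Norwegian Nynorsk", "Nepali", "Kazakh", "Indonesian", "Hungarian",
    "Hebrew",
})


def pick_backend(language: str) -> str:
    """Return "online" for languages in NOT_IN_SPACY, otherwise "local".

    Assumption: languages outside the list have a spaCy model available.
    """
    return "online" if language in NOT_IN_SPACY else "local"


def embed_documents(texts: Iterable[str], nlp: Callable[[str], Any]) -> List[Any]:
    """Embed each text locally using the pipeline's `Doc.vector`.

    `nlp` is any callable that returns an object with a `.vector`
    attribute, for example the result of spacy.load("en_core_web_sm").
    """
    return [nlp(text).vector for text in texts]
```

With spaCy and the small English model installed, this could be called as `embed_documents(["This is a sentence."], spacy.load("en_core_web_sm"))`. As far as I know, the small models ship without static word vectors, so the resulting vectors are likely to be weaker than those from fastText or SBERT.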