biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Data Sampler: handle bag of words #5942

Closed ajdapretnar closed 2 years ago

ajdapretnar commented 2 years ago

What's wrong?

I suspect Data Sampler doesn't work well with bag-of-words/sparse data. Not sure. Here's the original issue: https://github.com/biolab/orange3-text/issues/809

How can we reproduce the problem?

See https://github.com/biolab/orange3-text/issues/809. Could be just text-related, but I doubt it.

What's your environment?

djukicn commented 2 years ago

This doesn't seem to be Data Sampler's fault. Replacing it with Corpus Viewer and selecting a few documents will cause the same error. The issue is with how Corpus indexes BoW features. This will be fixed in orange3-text.

noahnovsak commented 2 years ago

Yes, selecting documents through the Select Columns widget also causes the error. However, sampling the corpus before BoW seems to avoid the issue.

PrimozGodec commented 2 years ago

The Data Sampler is not the problem here. It is actually a topic modelling/ngram_corpus issue. Closing this issue since it must be solved in text via https://github.com/biolab/orange3-text/issues/809