biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Allow Neighbors to accept sparse data #6749

Open wvdvegte opened 4 months ago

wvdvegte commented 4 months ago

What's your use case? I want to use Neighbors to search a corpus of documents for items similar to one or more reference documents. Since Neighbors requires that Reference and Data have the same features, I have to apply Text Embedding, Similarity Hashing, or Topic Modeling to represent the corpora quantitatively. But for most ML tasks with text, I find that Bag of Words usually produces more convincing results.

What's your proposed solution? Allow Neighbors to accept datasets with different features, at least when it comes to sparse data from Bag of Words. That is, before computing distances, words that occur in Reference but not in Data would be added to Data with value 0, and vice versa.
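To illustrate the idea, here is a minimal plain-Python sketch of that zero-filling step. It is not how Neighbors or Bag of Words work internally: the dict-per-document representation, the helper names `align` and `cosine_distance`, and the choice of cosine distance are all assumptions made for the example.

```python
from math import sqrt


def align(reference, data):
    """Give both sets the same features: the sorted union of all words.

    Each document is a dict {word: count}; a word missing from a
    document gets the value 0 (hypothetical helper, not an Orange API).
    """
    vocab = sorted({word for row in reference + data for word in row})

    def to_dense(rows):
        return [[row.get(word, 0) for word in vocab] for row in rows]

    return vocab, to_dense(reference), to_dense(data)


def cosine_distance(a, b):
    """Cosine distance between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy bag-of-words rows with different vocabularies (made-up data).
reference = [{"orange": 2, "data": 1}]
data = [{"orange": 1, "mining": 1}, {"sparse": 3}]

vocab, ref_rows, data_rows = align(reference, data)
distances = [cosine_distance(ref_rows[0], row) for row in data_rows]
```

After alignment both sets share one feature list, so any distance that expects equal-length vectors can be computed between a reference document and each data document.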

Are there any alternative solutions? Not that I'm aware of.

wvdvegte commented 4 months ago

There is an alternative solution, although it is a bit cumbersome: concatenate Reference and Data before Bag of Words (this requires that they have more or less the same variables), then separate them again after Bag of Words with Select Rows, using some criterion that distinguishes Reference from Data. Finally, connect Matching Data to the Reference input of Neighbors and Non-matching Data to the Data input. As I said, rather cumbersome, but it works.

markotoplak commented 4 months ago

@wvdvegte, you could probably also use the Apply Domain widget.

But I agree, this should be done automatically. We discussed this, and internally we should apply the domain of the data to the reference when comparing.
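As a rough illustration of what "applying the domain of the data onto the reference" means for bag-of-words counts, here is a plain-Python sketch. It does not use Orange's API; `apply_domain` is a hypothetical helper, and filling absent words with 0 is an assumption that fits word counts but not necessarily other variable types.

```python
def apply_domain(rows, domain):
    """Re-express each row over `domain` (hypothetical helper).

    Words in the domain are kept, words outside it are dropped,
    and domain words absent from a row are filled with 0.
    """
    return [{word: row.get(word, 0) for word in domain} for row in rows]


# Toy bag-of-words rows (made-up data).
data = [{"orange": 1, "mining": 1}, {"sparse": 3}]
reference = [{"orange": 2, "data": 1}]

# The "domain" here is simply the list of features seen in Data.
data_domain = sorted({word for row in data for word in row})
projected = apply_domain(reference, data_domain)
```

Unlike taking the union of both vocabularies, this projection discards reference words that never occur in Data ("data" in the toy example), so distances that are sensitive to those words may come out differently.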

wvdvegte commented 4 months ago

Indeed, in my use case Apply Domain produces processable inputs for Neighbors, too. However, although it keeps the text in the corpus, for every row it sets all variables that are not sparse to either '?' or 'nan'. Is this intended behavior? If I'm correct, 'nan' means 'not a number', which doesn't make sense for variables that were never defined as numeric.