GokuMohandas / Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.
https://madewithml.com
MIT License

Which kind of model is better for keyword-set classification? #164

Closed by guotong1988 4 years ago

guotong1988 commented 4 years ago

There is a similar, well-known task called text classification.

But I am looking for a kind of model whose input is a set of keywords, where the keywords do not come from a sentence.

For example:

input ["apple", "pear", "water melon"] --> target class "fruit"
input ["tomato", "potato"] --> target class "vegetable"

Another example:

input ["apple", "Peking", "in summer"]  -->  target class "Chinese fruit"
input ["tomato", "New York", "in winter"]  -->  target class "American vegetable"
input ["apple", "Peking", "in winter"]  -->  target class "Chinese fruit"
input ["tomato", "Peking", "in winter"]  -->  target class "Chinese vegetable"

Thank you.

GokuMohandas commented 4 years ago

Hey @guotong1988, you'll want to first gather enough data for the types of entities (fruit, vegetable, etc.) that you care about. You can use an off-the-shelf set of pretrained embeddings (e.g. GloVe) as input features: these are common tokens, and because such embeddings are learned from large, generic datasets, the embeddings for entities in the same class should already be clustered together.
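As a rough illustration of the embedding idea above, here is a minimal sketch that mean-pools the vectors of a keyword set (so keyword order does not matter) and assigns the class with the nearest centroid. The 3-dimensional vectors in `EMBEDDINGS` are made-up stand-ins for real pretrained embeddings such as GloVe, and `embed_set`, `fit_centroids`, and `predict` are hypothetical helper names; a nearest-centroid rule is just one simple choice of classifier, not something prescribed in this thread.

```python
import math
from collections import defaultdict

# Toy 3-d vectors standing in for pretrained embeddings (values are made up).
EMBEDDINGS = {
    "apple": [0.90, 0.10, 0.00],
    "pear": [0.80, 0.20, 0.10],
    "melon": [0.85, 0.15, 0.05],
    "water": [0.40, 0.30, 0.60],
    "tomato": [0.20, 0.90, 0.10],
    "potato": [0.10, 0.80, 0.20],
    "carrot": [0.15, 0.85, 0.10],
}


def embed_set(keywords):
    """Mean-pool the vectors of all known tokens; keyword order is irrelevant."""
    tokens = [t for kw in keywords for t in kw.lower().split() if t in EMBEDDINGS]
    if not tokens:
        raise ValueError("no known tokens in keyword set")
    vectors = [EMBEDDINGS[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_centroids(examples):
    """Average the pooled vectors of each class into one centroid per class."""
    grouped = defaultdict(list)
    for keywords, label in examples:
        grouped[label].append(embed_set(keywords))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, keywords):
    """Return the class whose centroid is most similar to the pooled keyword vector."""
    vector = embed_set(keywords)
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


train = [
    (["apple", "pear", "water melon"], "fruit"),
    (["tomato", "potato"], "vegetable"),
]
centroids = fit_centroids(train)
print(predict(centroids, ["pear", "apple"]))
print(predict(centroids, ["carrot", "potato"]))
```

With real embeddings and more data, the pooled vector could instead be fed to any standard classifier (logistic regression, a small MLP, and so on).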

In the second example, where you have labels like "Chinese fruit", you'll want to treat this as a multilabel classification problem (e.g. the output is [0, 1, 1, 0], with several labels active at once, instead of one unique class like [0, 1, 0, 0]). Alternatively, you can just make more classes like "fruit" and "Chinese fruit", but your model is likely to start confusing classes because there will be a lot of overlap. You can also create two separate models, one to predict "fruit" and one to predict "Chinese" from the set of keywords, but this assumes every example has both kinds of label.
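The multi-hot target described above (several labels active at once, usually called multilabel classification) can be sketched as follows. The label vocabulary `LABELS`, the split of "Chinese fruit" into the tags `chinese` and `fruit`, the helper names `encode` and `decode`, and the 0.5 threshold are all assumptions made for illustration; the scores passed to `decode` are made-up values standing in for per-label sigmoid outputs of a model.

```python
# Hypothetical label vocabulary: each compound class is split into independent tags.
LABELS = ["american", "chinese", "fruit", "vegetable"]


def encode(tags):
    """Turn a set of tags into a multi-hot target vector aligned with LABELS."""
    tags = set(tags)
    unknown = tags - set(LABELS)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return [1 if label in tags else 0 for label in LABELS]


def decode(scores, threshold=0.5):
    """Keep every label whose independent score reaches the threshold."""
    return [label for label, score in zip(LABELS, scores) if score >= threshold]


examples = [
    (["apple", "Peking", "in summer"], {"chinese", "fruit"}),
    (["tomato", "New York", "in winter"], {"american", "vegetable"}),
    (["apple", "Peking", "in winter"], {"chinese", "fruit"}),
    (["tomato", "Peking", "in winter"], {"chinese", "vegetable"}),
]

targets = [encode(tags) for _, tags in examples]
print(targets[0])  # [0, 1, 1, 0]

# Made-up per-label scores, as a model with one sigmoid output per label might produce.
print(decode([0.1, 0.8, 0.7, 0.2]))  # ['chinese', 'fruit']
```

Because each label gets its own output, "Chinese vegetable" can be predicted without a dedicated class for that combination.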

Hope that helps.