facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Sample weights (weighted loss) for classification #278

Open nzhiltsov opened 7 years ago

nzhiltsov commented 7 years ago

I'm wondering how one can add sample weights to increase the importance of some training examples in 'fasttext supervised' mode. See, for an analogy, the 'sample_weight' array parameter of model.fit in Keras or of accuracy_score in sklearn.
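
For reference, this is roughly the kind of API I have in mind (a minimal Keras sketch; the data, shapes, and model here are just placeholders, not anything fastText-specific):

```python
import numpy as np
from tensorflow import keras  # assuming TensorFlow/Keras is available

# Toy data: 100 samples, 20 features, binary labels (all placeholders)
x_train = np.random.rand(100, 20).astype("float32")
y_train = np.random.randint(0, 2, size=100)

# Per-sample weights: make the last 50 examples count twice as much in the loss
sample_weight = np.ones(100, dtype="float32")
sample_weight[50:] = 2.0

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Keras scales each example's contribution to the loss by its weight
model.fit(x_train, y_train, sample_weight=sample_weight, epochs=1, verbose=0)
```

Something equivalent for fasttext supervised training is what I'm after.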

dcsan commented 7 years ago

Did you ever figure this out?

I'm also interested in whether there's a way to add some other type of metadata as hints for training, and have it carry more weight. For example, a topic: if sentences are within a topic, that should be a more heavily weighted matching factor...

pommedeterresautee commented 7 years ago

To add a topic, just add a word with the name of the topic. It will be associated with a vector that will participate in the construction of the model.
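
A minimal sketch of what I mean, assuming the official fasttext Python bindings; the labels, topic prefix, and file name are placeholders:

```python
import fasttext  # pip install fasttext (official bindings assumed)

# Hypothetical examples: (label, topic, text)
examples = [
    ("__label__billing", "payments", "my card was charged twice"),
    ("__label__account", "login", "i cannot sign in to my account"),
]

# Prepend the topic as an ordinary token; fastText learns a vector for it,
# so it contributes to the averaged sentence representation like any word.
with open("train.txt", "w") as f:
    for label, topic, text in examples:
        f.write(f"{label} topic_{topic} {text}\n")

model = fasttext.train_supervised("train.txt", epoch=5)
print(model.predict("topic_payments why was i charged twice"))
```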

dcsan commented 7 years ago

@pommedeterresautee That will add the topic, but I wanted the metadata to have an extreme weighting on the classification, almost to the point that the absence of that metadata would remove the example from that category. I guess I could do that with some prefiltering... but I wanted to keep a single pipeline.

matanox commented 6 years ago

I've spent some time implementing this today; so far something is terribly wrong in my loss function. Here's my repo where training data is treated as weighted. There is toy training data there, and the weights trickle all the way down to where the loss and gradients are computed/applied.

My spike there breaks the tight coupling of the embedding and classification modes, introducing specific functions for handling classification. In a slightly cheeky vein, it seems the original repo made a point of conflating the two, whereas, apart from sharing some matrix utility functions, there is very little that topically relates them other than the concurrent HOGWILD-inspired multi-threaded implementation and the dimension of the embedding. You could implement the classification algorithm on top of any word embeddings you have, regardless of the fastText-specific embedding model; so I think this actually improves modularity, but I doubt anyone would merge a PR for something this specific (that is, if it actually worked).
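
To illustrate the "weight trickles down to the loss and gradient" idea in isolation (this is only a toy numpy sketch of a weighted softmax classifier, not fastText's actual C++ code or my spike):

```python
import numpy as np

def weighted_softmax_sgd_step(W, x, y, weight, lr=0.1):
    """One SGD step on a plain softmax classifier where the example's loss,
    and therefore its gradient, is scaled by a per-example weight."""
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logits = probs.copy()
    grad_logits[y] -= 1.0                        # d(cross-entropy)/d(logits)
    W -= lr * weight * np.outer(grad_logits, x)  # the weight scales the whole update
    return -weight * np.log(probs[y])            # weighted negative log-likelihood

# Tiny usage example: 3 classes, 5-dimensional input, sample weight of 2.0
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
loss = weighted_softmax_sgd_step(W, rng.normal(size=5), y=1, weight=2.0)
```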

EDIT: I now realize this issue is about weighting data samples, which is not what I've been working towards here. Sorry for that ;) it was a false positive on my part while searching for a related open issue to avoid opening a new one.

Meekohi commented 5 years ago

I'm curious about this as well. Our dataset has 95% 'positive' (A) examples and only 5% 'negative' (B) examples -- the model often converges to just "always choose A".

cramdoulfa commented 4 years ago

Also curious whether there is a clean solution to this problem implemented.

One workaround is to duplicate some training samples proportionally to the desired weights, but it only works for integer weights and it's not great.
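
A minimal sketch of that duplication workaround: repeat each training line according to an integer weight before writing the fastText training file (the file name, labels, and weights below are placeholders):

```python
# (label + text) lines in fastText supervised format, with integer weights
weighted_lines = [
    ("__label__A an ordinary example", 1),
    ("__label__B a rare but important example", 5),
]

with open("train_weighted.txt", "w") as f:
    for line, weight in weighted_lines:
        f.write((line + "\n") * weight)  # integer weights only
```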

Whadup commented 3 years ago

I tried it here: https://github.com/Whadup/weightedFastText. It's not up to date with the current fastText version, though.

carlos-rafael commented 1 year ago

Hello, everyone. Is there any other update on this subject (adding weights to specific samples)?