nzhiltsov opened 7 years ago
Did you ever figure this out?
I'm also interested in whether there's a way to add some other type of metadata as a hint for training, and to have it weigh more heavily. For example, a topic: if sentences fall within the same topic, that should be a more strongly weighted matching factor...
To add a topic, just add a word with the name of the topic. It will be associated with a vector that will participate in the construction of the model.
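A minimal sketch of what that suggestion looks like in practice, assuming fastText's supervised input format (`__label__<label> <text>`) and a hypothetical `topic_` prefix to keep topic tokens distinct from ordinary words:

```python
# Prepend a topic token to each fastText training line. Any extra token
# (here "topic_<name>") gets its own embedding during training and thus
# contributes to classification, though with no special weighting.
def add_topic_token(label: str, topic: str, text: str) -> str:
    return f"__label__{label} topic_{topic} {text}"

line = add_topic_token("positive", "sports", "great match last night")
print(line)  # __label__positive topic_sports great match last night
```

The `add_topic_token` helper and the `topic_` prefix are illustrative choices, not part of fastText itself.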
@pommedeterresautee That will add the topic, but I wanted the metadata to have an extreme weighting on the classification, almost to the point that the absence of that metadata would remove a sample from that category. I guess I could do that with some prefiltering... but I wanted to keep a single pipeline.
I've spent some time implementing this today; so far there's something terribly wrong in my loss function. Here's my repo where training data is treated as weighted: toy training data is here, and the weights trickle all the way down to where the loss and gradients are computed/applied.
My spike there breaks the tight coupling of the embedding and classification modes, introducing specific functions for handling classification. In a certain cheeky vein, it seems the original repo made it a point to conflate the two, whereas, except for sharing some matrix utility functions, there's very little to topically relate them beyond the concurrent HOGWILD-inspired multi-threaded implementation and the dimension of the embedding. You could implement the classification algorithm on top of any word embeddings you have, regardless of the fastText-specific embedding model; I think this actually improves modularity, but I doubt anyone would merge a PR for something this specific (that is, if it actually worked).
EDIT: I now realize this issue is about weighting data samples, which is not what I've been working towards here. Sorry about that ;) it was a false positive on my part while searching for a related open issue to avoid opening a new one.
I'm curious about this as well. Our dataset has 95% 'positive' (A) examples and only 5% 'negative' (B) examples -- the model often converges to just "always choose A".
I'm also curious whether a clean solution to that problem has been implemented.
One workaround is to duplicate some training samples proportionally to the desired weights, but that only works for integer weights and isn't great.
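For what it's worth, the duplication workaround can be sketched in a few lines of Python before writing the fastText training file. The `expand_by_weight` helper and the sample weights here are hypothetical, just to illustrate the idea:

```python
# Repeat each training line proportionally to its integer weight, so that
# heavier samples contribute more gradient updates during training.
def expand_by_weight(samples):
    """samples: iterable of (line, integer_weight) pairs."""
    expanded = []
    for line, weight in samples:
        expanded.extend([line] * weight)
    return expanded

data = [("__label__A some frequent example", 1),
        ("__label__B a rare but important example", 3)]
for line in expand_by_weight(data):
    print(line)
```

As noted above, this breaks down for fractional weights and inflates the training file, so it's a stopgap rather than a real fix.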
I tried this here: https://github.com/Whadup/weightedFastText. It's not up to date with the current fastText version, though.
Hello, everyone. Is there any other update on this subject (adding weights to specific samples)?
I'm wondering how one can add sample weights to increase the importance of some training examples in 'fasttext supervised' mode. For an analogy, see e.g. the 'sample_weight' array parameter when calling model.fit in Keras or computing accuracy_score in sklearn.
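To make the analogy concrete, here is the sample-weight semantics being asked for, written in plain Python rather than sklearn (this mirrors `accuracy_score(..., sample_weight=...)`, where each example counts proportionally to its weight):

```python
# Weighted accuracy: each prediction contributes its sample weight rather
# than a flat count of 1, so heavier examples matter more to the score.
def weighted_accuracy(y_true, y_pred, sample_weight):
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)

score = weighted_accuracy(["A", "A", "B"], ["A", "B", "B"], [1.0, 1.0, 3.0])
print(score)  # 0.8
```

The ask in this issue is for the same kind of weighting on the *training* side of `fasttext supervised`, which the CLI does not expose.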