VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Acceptable feature value types not well documented #2635

Open wmelton opened 3 years ago

wmelton commented 3 years ago

Description

Looking through the available vw documentation, I can't find any clear description of the valid input value types for features, or of how exactly vw interprets them under the covers once you move beyond strings and ints/floats.

More thorough documentation of the valid value types for features, and of how vw understands them, would be helpful to the community.

Of particular note in my mind (given the exponential increase in the use of learned embeddings across the field in recent years) would be an explanation of how vw handles arrays/lists of ints/floats as a feature value. Signed vs. unsigned features - does that matter?

If arrays/lists are acceptable value types for features, how are they treated?

For example, feature1=[0.3,-0.2,....,n] may be a learned BERT text embedding (of a news title, etc.) - how does vw interpret this? Does it understand that this is itself a feature vector from some other ML model, or does it see it as a string? Does vw "explode" it into an internally mapped one-hot encoding (not ideal)? Etc.

VW does not throw an error when a list of ints/floats is used as a feature value, but how vw understands it under the covers is unclear.

I'm happy to submit a pr on the documentation if someone can provide guidance on how these types are handled.
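For what it's worth, the way I've been poking at this so far is vw's --audit flag, which (if I'm reading the docs right) prints how each parsed feature is hashed and what value it ends up with. A minimal sketch using the Python bindings - the namespace and feature names are just illustrative, and I'm assuming the pyvw API from the version I have installed:

```python
# Minimal sketch (assumptions: the vowpalwabbit Python bindings expose pyvw.vw,
# and --audit prints each parsed feature's namespace, name, hash, and value).
from vowpalwabbit import pyvw

vw = pyvw.vw("--audit")

# Feed one line in vw text format and inspect how the "list" is actually parsed.
vw.learn("1 |ns feature1=[0.3,-0.2]")

vw.finish()
```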

Link to Documentation Page

https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format

peterychang commented 3 years ago

VW always treats a feature value as a float, so there is no concept of an unsigned feature (an int is simply treated as a floating-point number).

I believe you can model a feature list as a set of anonymous features in a separate namespace. @lokitoth It's been a while since I worked with some of the more advanced text features - could you explain how this works?

lokitoth commented 3 years ago

The way to think about anonymous features is that the "hash" of the feature name is trivial. However, to try to avoid collisions, the index of the feature within the namespace is determined by a counter of anonymous features. In other words, if you have a namespace:

|ns1 :1.2 :3.1 :0.4

What you would get are effectively three "dense" features in that namespace with offsets 0, 1, 2, respectively. If you look at https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/parse_example.cc#L184-L191, you will see that the final feature index (word_hash) is computed from the channel_index (which is the hash of the namespace name) and the anonymous feature number (_anon).
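So, to feed a dense embedding in as anonymous features, you would put it in its own namespace and emit one :value token per dimension. A rough sketch with the Python bindings - the namespace name emb, the label, and the values are all placeholders, and I'm going from memory on the pyvw API:

```python
# Rough sketch (assumptions: pyvw.vw and string-based learn/predict from the
# Python bindings; "emb" and the label "1" are illustrative only).
from vowpalwabbit import pyvw

embedding = [0.3, -0.2, 0.7, 0.1]  # e.g. a sentence embedding from another model

# One anonymous feature (":value") per dimension; their indices within the
# namespace are 0, 1, 2, ... in order of appearance, as described above.
example = "1 |emb " + " ".join(f":{v}" for v in embedding)
# -> "1 |emb :0.3 :-0.2 :0.7 :0.1"

vw = pyvw.vw("--quiet")
vw.learn(example)
print(vw.predict(example))
vw.finish()
```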

I agree with @wmelton that the documentation should be updated to include information about this form of feature input.

wmelton commented 3 years ago

@lokitoth Thanks for your reply and thoughts here.

A couple of implications/questions raised here in my mind:

  1. How does this impact latent variable computation in contextual bandit scenarios? Meaning, if we have a User and a Product namespace, we can use -q UP to learn interesting feature interactions. However, if some of our features are themselves learned embeddings (dense vectors) from some other ML process, and those must become namespaces of their own, I'm wondering whether vw can still find good feature interactions without needing exponentially more data to reach accurate predictions. That is, -q UPn, where we add n extra namespaces for already-vectorized embedding inputs, would seem to require much more data than -q UP before the predictions become meaningful (see the sketch after this list). Perhaps not, though! I'm just not personally familiar enough with vw to know how this would be handled.

  2. Is my thinking about this completely wrong? Meaning, should I be using vectors as features at all? Many very advanced neural networks exist to generate embeddings from text that retain strong contextual and semantic 'knowledge', which would seem far more useful than ngrams over a sequence of text. But I am not sure how vw's performance would compare to something like BERT embeddings for the same text sequence (sentence, title, phrase, etc.).
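To make question 1 concrete, here is roughly what I have in mind, again using the Python bindings. The namespace letters, feature names, and label values are all made up, and I'm not certain I have the cb_explore_adf label format exactly right:

```python
# Illustrative sketch only (assumptions: pyvw.vw, multi-line ADF examples passed
# as a list of strings, and "action:cost:probability" labels on the chosen action).
from vowpalwabbit import pyvw

emb = [0.12, -0.40, 0.88]  # an embedding produced by some other model

shared = "shared |User age:34 region=west"
chosen = "0:1.0:0.5 |Product category=books |E " + " ".join(f":{v}" for v in emb)
other = "|Product category=games |E :0.2 :0.1 :-0.3"

# -q UP interacts User x Product; -q UE additionally interacts User x embedding.
vw = pyvw.vw("--cb_explore_adf -q UP -q UE --quiet")
vw.learn([shared, chosen, other])
print(vw.predict([shared, chosen, other]))  # probability mass over the two actions
vw.finish()
```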