VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Another long feedback #510

Closed trufanov-nok closed 8 years ago

trufanov-nok commented 9 years ago

Hi,

Below is my feedback from the last 3 months of VW usage. It's mostly feature requests, some ideas, and only a few non-critical bugs.

I. VW

I spent the last 3 months competing in kaggle's avazu contest using only VW and ended up 29th on the private leaderboard. I suspect many VW users got a better result, but I used only one model, no ensembles, and was short of RAM, so that seems to be the best I could do. I used a fork of VW made 3 months ago and implemented dozens of heuristics in it. Some of them may be interesting and a few are definitely worth discussion. On the other hand, I realised that several bugs I found (mostly with new_mf) were already fixed in the main branch recently, so I just wasted time on them; that was the price I paid for hiding the info till the competition ended.

1. Namespace redefinition

The first thing I implemented in my VW fork is a --redefine parameter which allows me to rename any namespace or set of namespaces. I gave a new namespace to each feature of the original dataset, taking names from the [a-Z] range. Then I declared a parameter --redefine which takes a string argument in the form x:=n, where x is a list of old namespaces, := is an operator and n is the new namespace to which they should be assigned. That allowed me to put something like the following in my vw command line:

--redefine abcdefgihjklmnoprstquwz:=L --redefine abcdegf:=a --ignore L --q aa

In the example above we assign all our namespaces to namespace L and then (order matters) reassign some of them to namespace a. As all features got their own namespaces, namespace a will contain 7 features and L the remaining 23-7=16. Namespace a is then used to generate quadratic features.

Why is it convenient? I can now experiment with different features used for quadratic feature generation by just adding or removing literals in --redefine abcdegf:=a. In practice I wrote a bash script which started with abcdefgihjklmnoprstquwz:=a as a baseline and used leave-one-out with validation to find features whose removal could improve the result.

This parameter also allows you to do something like

--redefine abc:=a --ignore a --redefine def:=b --redefine klmn:=c -q bc

All namespaces which were not redefined will be used as before - as ordinary features.

Another reason I'm so enthusiastic about this feature is that it's very easy to implement without any performance penalties. My implementation is described in the appendix.

2. Simple combinations in -q and --cubic

Sometimes kaggle's datasets contain hashed data to hide sensitive info. Moreover, sometimes the column names are meaningless too and we can only guess what categorical feature each represents. In such circumstances you may often find something like -q aa or --cubic aaa in vw models. That's because there are too many features to specify -q for each pair, and it's better to put all features in one namespace which is combined with itself. And there is no insight on how that namespace might be split into subsets to improve the score. It could be done by just trying all combinations with a script, but it's easier to play with including/excluding features in a namespace while keeping -q aa than to generate dozens of -q params for namespaces holding a single feature. And we can rely on GD + regularization to get rid of useless pairs.
What I'm trying to persuade you of is that I'm not stupid and sometimes -q xx and --cubic xxx (with equal x) params are used. I have even seen such command lines in some discussions and blogs.

Now, assuming -q aa -like params are used, perhaps widely, we can recall that -q ab and -q ba (for namespaces with single features) are equal, as are all permutations of abc for --cubic. Well, technically ab and ba will generate features with different addresses in the regressor because of quadratic_constant, but if applied to all examples in the dataset, the effect on avg loss shall be equal.

For example, for -q aa where a contains features x and y we'll get all permutations: xx, xy, yx, yy. Although it's obvious that only simple combinations without repetitions make sense: xy. (yx will be equal, and xx, yy are equal to just x and just y for binary features.)

If we consider -q aa --cubic aaa for namespace a with 30 features in it we'll get 30 + (-q) 30^2 + (--cubic) 30^3 features = 27930 features per example. And for simple combinations without repetitions we should get 30 + (-q) 30!/2!/28! + (--cubic) 30!/3!/27! features = 30 + 29*30/2 + 28*29*30/6 = 4525 features per example.

Using simple combinations instead of permutations in the case where the interacting namespaces are the same significantly increased the speed of my model training (because of the smaller number of features per example), the speed of convergence of GD (fewer passes needed) and (if I recall right) the final model result too. Thus I believe simple combinations mode shall be enabled by default.

My current implementation is quite messy. Some ideas regarding better implementation will be discussed further.

3. Caching generated features

Currently the generation of quadratic and cubic features is done in the GD::FOREACH_FEATURE template and they aren't stored anywhere. In kaggle's contests, datasets usually have dozens of columns and sometimes hundreds of them (the Africa soil prediction competition had ~1200 examples with ~3500 features each). Thus successful models usually deal with thousands of features, and most of them are generated on the fly in this template method. It's known that even in the simple gd algorithm FOREACH_FEATURE is called several times. I have implemented simple caching for generated features by just storing them in a new namespace with a non-printable name (0x07 for example). The cached values are used if such a namespace exists, or generated and stored in it otherwise.

There may be another approach - creating a new v_array for such features - but it's not as convenient because of I.5. I haven't measured the performance impact. It shouldn't be big, if any, except for some extreme datasets.

4. Features with higher orders of interaction

Thanks to I.2 my laptop's performance allowed me to try features with more than 3 orders of interaction (cubic). It didn't improve my model (in I.5 I found that only a few high-order combinations may improve it), but it was still an interesting attempt. Keeping in mind I.2 and I.3, I would rewrite the FOREACH_FEATURE template entirely. What if we put the feature generation code out of GD:: and keep it as a separate module, since some other reductions call this method too? We could also replace the -q and --cubic parameters with only one (let it be -q) and detect the order of feature interaction by the length of the passed argument. So --cubic abc becomes -q abc (the old --cubic may still be supported for compatibility). And my command line would be, for example, vw test.vw -q ab -q abc -q cde -q abbb.
Loops over namespaces for feature generation may be done with recursive calls. Literals in -q arguments may be sorted in ascending order at the param processing step. In that case, letters which are repeated several times in an argument (like -q abcb) will end up in neighbouring calls. So we can just check in the recursive function whether the next namespace is the same as the previous one and vw is in simple combination mode, then pass a boolean indicating that the loop shall start from i=index+1 instead of i=1.
I would also implement caching in it and make a template for audit_features to generate and cache features in audit mode.

5. Regressor audit trick

Once I decided to audit a model whose training had already consumed 5 of the 8 available GB of RAM. Thus I couldn't do it without decreasing -b due to lack of RAM. So I've added a new mode to vw to get the values of my regressor together with feature names. It works in the following way:

In this manner you can process a regressor of any size and get its weights with corresponding feature names.

This approach requires adding support for audit_data to FOREACH_FEATURE. It's as fast as test mode and takes just a few MB more memory (feature names in audit data).
It may also have an early termination mode, as we know a priori the number of non-empty features (from save_load_regressor()) and can decrease this number every time a feature gets the magic number instead of a weight. Work can be halted after this counter reaches 0. I didn't implement this enhancement.

Below are some optional details:
In practice I applied this to a regressor obtained by training on a 3.5M-example dataset with ~9k features per example (made with -q and --cubic). I don't remember how many examples I processed, but eventually I halted VW as the output file grew to 10 GB and I realised that I couldn't process a bigger file in R anyway. It also needs to be said that my feature names were in the following format: namespace_value, e.g. |device_id device_id_eca8120f. Generated cubic features had names like site_id_1984e20a^device_id_eca8120f^banner_pos_0. Thus I was able to cut the value parts out of them with the unix sed command. Then I loaded the resulting <namespace, weight> pairs into R and got a much smaller factor for namespaces. Then I analysed the means of the namespace weights and the number of their occurrences to remove observations which were too rare. This allowed me to identify several synthetic namespaces (generated with --cubic) which had a significantly bigger avg weight, and I could improve my model by generating features with 4-5-6 interactions based on them. I didn't use any FOREACH_FEATURE modification for that. I just added these synthetic features to the dataset as new features, and --cubic added 2 additional levels of interaction to them.

6. Using approach above to get input file for libsvm?

Just a thought: I considered making a vw mode which allows generation of input file for libsvm. Didn't do it for real.

Vanilla LibSVM has a similar format but doesn't support categorical features in the input file. As I understand, there are tools for LibSVM which can convert categorical data to a numeric format using hashing or perhaps dictionaries (I didn't check how exactly it is done). But I wanted to try features generated by -q and --cubic too, and it looks like there is no equivalent functionality in LibSVM. What I was thinking of doing is to make a mode in vw like the one described in I.5, but with only one step - vw reads an example, generates the -q/--cubic features and stores the hashes of these features in an output file in LibSVM format. No learning is done. That could be made just by writing a new function for the FOREACH_FEATURE template. But I didn't give it a try, as according to my estimations my 5 GB input file would become 500 GB with all cubic features inside. So it's just an idea; perhaps someone will find it more useful than me.

7. Time spent in VW's output

Like II.3, I would request printing the time that each outputted iteration took since the last printed value. Let's say a --time switch which is off by default so as not to break all the 'make test' cases. It may look like this:

Reading datafile = train_freq.vw
num sources = 1
average    since         example     example  current  current  current     time
loss       last          counter      weight    label  predict features    spent
0.775936   0.775936            1         1.0  -1.0000   0.1592       33 00:00:00
0.728623   0.681310            2         2.0  -1.0000  -0.0238       33 00:00:00
0.708393   0.688163            4         4.0  -1.0000  -0.0018       33 00:00:00
0.737389   0.766386            8         8.0  -1.0000   0.2023       33 00:00:01
0.796014   0.854638           16        16.0  -1.0000   0.5368       33 00:00:01
0.866274   0.936533           32        32.0  -1.0000   1.0178       33 00:00:01
0.947102   1.027930           64        64.0  -1.0000   1.0452       33 00:00:01
1.061943   1.176785          128       128.0  -1.0000   1.2128       33 00:00:01
1.208501   1.355059          256       256.0   1.0000   0.6986       33 00:00:01
1.211329   1.214156          512       512.0  -1.0000   0.5613       33 00:00:02
1.032622   0.853914         1024      1024.0  -1.0000  -0.3099       33 00:00:04
0.886021   0.739420         2048      2048.0  -1.0000   0.2083       33 00:00:08
0.768442   0.650864         4096      4096.0  -1.0000  -0.8474       33 00:00:16

The use case behind this is the following:

8. Link functions applied before bootstrap

It seems that link functions are applied before the bootstrap code and this affects its predictions. Example:

$ ../vowpalwabbit/vw  train-sets/rcv1_small.dat --bootstrap 20
...
average loss = 0.547866

$../vowpalwabbit/vw  train-sets/rcv1_small.dat --bootstrap 20 --link=logistic
...
average loss = 1.16499

Expected result: presence of --link=logistic shouldn't change average loss value.

9. Broken save_resume

There is a good option --save_resume. It allows me to search hyperparameters for each pass separately. Unfortunately it's broken. For example:

$ ../vowpalwabbit/vw  -k -c train-sets/rcv1_small.dat -f temp.model --passes 2 --holdout_off
...
average loss = 0.285058

$ ../vowpalwabbit/vw  -k -c train-sets/rcv1_small.dat -f temp1.model --save_resume --holdout_off
...
$ ../vowpalwabbit/vw  -k -c train-sets/rcv1_small.dat -i temp1.model --save_resume --holdout_off
..
average loss = 0.286118

I heard about that problem from a stackoverflow question which I was trying to answer, but the author disappeared without providing a dataset to test on. Later I realised that this feature might be helpful for me too, so I reproduced, investigated and fixed (just for my dataset) this problem. It was months ago, so I might not recall all the details, but the problem is in g.total_weight, which is not restored in gd::save_load_online_state(). If you restore it correctly, you'll get the correct average loss after two invocations of vw.
Then you'll find one more problem - you'll get an incorrect loss in the very first avg loss output after loading from a --save_resume model. Only for the first one. That's because old_weighted_examples is not restored, or something like that. I don't remember exactly. I never fixed this second bug as I had no time for that. Still it's a problem, as for big files on pass 2 or pass 3 you'll get only one printed output (with the default -P behaviour), and it's misleading to see a broken loss there and the correct avg loss in vw's summary text.
As for the first problem - to fix it you would have to change the model file format. I didn't do that, as my dataset didn't have feature or example weights, so I just added the line g.total_weight = all.sd->weighted_examples; after the save_load_online_state() call. In such a dataset total_weight and weighted_examples are equal, and sd->weighted_examples is restored from the model file. But for the train-sets/rcv1_small.dat dataset used as an example above that's not true. Its weighted_examples == 1000 while total_weight is only 920, seemingly because features have weights. If you try my code you'll get average loss = 0.285076 for command 3, which is still incorrect. So saving this value to the model file seems to be the only robust way to fix the problem.

10. lrq in model file

Params passed with --lrq are stored in the model file. When you load it for testing, they'll be doubled if you forget to exclude your --lrq params from the command line. This duplication makes predictions completely wrong. In the case of -q there is some protection for such a case:

vw data -f model -q qq
vw -t data -i model
vw -t data -i model -q qq

Commands 2 and 3 will give you the same result.

vw data -f model --lrq qq
vw -t data -i model
vw -t data -i model --lrq qq

Here command 3 returns a wrong result, different from command 2.

11. Is parseFloat() too cautious?

I suspect that strtod() is unnecessarily called in parseFloat() in parse_primitives.h when the float ends with a newline, because of https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_primitives.h#L96 if (*p == ' ')//easy case succeeded. Perhaps it's better to use if (*p == ' ' || *p == '\n')//easy case succeeded.

I would also include '\t', as it's the default separator when you join two files with the Unix paste command.

12. Multiple regressor initialisation

In initialize_regressor() in parse_regressor.cc it's worth adding else between these blocks:

  if (all.random_weights)
  {...}
  if (all.random_positive_weights)
  {...}
  if (all.initial_weight != 0.)
  {...}

random_positive_weights is set to true automatically in --new_mf mode in mf.cc. I found that these random values ( 0.1 * frand48() ) are too big for my dataset (9000 features per example). So I initialised the regressor via --initial_weight with something like 1e-5 and got better convergence. But then the regressor is initialised twice. I would put else clauses between these 3 ifs and reorder them so that all.initial_weight comes first.

13. Outputting constant feature weight

I found it interesting and sometimes useful (to select a better initial_weight value, for example) to print out the weight of the constant feature which vw adds by default. The code is:

 #vw.h
 inline void print_constant_weight(vw& all)
 {
     float w = get_weight(all, constant, 0);
     cerr << "constant feature ";
     if (all.initial_constant != 0.)
         cerr << "(" << all.initial_constant << ") ";
     cerr << "weight = " << w << endl;
 }

 #usage
         if (all->add_constant)
            VW::print_constant_weight(*all);

I also print it out right after the regressor is initialised from a file, in case that happens. So it may be displayed twice - before the training pass and at its end.

14. Typos.

The help line for initial_weight says: --initial_weight arg Set all weights to an initial value of 1.

Shall " be to an initial value of arg." I suppose.

15. Online clustering

I believe VW needs some clustering algorithm. I made one for the case where all feature weights are equal to 1. It's not something I would push into the master branch - it's not production quality. For example, I didn't implement the save_load methods and used --passes 2 to switch into test mode on the second pass and print out results. Worst of all, my kaggle model couldn't benefit from its results, so the whole algorithm may work incorrectly. Still, I think online k-means is a must for VW and would like to start a discussion about that. I'll publish the code bits with my notes later.

II. Functionality of utl\vw-csv2bin

It seems that the script vw-csv2bin isn't flexible enough, so every time I end up using my own code written in C++ or R. To make it a bit more usable I would ask to add a few functions to it:

The last two are quite difficult to do. If someone ever wants to try, I can suggest an argument format for these parameters.

III. Few notes on vw-hypersearch

Thanks to Ariel, vw-hypersearch has become one of my favourite tools, and I wouldn't say there is much to improve here.

1.

I always use -v mode and find it very useful, especially when you've just started to work with this script and wonder how it executes commands in -t mode. Now we also have -c mode, and the -v param may become even more useful for users. Unfortunately, one won't know about its existence without looking inside the code, as it's not described in the standard help output. Could you please add it there?

2.

A few times I wished vw-hypersearch supported simple grid search. Not sure how useful that is, but it seems simple to implement. Let's just add a -g X param which switches to grid search mode with step X. If the search interval is [a,b], then the script could run vw for a, a+x, a+2x, ..., a+nx, b. If a+x > b, then just for a, b. If x is negative (it seems such a value could be passed on the command line as -g=-1 or with a few other tricks in Unix), it shall run the vw command for b, b+x, ..., b+nx, a.

3.

It would be great to print out the execution time for every vw launch at the end of the line:

trying 47 ......................................... 0.463323 [00:30:15] (best)
trying 33 ......................................... 0.469066 [00:28:56]
trying 56 ......................................... 0.462171 [00:34:15] (best)

The use case for this is searching for parameters which directly affect vw performance: --ngram, --lrq, --rank etc. Sometimes you find that you're getting better results on the validation step with an increase of rank (for example), but that increase is just not worth the performance penalty you take. This may be triggered by the --time param (off by default).

4.

The complicated one: vw-hypersearch could accept non-numeric params. It's technically not so difficult to iterate over characters in a string instead of numbers in a grid search. And there are vw params which accept characters, like --ngram, -q, --cubic. For example, passing vw-hypersearch -g abcd vw test.vw -q %f shall launch

vw test.vw -q af
vw test.vw -q bf
vw test.vw -q cf
vw test.vw -q df

5.

And the most complicated: support for multiple % in grid mode. Each % may be treated as an inner iteration loop. It makes the most practical sense in the case of characters:

vw-hypersearch -g abc vw test.vw -q %%
vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q ba
...
vw test.vw -q cb
vw test.vw -q cc

It would be great if it weren't limited in the number of inner loops and supported --cubic %%%, for example.

6.

If 5 is ever implemented, I would request a --combinations key for it to switch on this mode and generate combinations instead of permutations. That's needed as '-q ab' and '-q ba' are equal. Imagine you have N characters or numbers and 3 inner loops: i1, i2, i3. In default mode the loops are i1:{1..N}, i2:{1..N}, i3:{1..N}, and with --combinations they are i1:{1..N}, i2:{i1..N}, i3:{i2..N}. Thus vw-hypersearch -g abc vw test.vw -q %% --combinations will be just:

vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q bb
vw test.vw -q bc
vw test.vw -q cc

IV. utl\logistic

I was quite surprised to get this help output by launching

$../vowpal_wabbit/utl/logistic --help

logistic version [unknown] calling Getopt::Std::getopts (version 1.10 [paranoid]),
running under Perl version 5.20.1.

Usage: logistic [-OPTIONS [-MORE_OPTIONS]] [--] [PROGRAM_ARG1 ...]

The following single-character options are accepted:
        Boolean (without arguments): -0 -1

Options may be merged together.  -- stops processing of options.
  [Now continuing due to backward compatibility and excessive paranoia.
   See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]

It seems to be some automatically generated Perl summary which isn't informative enough, as it doesn't print the meaning of the arguments. The real help page can be obtained by launching the script without parameters:

$../vowpal_wabbit/utl/logistic
Usage: logistic [Options] [files...]
    Maps label x (1st number on line) to logistic(x)
    Options:
        -0      Use logistic(x) -> [0  .. 1] range
        -1      Use logistic(x) -> [-1 .. 1] range (default)

Could we make ./logistic --help work the same as ./logistic, to be more "unix-style"?

APPENDIX

The bits of --redefine implementation

  1. Declare

bool redefine_some;
unsigned char redefine[256];

in struct vw in global_data.h

  2. Add a redefine option in parse_args.cc somewhere near the keep/ignore param declarations. In my fork it would be: ("redefine", po::value< vector<string> >(), "redefine namespaces beginning with characters from string n... as namespace k. <arg> shall be in form 'n:=k'. Use ':=k' to redefine all namespaces.")
  3. Add the following implementation after the param declaration:
  all.redefine_some = false; // false by default

  if (vm.count("redefine"))
  {
      // all.redefine[i] will keep the new namespace literal for the i-th old namespace;
      // by default it's i itself.
      for (size_t i = 0; i < 256; i++)
          all.redefine[i] = i;

      // note that declaration order matters here, so
      // --redefine :=L --redefine ab:=M --ignore L will ignore all except a and b under M
      vector< string > redefinitions = vm["redefine"].as< vector< string > >();
      for (vector<string>::iterator par = redefinitions.begin(); par != redefinitions.end(); par++)
      {
          string arg = *par;
          int operator_pos = -100;
          bool operator_found = false;
          unsigned char new_namespace = ' ';

          // let's find the operator ':=' and the new namespace after it
          for (size_t i = 0; i < arg.length(); i++)
          {
              if (arg[i] == ':')
                  operator_pos = i;
              else if ( (arg[i] == '=') && (operator_pos == (int)i-1) )
                  operator_found = true;
              else if (operator_found)
              {
                  new_namespace = arg[i];
                  break;
              }
          }

          if (!operator_found || new_namespace == ' ')
          {
              cerr << "argument of --redefine is malformed. Valid format is n:=k" << endl;
              throw exception();
          }

          all.redefine_some = true; // at least one valid redefinition rule was found

          if (operator_pos != 0)
              for (size_t i = 0; i < (size_t) operator_pos; i++)
                  all.redefine[(size_t)arg[i]] = new_namespace; // namespaces from the substring before the operator are redefined
          else
              for (size_t i = 0; i < 256; i++)
                  all.redefine[i] = new_namespace; // substring is empty - all namespaces must be redefined (case of ':=L')
      }
  }
  4. In nameSpaceInfo() in parse_example.cc, after index = (unsigned char)(*reading_head); add this line: if (all->redefine_some) index = all->redefine[index];
  5. There is no need to do anything for the case of reading from a cache file, as namespaces stored there will already be redefined. On the other hand, this makes it impossible to use a cache file in an automatic feature search with --redefine.
arielf commented 9 years ago

Hi Alex,

This is very useful to read, and thanks for all the great work.

You may get more success in getting progress on some/any of the above by separating it to separate smaller issues. Even more so by sending patches/pull-requests. For example, the fix for --save_resume has been seen by others but never fully understood. It would be great to get it in a separate pull-request.

Thanks.

trufanov-nok commented 9 years ago

Hi Ariel, there are only a few bullet points I can address without John's blessing. For example, fixing --save_resume requires changes in the model file format, and perhaps that's so undesirable that John would prefer not to fix it at all. Also, I can patch only VW code, not its scripts. Thus I would like to wait for a green light on each of the items listed above before doing anything, to avoid unnecessary work. And I would prefer to discuss all of that in one thread, referring to proposals by their numbers.

JohnLangford commented 9 years ago

(1) seems to be syntactic sugar, right? It provides no unduplicated functionality, but it does make feature manipulations easier.

For (2) I've used this myself and it is quite common for it to be beneficial. The implementation is crucial though. Basically, you need to detect whether it's a simple case in the core foreach_feature() template and then use a different traversal process.

For (3), I'm unclear on whether this is desirable computationally. A long time ago (things may be different now), I timed explicitly generating quadratics vs. generating them on the fly and found them somewhat faster to generate on the fly. Actually caching the quadratics features would be even worse, because you would blow up the size of the cachefile substantially.

For (4), I like the idea of only having one flag for the outer product. However, I'm not sure that supporting higher-order interactions than cubic is worthwhile, because the space explodes so quickly. I have only seen higher than 3rd order terms be useful in a greedy search context, as per the polynomial learning.

For (5), this seems nice for really big files.

For (6), I believe that --print almost does what you want already. It's a trivial modification, and seems worthwhile if you want to do it.

For (7), I have no real opinion about whether or not --time would be helpful.

For (8), bootstrap is problematic at its current location in the reduction order. The fundamental problem is that there are at least two uses of bootstrap. One use is creating a diversity of predictors, then taking the average to get robustness and possibly improve performance.
Another use case comes up with multiclass predictions, where averaging makes no sense, but you would prefer to predict according to a plurality. What should be done? Probably these two kinds of bootstrap should invoke different setup functions which appear at different places in the reduction stack.

For (9), fixing save_resume seems desirable.

For (10), perhaps Paul (cced) should comment.

11,12,14, already merged.

For (13), I don't understand the usage mode/desirability over --audit?

For (15), what is the usage mode you have in mind for online k-means?
Is it just that this is expected?

utl\logistic should probably go away, because it's supported directly by vw.

Looking through things, my priorities would be: Bugfixes: 2, 8, 9. Plausibly useful: 1, 5, 6.

For the others, please respond to comments.

-John

On 02/10/2015 06:14 AM, Alexander Trufanov wrote:

Hi,

Below is my feedback for last 3 moths of VW usage. It's mostly feature requests, some ideas and only few non-critical bugs.

I. VW

I spent last 3 months competing in kaggle's avazu http://www.kaggle.com/c/avazu-ctr-prediction contest using only VW and end up 31st on private leaderboard. I suspect many VW users got better result, but I used only one model, no ensembles and was short of RAM, so it seems to be max what I can do. I've used a fork of VW made 3 months ago and implemented dozens of heuristics in it. Some of them may be interesting and few definitely worth discussion. On the other hand I realised that several bugs I found (mostly with new_mf) were already fixed and in the main branch recently so I just wasted a time with them thus I paid my price for hiding info till competition ends.

1. namspaces redefinition

The first thing I have implemented in my VW fork is /--redefine/ parameter which allows me to rename any namespace or set of namespaces. I gave a new namespaces to each feature of original dataset taking their names were from [a-Z] interval. Then I declared a parameter /--redefine/ which takes string argument in form of /x:=n/ where /x/ is a list of old namespaces, /:=/ is an operator and /n/ is a new namespace to which they should be assigned. That allowed me to put something like following in my vw command line:

|--redefine abcdefgihjklmnoprstquwz:=L --redefine abcdegf:=a --ignore L --q aa|

In example above we assign all our namespaces to namespace /L/ and then (order is matter) reassign some of them to namespace /a/. As all features got their own namespaces the namespace /a/ will contain 7 features and /L/ 23-7=16. Then namespace a is used to generate quadratic features.

Why it's convenient? I can now experiment with different features used for quadratic features generation by just addition or removal literals in |--redefine abcdegf:=a|. On practice I wrote a bash script which started with |abcdefgihjklmnoprstquwz:=a| as a baseline and used leave-one-out with validation to find features which removal could improve the result.

This parameter also allows you to do something like

|--redefine abc:=a --ignore a --redefine def:=b --redefine klmn:=c -q bc|

All namespaces which were not redefined will be used as before - as ordinary feature.

Another thing why I'm so enthusiastic about this function is because it's very easy to implement without any performance penalties. My implementation is described in appendix.

2. Simple combinations in -q and --cubic

Sometimes kaggle's dataset contains hashed data to hide sensitive info. Moreover - sometimes the column names are meaningless too and we can only guess what categorical feature it represents. In such circumstances you may often found something like |-q aa| or |--cubic aaa| in vw models. That's because there are too many features to specify |-q| for each pair and better to put all features in one namespace which combined with itself. And there is no any insight on how that namespace may be split to subsets to improve the score. It might be done by just trying all combinations with a script but it's easier to play with including\excluding features in namespace keeping |-q aa| than generating dozens |-q| for namespaces holding single feature. And we can rely on GD + regularization to get rid of useless pairs.

What I'm trying to persuade you is that I'm not stupid and sometimes |-q xx| and |--cubic xxx| (with equal x) params are used. I have even seen such command lines in some discussions and blogs.

Now, assuming |-q aa|-like params are used, and perhaps widely, we can recall that |-q ab| and |-q ba| (for namespaces with single features) are equal, as are all permutations of /abc/ for |--cubic|. Well, technically /ab/ and /ba/ will generate features with different addresses in the regressor because of /quadratic_constant/, but if applied to all examples in the dataset, the effect on the avg loss shall be equal.

For example, for |-q aa| where /a/ contains features /x/ and /y/ we'll get all permutations: /xx/, /xy/, /yx/, /yy/. Although it's obvious that only simple combinations without repetitions make sense: /xy/. (/yx/ will be equal to it, and /xx/, /yy/ are equal to just /x/ and just /y/.)

If we consider |-q aa --cubic aaa| for a namespace /a/ with 30 features in it, we'll get |30 + (-q) 30^2 + (--cubic) 30^3 = 27930| features per example. And for simple combinations without repetitions we should get |30 + (-q) 30!/2!/28! + (--cubic) 30!/3!/27! = 30 + 29*30/2 + 28*29*30/6 = 4525| features per example.

Usage of simple combinations instead of permutations, in the case where the interacting namespaces are the same, significantly increased the training speed of my model (because of fewer features per example), the speed of convergence of GD (fewer passes needed) and (if I recall right) the final model result too. Thus I believe simple-combinations mode shall be enabled by default.

My current implementation is quite messy. Some ideas regarding better implementation will be discussed further.

3. Caching generated features

Currently, generation of quadratic and cubic features is done in the |GD::FOREACH_FEATURE| template and they aren't stored anywhere. In kaggle's contests, datasets usually have dozens of columns and sometimes hundreds of them (the Africa soil prediction competition had ~1200 examples with ~3500 features each). Thus successful models usually deal with thousands of features, and most of them are generated on the fly in this template method. It's known that even in the simple gd algorithm |FOREACH_FEATURE| is called several times. I have implemented simple caching for generated features by just storing them in a new namespace with a non-printable name (0x07, for example). The cached values are used if such a namespace exists, or generated and stored in it otherwise.

There may be another approach - creation of a new v_array for such features - but it's not so convenient because of I.5. I haven't measured the performance impact. It shouldn't be big, if any, except for some extreme datasets.

4. Features with higher orders of interaction

Thanks to I.2, my laptop's performance allowed me to try features with more than 3 orders of interaction (cubic). It didn't improve my model (in I.5 I found that only a few high-order combinations may improve it), but it still was an interesting attempt. Keeping in mind I.2 and I.3, I would rewrite the |FOREACH_FEATURE| template entirely. What if we put the feature generation code out of |GD::| and keep it as a separate module, as some other reductions call this method too? We can also replace the |-q| and |--cubic| parameters with only one (let it be |-q|) and detect the order of feature interaction by the length of the passed argument. So |--cubic abc| becomes |-q abc| (the old --cubic may still be supported for compatibility). And my command line could be, for example, |vw test.vw -q ab -q abc -q cde -q abbb|.

Loops over namespaces for feature generation may be done with recursive calls. Literals in |-q| arguments may be sorted in ascending order at the param-processing step. In this case, if there are letters which are repeated several times in an argument (like |-q abcb|), they'll be handled in adjacent calls. So in the recursive function we can just check whether the next namespace is the same as the previous one and vw is in simple-combinations mode, then pass a boolean indicating that the loop shall start from |i=index+1| instead of |i=1|.

I would also implement caching in it and make a template for /audit_features/ to generate and cache features in audit mode.

5. Regressor audit trick

Once I decided to audit my model whose training had already consumed 5 of the 8 GB of RAM available. Thus I couldn't do that without decreasing -b, due to lack of RAM. So I added a new mode to vw to get the values of my regressor together with feature names. It works in the following way:

  • Run vw in a train mode as usual and save regressor to a file.
  • Run vw in the new audit mode with |-t|, the same training file (a cache can't be used), and initialise the regressor with the saved values (|-i|). An audit file shall also be specified to save results.
  • VW reads examples from the train file it has already seen and generates the same feature hashes, including the |-q --cubic| ones. I wrote another function for the |FOREACH_FEATURE| template which takes a feature address and checks whether its weight in the regressor equals a magic number. If it doesn't - this weight is written to the output file along with the feature name, and then the weight is replaced with the magic number. If the weight equals the magic number - this feature was already printed out and shall be ignored. As we are using the same training file, I can be sure that all non-null features in the regressor will sooner or later be encountered. And by using a magic number (/MAX_FLOAT/, as I remember) I make sure each feature weight is printed out only once.

In this manner you can process a regressor of any size and get its weights with the corresponding feature names.

This approach requires adding support for audit_data to |FOREACH_FEATURE|. It's as fast as test mode and takes just a few MB more memory (feature names in audit data).

It may also have an /early terminate/ mode, as we know a priori the number of non-empty features (from |save_load_regressor()|) and can decrease this number every time a feature gets the magic number instead of a weight. Work can be halted after this counter reaches 0. I didn't implement this enhancement.

Below are some optional details:

In practice I applied this to a regressor obtained by training on a 3.5mln-example dataset with ~9k features per example (made with |-q| and |--cubic|). I don't remember how many examples I processed, but eventually I halted VW as the output file grew to 10Gb and I realised I couldn't process a bigger file in R anyway. It also needs to be said that my feature names were in the format namespace_value, e.g. |device_id_eca8120f| for namespace |device_id|. Generated cubic features had names like |site_id_1984e20a^device_id_eca8120f^banner_pos_0|. Thus I was able to cut the value parts out of them with the unix /sed/ command. Then I loaded the resulting pairs into R and got a much smaller factor of namespaces. Then I analysed the means of namespace weights and the number of their occurrences to remove observations which were too rare. This allowed me to identify several synthetic namespaces (generated with --cubic) which had significantly bigger avg weight, and I could improve my model by generating features with 4-5-6 orders of interaction based on them. I didn't use any |FOREACH_FEATURE| modification for that. I just added these synthetic features to the dataset as new features, and |--cubic| added 2 additional levels of interaction to them.

6. Using approach above to get input file for libsvm?

Just a thought: I considered making a vw mode which allows generation of an input file for libsvm. I didn't do it for real.

Vanilla LibSVM has a similar format but doesn't support categorical features in the input file. As I understood, there are tools for LibSVM which can convert categorical data to a numeric format using hashing or perhaps dictionaries (I didn't check how exactly it is done). But I wanted to try features generated by |-q| and |--cubic| too, and it looks like there is no equivalent functionality in libsvm. What I was thinking of doing is to make a vw mode like the one described in I.5, but with only one step - vw reads an example, generates the |-q| and |--cubic| features and stores the hashes of these features in an output file in libSVM format. No learning is done. That could be made just by writing a new function for the FOREACH_FEATURE template. But I didn't give it a try, as according to my estimations my 5gb input file would become 500gb with all cubic features inside. So it's just an idea; perhaps someone will find it more useful than I did.

7. Time spent in VW's output

Like III.3, I would request printing the time that each printed iteration took since the last printed value. Let's say a |--time| switch which is off by default, to not break all the 'make test' cases. It may look like this:

Reading datafile = train_freq.vw
num sources = 1
average    since         example     example  current  current  current     time
loss       last          counter      weight    label  predict features    spent
0.775936   0.775936            1         1.0  -1.0000   0.1592       33 00:00:00
0.728623   0.681310            2         2.0  -1.0000  -0.0238       33 00:00:00
0.708393   0.688163            4         4.0  -1.0000  -0.0018       33 00:00:00
0.737389   0.766386            8         8.0  -1.0000   0.2023       33 00:00:01
0.796014   0.854638           16        16.0  -1.0000   0.5368       33 00:00:01
0.866274   0.936533           32        32.0  -1.0000   1.0178       33 00:00:01
0.947102   1.027930           64        64.0  -1.0000   1.0452       33 00:00:01
1.061943   1.176785          128       128.0  -1.0000   1.2128       33 00:00:01
1.208501   1.355059          256       256.0   1.0000   0.6986       33 00:00:01
1.211329   1.214156          512       512.0  -1.0000   0.5613       33 00:00:02
1.032622   0.853914         1024      1024.0  -1.0000  -0.3099       33 00:00:04
0.886021   0.739420         2048      2048.0  -1.0000   0.2083       33 00:00:08
0.768442   0.650864         4096      4096.0  -1.0000  -0.8474       33 00:00:16

The usecase behind this is following:

  • You may observe performance changes as a hyperparameter value changes (for example |--ngram|, or |-q ab| in comparison with |-q ac|) and terminate early if you realise the training speed is becoming unacceptable. In general, the same may be achieved by launching |time vw test.vw --examples N|, but it's not so convenient and requires guessing a suitable N.
  • Some training modes may require non-linear time for their work. At least |--ksvm| seems to slow down non-linearly as the support vector count increases. Well, I could never make svm work, so I may be wrong here.
8. Bug in --bootstrap

It seems that link functions are applied before the bootstrap code, and this affects its predictions. Example:

$ ../vowpalwabbit/vw train-sets/rcv1_small.dat --bootstrap 20
...
average loss = 0.547866

$ ../vowpalwabbit/vw train-sets/rcv1_small.dat --bootstrap 20 --link=logistic
...
average loss = 1.16499

Expected result: the presence of |--link=logistic| shouldn't change the average loss value.

9. Broken save_resume

There is a good option, |--save_resume|. It allows me to search hyperparameters for each pass separately. Unfortunately, it's broken. For example:

$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -f temp.model --passes 2 --holdout_off
...
average loss = 0.285058

$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -f temp1.model --save_resume --holdout_off
...
$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -i temp1.model --save_resume --holdout_off
...
average loss = 0.286118

I heard about that problem from a stackoverflow question which I was trying to answer, but the author disappeared without providing a dataset to test on. Later I realised that this feature might be helpful for me too, so I reproduced, investigated and fixed (just for my dataset) this problem. It was months ago, so I might not recall all the details, but the problem is in |g.total_weight|, which is not restored in |gd::save_load_online_state()|. If you restore it correctly, you'll get the correct average loss after the two invocations of vw.

Then you'll find one more problem - you'll get an incorrect loss in the very first avg loss output after loading from a |--save_resume| model. Only for the first one. That's because |old_weighted_examples| is not restored, or something like that; I don't remember exactly. I never fixed this second bug, as I had no time for that. Still, it's a problem: for big files on pass 2 or pass 3 you'll get only one printed output (with the default |-P| behaviour), and it's misleading to see a broken loss there and the correct avg loss in vw's summary text.

As for the 1st problem - to fix it you shall change the model file. I didn't do that, as my dataset didn't have feature or example weights, so I just added the line |g.total_weight = all.sd->weighted_examples;| after the |save_load_online_state()| call. In such a dataset |total_weight| and |weighted_examples| are equal, and |sd->weighted_examples| is restored from the model file. But for the |train-sets/rcv1_small.dat| dataset used in the example above that's not true. Its |weighted_examples == 1000| while |total_weight| is only 920, seemingly because features have weights. If you try my code you'll get |average loss = 0.285076| for command 3, which is still incorrect. So saving this value to the model file seems to be the only robust way to fix the problem.

10. lrq in model file

Params passed with |--lrq| are stored in the model file. When you load it for testing, they'll be doubled if you forget to exclude your |--lrq| params from the command line. This duplication makes predictions completely wrong. In the case of |-q| there is some fool protection for such a case:

vw data -f model -q qq
vw -t data -i model
vw -t data -i model -q qq

Commands 2 and 3 will give you the same result.

vw data -f model --lrq qq
vw -t data -i model
vw -t data -i model --lrq qq

Here command 3 returns a wrong result, different from command 2.

11. Is parseFloat() too cautious?

I suspect that |strtod()| is unnecessarily called in |parseFloat()| in |parse_primitives.h| when the float ends with a newline, because of https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_primitives.h#L96 : |if (*p == ' ')//easy case succeeded.| Perhaps it's better to use |if (*p == ' ' || *p == '\n')//easy case succeeded.|

I would also include |'\t'|, as it's the default separator when you join two files with the Unix /paste/ command.

12. Multiple regressor initialisation

In |initialize_regressor()| in |parse_regressor.cc| it's worth adding an else between these blocks:

if (all.random_weights) {...}
if (all.random_positive_weights) {...}
if (all.initial_weight != 0.) {...}

|random_positive_weights| is set to true automatically in |--new_mf| mode in |mf.cc|. I found that these random values (|0.1 * frand48()|) are too big for my dataset (9000 features per example). So I initialised the regressor with |--initial_weight| set to something like 1e-5 and got better convergence. But now the regressor is initialised twice. I would put /else/ clauses between these 3 /if/'s and reorder them so |all.initial_weight| comes first.

13. Outputting constant feature weight

I found it interesting and sometimes useful (to select a better |initial_weight| value, for example) to print out the weight of the constant feature which vw adds by default. The code is:

    // vw.h
    inline void print_constant_weight(vw& all)
    {
        float w = get_weight(all, constant, 0);
        cerr << "constant feature ";
        if (all.initial_constant != 0.)
            cerr << "(" << all.initial_constant << ") ";
        cerr << "weight = " << w << endl;
    }

usage:

      if (all->add_constant)
         VW::print_constant_weight(*all);

I also print it out just after the regressor is initialised from a file, in case that happens. So it may be displayed twice - before the training pass and at its end.

14. Typos.

The help line for initial_weight says: |--initial_weight arg Set all weights to an initial value of 1.|

It shall be "Set all weights to an initial value of arg.", I suppose.

15. Online clustering

I believe VW needs some clustering algorithm. I made one for the case where all feature weights are equal to 1. It's not something I would push into the master branch - it's not production quality. For example, I didn't implement the |save_load| methods and used |--passes 2| to switch into test mode on the second pass and print out results. Worst of all, my kaggle model couldn't benefit from its results, so the whole algorithm may work incorrectly. Still, I think online k-means is a must for VW, and I would like to start a discussion about that. I'll publish the code bits with my notes later.

II. Functionality of utl/vw-csv2bin

It seems that the vw-csv2bin script isn't flexible enough, so every time I end up using my own code written in C++ or R. To make it a bit more usable I would ask to add a few functions to it:

  • A param to skip the header line which may be present in the input file.
  • A param to identify the column with the tag value, and code to correctly process it.
  • A param that allows ignoring some columns. It would be better to have 2: |--keep| and |--ignore|, like in vw. And it would be great to have support for intervals, like |--ignore 12:50| - use all columns except those from 12 to 50. Of course, this will require accepting several values, like |--ngram 1 --ngram 6q| in vw.
  • A param to specify namespaces for columns. This is the trickiest one.
  • Something that allows converting numerical data into categorical by adding a suffix to the value.

The last two are quite difficult to do. If someone ever wants to try, I can suggest an argument format for these parameters.

III. A few notes on vw-hypersearch

Thanks to Ariel, vw-hypersearch became one of my favourite tools, and I wouldn't say there is much to improve here.

1.

I always use |-v| mode and found it very useful, especially when you've just started to work with this script and wonder how it executes commands in |-t| mode. Now we also have |-c| mode, and the |-v| param may become even more useful for users. Unfortunately, one won't know about its existence until looking inside the code, as it's not described in the standard help output. Could you please add it there?

2.

A few times I wished vw-hypersearch supported simple grid search. Not sure how useful that is, but it seems simple to implement. Let's just add a |-g X| param which switches to grid-search mode with step /X/. If the search interval is /[a,b]/ then the script could run vw for /a/, /a+x/, /a+2x/, ..., /a+nx/, /b/. If /a+x > b/, then just for /a/, /b/. If /x/ is negative (it seems such a value could be passed on the command line as |-g=-1| or with a few other tricks in Unix), it shall run the vw command for /b/, /b+x/, ..., /b+nx/, /a/.

3.

It would be great to print out the execution time of every vw launch at the end of the line:

trying 47 ......................................... 0.463323 [00:30:15]
trying 33 ......................................... 0.469066 [00:28:56]
trying 56 ......................................... 0.462171 [00:34:15]

The usecase for that is searching parameters which directly affect vw performance: |--ngram|, |--lrq|, |--rank| etc. Sometimes you may find that you are getting better results on the validation step with an increase of rank (for example), but that increase just isn't worth the performance penalty you take. This may be triggered by the |--time| param (off by default).

4.

The complicated one: vw-hypersearch could accept non-numeric params. It's technically not so difficult to iterate over characters in a string instead of numbers in a grid search. And there are vw params which accept characters, like |--ngram|, |-q|, |--cubic|. For example, passing |vw-hypersearch -g abcd vw test.vw -q %f| shall launch:

vw test.vw -q af
vw test.vw -q bf
vw test.vw -q cf
vw test.vw -q df
5.

And the most complicated: support of multiple /%/ in grid mode. Each /%/ may be treated as an inner iteration loop. It makes the most practical sense in the case of characters:

vw-hypersearch -g abc vw test.vw -q %%

vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q ba
...
vw test.vw -q cb
vw test.vw -q cc

It would be great if it weren't limited by the number of inner loops and supported |--cubic %%%|, for example.

6.

If 5 is ever implemented, I would request a |--combinations| key for it to switch on a mode that generates combinations instead of permutations. That's needed, as '-q ab' and '-q ba' are equal. Imagine you have N characters or numbers and 3 inner loops: /i1/, /i2/, /i3/. In default mode the loops are /i1:{1..N}/, /i2:{1..N}/, /i3:{1..N}/, and in combinations mode they are /i1:{1..N}/, /i2:{i1..N}/, /i3:{i2..N}/. Thus |vw-hypersearch -g abc vw test.vw -q %% --combinations| will try just:

vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q bb
vw test.vw -q bc
vw test.vw -q cc

IV. utl\logistic

I was quite surprised to get this help output by launching:

$ ../vowpal_wabbit/utl/logistic --help

logistic version [unknown] calling Getopt::Std::getopts (version 1.10 [paranoid]),
running under Perl version 5.20.1.

Usage: logistic [-OPTIONS [-MORE_OPTIONS]] [--] [PROGRAM_ARG1 ...]

The following single-character options are accepted:
        Boolean (without arguments): -0 -1

Options may be merged together.  -- stops processing of options.
[Now continuing due to backward compatibility and excessive paranoia.
 See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]

It seems to be an automatically generated Perl summary which isn't informative enough, as it doesn't print the meaning of the arguments. The real help page can be obtained by launching the script without parameters:

$ ../vowpal_wabbit/utl/logistic
Usage: logistic [Options] [files...]
    Maps label x (1st number on line) to logistic(x)
    Options:
        -0      Use logistic(x) -> [0 .. 1] range
        -1      Use logistic(x) -> [-1 .. 1] range (default)

Could we make |./logistic --help| work like |./logistic|, to be more "unix-style"?

APPENDIX

The bits of |--redefine| implementation

  1. Declare

        bool redefine_some;
        unsigned char redefine[256];

     in |struct vw| in |global_data.h|.

  2. Add the redefine option in |parse_args.cc| somewhere near the keep/ignore params declaration. In my fork it would be: |("redefine", po::value< vector<string> >(), "redefine namespaces beginning with characters from string n... as namespace k. <arg> shall be in form 'n:=k'. Use ':=k' to redefine all namespaces.")|

  3. Add the following implementation after the param declaration:

    all.redefine_some = false; // false by default

    if (vm.count("redefine"))
    {
        // all.redefine[i] keeps the new namespace literal for the i-th old
        // namespace; by default it's i itself.
        for (size_t i = 0; i < 256; i++)
            all.redefine[i] = i;

        // note that declaration order matters here, so
        // --redefine :=L --redefine ab:=M --ignore L
        // will ignore everything except a and b (which live under M)
        vector<string> redefinitions = vm["redefine"].as< vector<string> >();
        for (vector<string>::iterator par = redefinitions.begin(); par != redefinitions.end(); par++)
        {
            string arg = *par;
            int operator_pos = -100;
            bool operator_found = false;
            unsigned char new_namespace = ' ';

            // find the ':=' operator and the new namespace after it
            for (size_t i = 0; i < arg.length(); i++)
            {
                if (arg[i] == ':')
                    operator_pos = i;
                else if ((arg[i] == '=') && (operator_pos == (int)i - 1))
                    operator_found = true;
                else if (operator_found)
                {
                    new_namespace = arg[i];
                    break;
                }
            }

            if (!operator_found || new_namespace == ' ')
            {
                cerr << "argument of --redefine is malformed. Valid format is n:=k" << endl;
                throw exception();
            }

            all.redefine_some = true; // at least one valid redefinition rule was found

            if (operator_pos != 0)
                for (size_t i = 0; i < (size_t)operator_pos; i++)
                    all.redefine[(size_t)arg[i]] = new_namespace; // namespaces from the substring before the operator are redefined
            else
                for (size_t i = 0; i < 256; i++)
                    all.redefine[i] = new_namespace; // empty substring - all namespaces are redefined (the ':=L' case)
        }
    }

  4. In |nameSpaceInfo()| in |parse_example.cc|, after

        index = (unsigned char)(*reading_head);

     add this line:

        if (all->redefine_some) index = all->redefine[index];

  5. There is no need to do anything for the case of reading from a cache file, as the namespaces stored there will already be redefined. On the other hand, this makes it impossible to use a cache file in an automatic feature search with |--redefine|.


trufanov-nok commented 9 years ago

Ok, prioritised ones first: (1) [redefine tag] Are you ok with the param name and operator? It could be --rename_ns or whatever instead of redefine. As for the operator (":=") - it could be any 2 chars except for '>', '|', '<', '#'.

(2) [Simple combinations in -q and --cubic] I would add a new bool for this mode in the vw class, turn it on by default and add a parameter (--combinations 0, --simple_combinations 0 or a flag --permutations) to switch it off if someone ever needs that. If you are ok, please choose the param name. Also, I agree that implementation matters. This point is linked with (3), (4) and (5), and I'll try to advocate for them first.

(8) [bootstrap problem]

I tried to fix this today, along with the 3 pull requests sent today, but realised its complexity. I can't refactor the bootstrap usage - I have no comprehensive vision of the vw architecture, plus the current reduction setup wasn't implemented in my old fork. But I thought up 2 dumb ways to fix it. a. The bad one - detect usage of a link function and reverse its effect before averaging predictions in the bootstrap code. b. The better one - keep a reference to the link function somewhere in shared data and store the original prediction in the example class just before the link is applied to it. In bs we can use these unprocessed predictions and apply the right link function at the end. I think the second one might be a temporary solution.

(9) save_resume problem.

The tough place here is model file compatibility. There are three ways to fix it:

  1. Worst: partial fix - assume g.total_weight is equal to all.sd->weighted_examples and fix it for datasets without weights only.
  2. Right: add a few new fields to the model file header and try to address compatibility by checking its version or whatever (I didn't check what mechanisms are implemented for that yet).
  3. Cautious: pass any new required data as a string, like the -q and --ngram parameters stored in the model. Declare a new param just to accept these values, which will be undocumented in vw's help. That will keep full binary backward compatibility with old model files for sure. But old vw versions still won't be forward compatible with new models, and vw doesn't tolerate unknown parameters.

I would prefer 2. Perhaps you'll find a reason to prefer 3 or 1. I actually don't know how critical keeping compatibility with old model files is for vw. Perhaps someone uses them for exchange?

(5) [regressor audit]

Can't recall how I named the param for that. How about --audit_regressor arg, where arg is a filename to store the audit output?

(6) [libsvm ] Will think about it after (5)

NOT PRIORITIZED ONES:

(4) [higher-order interactions]

It explodes, but not so fast in the case of simple combinations, or when each feature has its own namespace. Working with the avazu dataset (35mln records) I found it bearable to have up to 15k features per example. Of course, it's not what you want at the exploration stage, but for final models it's ok. It's common for kaggle that winning models take a day or two to train. In the case of simple combinations with 30 features in the dataset we'll get the following growth:

pow: growth by
1: 30 - total features number
2: 30!/((30-2)!*2!) = 29*30/2 = 435
3: 30!/((30-3)!*3!) = 28*29*30/3! = 4060
4: 30!/((30-4)!*4!) = 27*(28*29*30)/(3!*4)   # 27/4 times bigger # = 27405
5: 30!/((30-5)!*5!) = 26*27*(28*29*30)/(3!*4*5)   # 26/5 times bigger # = 142506
...

Quite big numbers if you sum them up. (Still, I used pow 3 + pow 4-5 for some selected features.) At step k the count grows by a factor of (N-k+1)/k = (N+1)/k - 1, where N is the number of original features. So, as N is fixed, the growth rate decreases and eventually becomes less than 1. Let's check a smaller feature space:

for 10 features
pow: growth by
1:  10
2:  45
3:  120
4:  210
5:  252
6:  210
7:  120
8:  45
9:  10
10: 1
sum: 1023

1023 features is something I can survive easily. So for a dataset with 10 features I can work with all possible simple combinations. I checked how big the number of features can be so that all possible simple combinations fit in 10-15k. It's 13 (8191 features total) to 14 (16383 ftrs). And if we cap the max power of simple combinations at 5, it's 17 (9401 ftrs) to 18 (12615 ftrs).

In practice, if you have 30 features you can split them into two namespaces (let them be a and b, 15 features in each) and use --cubic aab for all of them (~4000 features in simple-combinations mode, I suppose) + -q aaaa -q aaaaa (+4363 features with order 4-5 interactions).

Of course, all my evaluations are about the technical possibility of working with such feature spaces on my laptop, and that doesn't mean your model's generalisation will improve. In practice I found convergence drops after order-3 interactions, but it could be improved with higher orders of interaction applied to a selected subset of features. This makes the trick described in the previous paragraph (splitting a namespace into a part for higher degrees and a part for cubic only) not just "technically feasible" but also the most useful usecase.

(3). [Caching generated features]

I think you've missed my point on this one. By caching I meant caching generated features in memory for one example, not caching them for all examples and storing them in a cache file for further use. As I have already shown, -q aa --cubic aaa will generate ~4500 additional features for 30 features in a namespace of every example ec. And they'll be generated in the FOREACH_FEATURE template function in a loop. I know that this template function is called many times for each example in the training phase (to calculate the prediction, to calculate the update per feature, to update features etc.), and every time the loop in FOREACH_FEATURE makes ~4000 iterations. The idea was to cache these generated features in a new namespace in the ec object during the first FOREACH_FEATURE call and use them in further calls. They'll be deleted with the example ec from memory, so it won't consume much more memory than usual. They shouldn't be written to a cache file.

(7) [--time]

Let's come back to this one later as its priority is low.

(15) [k-means]

It should work as a usual clusterisation algorithm, perhaps outputting the closest cluster and the distance to it. My intention was to use its model on both train and test datasets to get new features for them and try to push these new features into the existing gd model.

(13) [constant feature weight output]

If you mean my --audit_regressor, then it saves too much data to a file. Of course, the constant will be somewhere close to the head of the file, as it's present in every example. But it would still require some additional effort to get it out. I believe it shall be something like the best constant output - a value which is nice to know once and forget just after that. But it's always printed somewhere, and it could eventually catch your attention if it changed dramatically, indicating that something is not going as you expected. I understand that VW's output to stdout is already quite big. Perhaps we can make it optional (a --print_constant flag, like the --no_constant we already have) and disabled by default? Btw, the sample output for it in my fork would be the following:

truf@truf-laptop:~/ml/$ ./vw -t -i temp.model -q qq --cubic qqq --print_constant
creating quadratic features for pairs: qq 
creating cubic features for triples: qqq
ignoring namespaces beginning with: L 
only testing
Num weight bits = 28
learning rate = 0.02
initial_t = 1
power_t = 0.5
constant feature weight = 0.000208689
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.088176   0.088176            1         1.0  -1.0000  -2.3840     9164
0.481576   0.874977            2         2.0   1.0000  -0.3356     9164
0.836905   1.192233            4         4.0  -1.0000  -0.4495     9164
0.685445   0.533984            8         8.0  -1.0000  -5.1782     9164
0.600394   0.515343           16        16.0  -1.0000  -0.0491     9164
0.529447   0.458501           32        32.0  -1.0000  -3.3484     9164
0.474734   0.420021           64        64.0  -1.0000  -0.7701     9164
0.440379   0.406023          128       128.0  -1.0000  -1.1180     9164
0.469065   0.497752          256       256.0  -1.0000  -0.9130     9164
0.478631   0.488196          512       512.0  -1.0000  -1.8638     9164
0.468965   0.459299         1024      1024.0  -1.0000  -0.4092     9164
0.464723   0.460481         2048      2048.0   1.0000  -2.1313     9164
0.446528   0.428333         4096      4096.0  -1.0000  -0.6822     9164
0.436118   0.425708         8192      8192.0  -1.0000  -1.0585     9164
0.434248   0.432377        16384     16384.0  -1.0000  -2.6964     9164
0.414527   0.394807        32768     32768.0  -1.0000  -1.0862     9164
0.363563   0.312599        65536     65536.0  -1.0000  -2.6378     9164
0.366156   0.368748       131072    131072.0  -1.0000  -3.1954     9164
0.382256   0.398357       262144    262144.0  -1.0000  -1.6503     9164

finished run
number of examples per pass = 422516
passes used = 1
weighted example sum = 422516
weighted label sum = -278750
average loss = 0.39556
best constant = -1.5847
best constant's loss = 0.456094
total feature number = 3871936624
constant feature weight = 0.000208689

Anyway, it's low priority, so don't hesitate to reject this proposal.

JohnLangford commented 9 years ago

I think the key is that -q xx is idempotent whereas --lrq xxk is not, but --lrq xxk should be, so I can fix that.

Of course we generally want to put things in the model file so that testing uses the same settings as training, so specifying --lrq xxk at test time is discouraged.

p

On Fri, Feb 13, 2015 at 1:08 PM, John Langford jl@hunch.net wrote:

(1) seems to be syntactic sugar, right? It provides no functionality that isn't available otherwise, but it does make feature manipulations easier.

For (2) I've used this myself and it is quite common for it to be beneficial. The implementation is crucial though. Basically, you need to detect whether it's a simple case in the core foreach_feature() template and then use a different traversal process.

For (3), I'm unclear on whether this is desirable computationally. A long time ago (things may be different now), I timed explicitly generating quadratics vs. generating them on the fly and found on-the-fly generation somewhat faster. Actually caching the quadratic features would be even worse, because you would blow up the size of the cache file substantially.

For (4), I like the idea of only having one flag for the outer product. However, I'm not sure that supporting higher-order interactions than cubic is worthwhile, because the space explodes so quickly. I have only seen higher than 3rd order terms be useful in a greedy search context as per the polynomial learning.

For (5), this seems nice for really big files.

For (6), I believe that --print almost does what you want already. It's a trivial modification, and seems worthwhile if you want to do it.

For (7), I have no real opinion about whether or not --time would be helpful.

For (8), bootstrap is problematic at its current location in the reduction order. The fundamental problem is that there are at least two uses of bootstrap. One use is creating a diversity of predictors then taking the average to get robustness and possibly improve performance. Another use case comes up with multiclass predictions, where averaging makes no sense, but you would prefer to predict according to a plurality. What should be done? Probably these two kinds of bootstrap should invoke different setup functions which appear at different places in the reduction stack.

For (9), fixing save_resume seems desirable.

For (10), perhaps Paul (cced) should comment.

11,12,14, already merged.

For (13), I don't understand the usage mode/desirability over --audit?

For (15), what is the usage mode you have in mind for online k-means? Is it just that this is expected?

utl\logistic should probably go away, because it's supported directly by vw.

Looking through things my priorities would be: Bugfixes: 2, 8, 9 Plausibly useful: 1, 5, 6

For the others, please respond to comments.

-John

On 02/10/2015 06:14 AM, Alexander Trufanov wrote:

Hi,

Below is my feedback for the last 3 months of VW usage. It's mostly feature requests, some ideas, and only a few non-critical bugs.

I. VW

I spent the last 3 months competing in kaggle's avazu (http://www.kaggle.com/c/avazu-ctr-prediction) contest using only VW and ended up 31st on the private leaderboard. I suspect many VW users got a better result, but I used only one model, no ensembles, and was short of RAM, so that seems to be the max of what I could do. I used a fork of VW made 3 months ago and implemented dozens of heuristics in it. Some of them may be interesting and a few are definitely worth discussion. On the other hand, I realised that several bugs I found (mostly with new_mf) had already been fixed in the main branch recently, so I just wasted time on them; that was the price I paid for hiding the info till the competition ended.

1. namespaces redefinition

The first thing I implemented in my VW fork is the /--redefine/ parameter, which allows me to rename any namespace or set of namespaces. I gave a separate namespace to each feature of the original dataset, taking the names from the [a-Z] interval. Then I declared a parameter /--redefine/ which takes a string argument of the form /x:=n/, where /x/ is a list of old namespaces, /:=/ is an operator, and /n/ is the new namespace they should be assigned to. That allowed me to put something like the following in my vw command line:

|--redefine abcdefgihjklmnoprstquwz:=L --redefine abcdegf:=a --ignore L --q aa|

In the example above we assign all our namespaces to namespace /L/ and then (order matters) reassign some of them to namespace /a/. As all features got their own namespaces, namespace /a/ will contain 7 features and /L/ the remaining 23-7=16. Namespace a is then used to generate quadratic features.

Why is this convenient? I can now experiment with different features used for quadratic feature generation just by adding or removing literals in |--redefine abcdegf:=a|. In practice I wrote a bash script which started with |abcdefgihjklmnoprstquwz:=a| as a baseline and used leave-one-out validation to find features whose removal could improve the result.

This parameter also allows you to do something like

|--redefine abc:=a --ignore a --redefine def:=b --redefine klmn:=c -q bc|

All namespaces which were not redefined will be used as before, as ordinary features.

Another reason I'm so enthusiastic about this function is that it's very easy to implement without any performance penalty. My implementation is described in the appendix.

2. Simple combinations in -q and --cubic

Sometimes kaggle's datasets contain hashed data to hide sensitive info. Moreover, sometimes the column names are meaningless too, and we can only guess what categorical feature each represents. In such circumstances you may often find something like |-q aa| or |--cubic aaa| in vw models. That's because there are too many features to specify |-q| for each pair, and it's better to put all features into one namespace combined with itself. And there is no insight into how that namespace might be split into subsets to improve the score. That could be done by just trying all combinations with a script, but it's easier to play with including/excluding features in a namespace while keeping |-q aa| than to generate dozens of |-q| options for namespaces holding a single feature. And we can rely on GD + regularization to get rid of the useless pairs.

What I'm trying to persuade you of is that I'm not stupid and that |-q xx| and |--cubic xxx| (with equal x) params are sometimes used. I have even seen such command lines in some discussions and blogs.

Now, assuming |-q aa|-like params are used, and perhaps widely, we can recall that |-q ab| and |-q ba| (for namespaces with single features) are equal, as are all permutations of /abc/ for |--cubic|. Well, technically /ab/ and /ba/ will generate features with different addresses in the regressor because of /quadratic_constant/, but when applied to all examples in a dataset the effect on avg loss shall be equal.

For example, for |-q aa| where /a/ contains features /x/ and /y/, we'll get all permutations: /xx/, /xy/, /yx/, /yy/. It's obvious that only simple combinations without repetition make sense: /xy/ (/yx/ will be equal to it, and /xx/, /yy/ are equal to just /x/ and just /y/).

If we consider |-q aa --cubic aaa| for namespace /a/ with 30 features in it, we'll get |30 + (-q) 30^2 + (--cubic) 30^3 = 27930| features per example. For simple combinations without repetition we should get |30 + (-q) 30!/2!/28! + (--cubic) 30!/3!/27! = 30 + 29*30/2 + 28*29*30/6 = 4525| features per example.
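The arithmetic above can be checked with two throwaway helper functions (hypothetical names, just for this sketch):

```cpp
#include <cassert>

// Features per example for "-q aa --cubic aaa" over a namespace of n
// features when every ordered pair/triple is generated (current behaviour):
// n + n^2 + n^3.
long with_permutations(long n) { return n + n * n + n * n * n; }

// The same with simple combinations without repetition:
// n + C(n,2) + C(n,3).
long with_combinations(long n)
{
    return n + n * (n - 1) / 2 + n * (n - 1) * (n - 2) / 6;
}
```

For n = 30 these give 27930 and 4525, matching the counts quoted above.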

Using simple combinations instead of permutations where the interacting namespaces are the same significantly increased my model's training speed (because of the smaller number of features per example), the convergence speed of GD (fewer passes needed), and (if I recall right) the final model result too. Thus I believe simple-combinations mode should be enabled by default.

My current implementation is quite messy. Some ideas regarding better implementation will be discussed further.

3. Caching generated features

Currently the generation of quadratic and cubic features is done in the |GD::FOREACH_FEATURE| template, and they aren't stored anywhere. In kaggle contests, datasets usually have dozens of columns and sometimes hundreds (the Africa soil prediction competition had ~1200 examples with ~3500 features each). Thus successful models usually deal with thousands of features, most of them generated on the fly in this template method. It's known that even in the simple gd algorithm |FOREACH_FEATURE| is called several times. I implemented simple caching for generated features by just storing them in a new namespace with a non-printable name (0x07, for example). The cached values are used if such a namespace exists, or generated and stored in it otherwise.

Another approach would be to create a new v_array for such features, but that's not so convenient because of I.5. I haven't measured the performance impact; it shouldn't be big, if any, except for some extreme datasets.

4. Features with higher orders of interaction

Thanks to I.2, my laptop's performance allowed me to try features with more orders of interaction than 3 (cubic). It didn't improve my model (in I.5 I found that only a few high-order combinations may improve it), but it was still an interesting attempt. Keeping I.2 and I.3 in mind, I would rewrite the |FOREACH_FEATURE| template entirely. What if we moved the feature-generation code out of |GD::| and kept it as a separate module, since some other reductions call this method too? We could also replace the |-q| and |--cubic| parameters with only one (let it be |-q|) and detect the order of feature interaction from the length of the passed argument. So |--cubic abc| becomes |-q abc| (the old --cubic may still be supported for compatibility). And my command line would be, for example, |vw test.vw -q ab -q abc -q cde -q abbb|.

Loops over namespaces for feature generation may be done with recursive calls. Literals in the |-q| arguments may be sorted in ascending order during parameter processing. In this case, if some letters are repeated in an argument several times (like |-q abcb|), they'll end up in neighbouring recursive calls, so the recursive function can just check whether the next namespace is the same as the previous one and, if vw is in simple-combinations mode, pass a boolean indicating that the loop shall start from |i=index+1| instead of |i=1|.
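A minimal sketch of that recursion, counting how many interaction tuples would be emitted (the function and its signature are illustrative, not the proposed VW code):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Count the interaction tuples a recursive generator would emit for an
// argument like "aa" or "abc" (assumed sorted), given each namespace's
// feature count. In simple-combinations mode, whenever the current
// namespace letter repeats the previous one, the inner loop starts at the
// previous feature index + 1, turning permutations into combinations.
long count_tuples(const std::string& arg, std::size_t pos, std::size_t start,
                  const std::map<char, std::size_t>& ns_size,
                  bool simple_combinations)
{
    if (pos == arg.size())
        return 1;  // a complete tuple was formed

    std::size_t n = ns_size.at(arg[pos]);
    bool same_as_prev = pos > 0 && arg[pos] == arg[pos - 1];
    std::size_t first = (simple_combinations && same_as_prev) ? start : 0;

    long total = 0;
    for (std::size_t i = first; i < n; i++)
        total += count_tuples(arg, pos + 1, i + 1, ns_size, simple_combinations);
    return total;
}
```

For "aa" over a 3-feature namespace this yields 9 tuples in permutation mode and C(3,2) = 3 in simple-combinations mode.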

I would also implement caching in it and make a template for /audit_features/ to generate and cache features in audit mode.

5. Regressor audit trick

Once I decided to audit my model, whose training had already consumed 5 of the 8 available GB of RAM, so I couldn't do it without decreasing -b due to lack of RAM. So I added a new mode to vw to get the values of my regressor together with feature names. It works in the following way:

  • Run vw in a train mode as usual and save regressor to a file.
  • Run vw in new audit mode with |-t|, same training file (cache can't be used) and initialise regressor with saved value (|-i|). Also audit file shall be specified to save results.
  • VW reads examples from the train file it has already seen and generates the same feature hashes, including |-q --cubic|. I wrote another function for the |FOREACH_FEATURE| template which takes a feature address and checks whether its weight in the regressor equals a magic number. If it doesn't, the weight is written to the output file along with the feature name, and then the weight is replaced with the magic number. If the weight equals the magic number, this feature was already printed out and shall be ignored. As we are using the same training file, I can be sure that all non-null features in the regressor will sooner or later be encountered, and by using the magic number (/MAX_FLOAT/, as I remember) I make sure each feature weight is printed out only once.

In this manner you can process a regressor of any size and get its weights with the corresponding feature names.

This approach requires adding audit_data support to |FOREACH_FEATURE|. It's as fast as test mode and takes just a few MB more memory (feature names in audit data).

It also may have an /early terminate/ mode, as we know a priori the number of non-empty features (from |save_load_regressor()|) and can decrease this number every time a feature gets the magic number instead of a weight. Work can be halted once this counter reaches 0. I didn't implement this enhancement.
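The print-once trick can be sketched like this (the `audit_pass` struct is a toy model for illustration, not VW's actual data structures):

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Marker meaning "this weight slot was already written to the audit file".
const float MAGIC = FLT_MAX;

// Toy model of the audit pass: each feature visit emits (name, weight) only
// the first time that regressor slot is seen, then overwrites the slot with
// MAGIC so repeated visits to the same hash are skipped.
struct audit_pass
{
    std::vector<float> weights;  // stands in for the loaded regressor
    std::vector<std::pair<std::string, float> > printed;

    void visit(std::size_t index, const std::string& name)
    {
        if (weights[index] == MAGIC)
            return;  // this weight was printed on an earlier visit
        printed.push_back(std::make_pair(name, weights[index]));
        weights[index] = MAGIC;
    }
};
```

Visiting the same feature hash twice produces a single output line, which is what makes a second pass over the full training file safe.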

Below are some optional details:

In practice I applied this to a regressor obtained by training on a 3.5-mln-example dataset with ~9k features per example (made with |-q| and |--cubic|). I don't remember how many examples I processed, but eventually I halted VW as the output file grew to 10GB and I realised I couldn't process a bigger file in R anyway. It also needs to be said that my feature names were in the format namespace_value, e.g. |device_id_eca8120f| for namespace |device_id|. Generated cubic features had names like |site_id_1984e20a^device_id_eca8120f^banner_pos_0|, so I was able to cut the value parts out of them with the unix /sed/ command. Then I loaded the resulting name/weight pairs into R and got a much smaller factor of namespaces. Then I analysed the means of the namespace weights and the numbers of their occurrences to remove observations which were too rare. This allowed me to identify several synthetic namespaces (generated with --cubic) which had significantly bigger avg weight, and I could improve my model by generating features with 4-5-6 interactions based on them. I didn't use any |FOREACH_FEATURE| modification for that; I just added these synthetic features to the dataset as new features, and |--cubic| added 2 additional levels of interaction to them.

6. Using approach above to get input file for libsvm?

Just a thought: I considered making a vw mode which would allow generating an input file for libsvm. I didn't actually do it.

Vanilla LibSVM has a similar format but doesn't support categorical features in the input file. As I understood, there are tools for LibSVM which can convert categorical data to a numeric format using hashing or perhaps dictionaries (I didn't check how exactly it's done). But I wanted to try features generated by |-q| and |--cubic| too, and it looks like there is no equivalent functionality in libsvm. What I was thinking of doing is to make a vw mode like the one described in I.5, with only one step: vw reads an example, generates the |-q| |--cubic| features, and stores the hashes of these features in an output file in libSVM format. No learning is done. That could be made just by writing a new function for the FOREACH_FEATURE template. But I didn't give it a try, as according to my estimates my 5GB input file would become 500GB with all the cubic features inside. So it's just an idea; perhaps someone will find it more useful than me.

7. Time spent in VW's output

Like II.3, I would request printing the time each outputted iteration took since the last printed value. Let's say a |--time| switch which is off by default so as not to break all the 'make test' cases. It may look like this:

Reading datafile = train_freq.vw
num sources = 1
average    since         example     example  current  current  current      time
loss       last          counter      weight    label  predict features     spent
0.775936   0.775936            1         1.0  -1.0000   0.1592       33  00:00:00
0.728623   0.681310            2         2.0  -1.0000  -0.0238       33  00:00:00
0.708393   0.688163            4         4.0  -1.0000  -0.0018       33  00:00:00
0.737389   0.766386            8         8.0  -1.0000   0.2023       33  00:00:01
0.796014   0.854638           16        16.0  -1.0000   0.5368       33  00:00:01
0.866274   0.936533           32        32.0  -1.0000   1.0178       33  00:00:01
0.947102   1.027930           64        64.0  -1.0000   1.0452       33  00:00:01
1.061943   1.176785          128       128.0  -1.0000   1.2128       33  00:00:01
1.208501   1.355059          256       256.0   1.0000   0.6986       33  00:00:01
1.211329   1.214156          512       512.0  -1.0000   0.5613       33  00:00:02
1.032622   0.853914         1024      1024.0  -1.0000  -0.3099       33  00:00:04
0.886021   0.739420         2048      2048.0  -1.0000   0.2083       33  00:00:08
0.768442   0.650864         4096      4096.0  -1.0000  -0.8474       33  00:00:16

The usecase behind this is following:

  • You may observe performance changes as a hyperparameter value changes (for example |--ngram|, or |-q ab| in comparison with |-q ac|) and terminate early if you realise the training speed is becoming unacceptable. In general the same can be achieved by launching |time vw test.vw --examples N|, but that's less convenient and requires guessing a suitable N.
  • Some training modes may require non-linear time for their work. At least |--ksvm| seems to slow down non-linearly with the increase of the support-vector count. Well, I could never make svm work, so I may be wrong here.

8. Bug in --bootstrap

It seems that link functions are applied before the bootstrap code, and this affects its predictions. Example:

$ ../vowpalwabbit/vw train-sets/rcv1_small.dat --bootstrap 20
...
average loss = 0.547866

$ ../vowpalwabbit/vw train-sets/rcv1_small.dat --bootstrap 20 --link=logistic
...
average loss = 1.16499

Expected result: presence of |--link=logistic| shouldn't change average loss value.

9. Broken save_resume

There is a good option, |--save_resume|, which allows me to search hyperparameters for each pass separately. Unfortunately it's broken. For example:

$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -f temp.model --passes 2 --holdout_off
...
average loss = 0.285058

$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -f temp1.model --save_resume --holdout_off
...
$ ../vowpalwabbit/vw -k -c train-sets/rcv1_small.dat -i temp1.model --save_resume --holdout_off
...
average loss = 0.286118

I heard about that problem from a stackoverflow question which I was trying to answer, but the author disappeared without providing a dataset to test. Later I realised this feature might be helpful for me too, so I reproduced, investigated, and fixed (just for my dataset) the problem. It was months ago, so I might not recall all the details, but the problem is in |g.total_weight|, which is not restored in |gd::save_load_online_state()|. If you restore it correctly, you'll get the correct average loss after two invocations of vw.

Then you'll find one more problem: you'll get an incorrect loss in the very first avg loss output after loading a |--save_resume| model, and only in the first one. That's because |old_weighted_examples| is not restored, or something like that; I don't remember exactly. I never fixed this second bug as I had no time for it. Still, it's a problem: for big files on pass 2 or pass 3 you'll get only one printed output (with the default |-P| behaviour), and it's misleading to see a broken loss there and the correct avg loss in vw's summary text.

As for the 1st problem: to fix it properly you have to change the model file. I didn't do that, as my dataset didn't have feature or example weights, so I just added the line |g.total_weight = all.sd->weighted_examples;| after the |save_load_online_state()| call. In such a dataset |total_weight| and |weighted_examples| are equal, and |sd->weighted_examples| is restored from the model file. But for the |train-sets/rcv1_small.dat| dataset used in the example above that's not true: its |weighted_examples == 1000| while |total_weight| is only 920, seemingly because features have weights. If you try my code you'll get |average loss = 0.285076| for command 3, which is still incorrect. So saving this value to the model file seems to be the only robust way to fix the problem.

10. lrq in model file

Params passed with |--lrq| are stored in the model file. When you load it for testing, they'll be doubled if you forget to exclude your |--lrq| params from the command line, and this duplication makes the prediction completely wrong. In the case of |-q| there is some foolproofing for this:

vw data -f model -q qq
vw -t data -i model
vw -t data -i model -q qq

Commands 2 and 3 will give you the same result.

vw data -f model --lrq qq
vw -t data -i model
vw -t data -i model --lrq qq

Here command 3 returns a wrong result, different from command 2.

11. Is parseFloat() too cautious?

I suspect that |strtod()| is unnecessarily called in |parseFloat()| in |parse_primitives.h| when the float ends with a newline, because of https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_primitives.h#L96 |if (*p == ' ')//easy case succeeded.| Perhaps it's better to use |if (*p == ' ' || *p == '\n')//easy case succeeded.|

I would also include |'\t'|, as it's the default separator when you join two files with the Unix /paste/ command.
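The full proposed fast-path condition, written as a tiny helper (the function name is made up for this sketch; in parseFloat() it would just be the inline condition):

```cpp
#include <cassert>

// Proposed fast-path test for parseFloat(): accept the hand-parsed value
// without falling back to strtod() when the number ends at a space, a
// newline, or a tab (tab being paste(1)'s default separator).
inline bool easy_case_succeeded(char terminator)
{
    return terminator == ' ' || terminator == '\n' || terminator == '\t';
}
```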

12. Multiple regressor initialisation

In |initialize_regressor()| in |parse_regressor.cc| it's worth adding else between these blocks:

if (all.random_weights) {...}
if (all.random_positive_weights) {...}
if (all.initial_weight != 0.) {...}

|random_positive_weights| is set to true automatically in |--new_mf| mode in |mf.cc|. I found that these random values (|0.1 * frand48()|) were too big for my dataset (9000 features per example), so I initialised the regressor via |--initial_weight| with something like 1e-5 and got better convergence. But now the regressor is initialised twice. I would put /else/ clauses between these 3 /if/'s and reorder them so |all.initial_weight| comes first.
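A sketch of the proposed control flow, reduced to a pure function for illustration (the member names mirror vw's options, but the returned labels are just for this sketch):

```cpp
#include <cassert>
#include <string>

// Proposed initialize_regressor() control flow: the three initialization
// branches become mutually exclusive else-ifs, with an explicit
// --initial_weight taking precedence over the random modes.
std::string chosen_initializer(bool random_weights,
                               bool random_positive_weights,
                               float initial_weight)
{
    if (initial_weight != 0.f)
        return "initial_weight";        // explicit user value wins
    else if (random_weights)
        return "random_weights";
    else if (random_positive_weights)
        return "random_positive";       // set automatically by --new_mf
    return "zeros";
}
```

With this ordering, |--new_mf --initial_weight 1e-5| initialises the regressor exactly once, from the explicit value.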

13. Outputting constant feature weight

I found it interesting and sometimes useful (to select a better |initial_weight| value, for example) to print out the weight of the constant feature which vw adds by default. The code is:

#vw.h
inline void print_constant_weight(vw& all)
{
    float w = get_weight(all, constant, 0);
    cerr << "constant feature ";
    if (all.initial_constant != 0.)
        cerr << "(" << all.initial_constant << ") ";
    cerr << "weight = " << w << endl;
}

usage:

      if (all->add_constant)
          VW::print_constant_weight(*all);

I also print it out right after the regressor is initialised from a file, in case that happens. So it may be displayed twice: before the training pass and at its end.

14. Typos.

The help line for initial_weight says: |--initial_weight arg Set all weights to an initial value of 1.|

It should be "to an initial value of arg.", I suppose.

15. Online clustering

I believe VW needs some clustering algorithm. I made one for the case where all feature weights are equal to 1. It's not something I would push into the master branch; it's not production quality. For example, I didn't implement the |save_load| methods and used |--passes 2| to switch into test mode on the second pass and print out the results. Worst of all, my kaggle model couldn't benefit from these results, so the whole algorithm may work incorrectly. Still, I think online kmeans is a must for VW, and I would like to start a discussion about that. I'll publish the code bits with my notes later.

II. Functionality of utl\vw-csv2bin

It seems the vw-csv2bin script isn't flexible enough, so every time I end up using my own code written in C++ or R. To make it a bit more usable I would ask to add a few functions to it:

  • A param to skip the header line which may be in the input file.
  • A param to identify the column with the tag value, and code to correctly process it.
  • A param that allows ignoring some columns. It would be better to have 2: |--keep| and |--ignore|, like in vw. And it would be great to have support for intervals, like |--ignore 12:50| (use all columns except 12 through 50). Of course, this will require accepting several values, like |--ngram 1 --ngram 6q| in vw.
  • A param to specify namespaces for columns. This is the most tricky one.
  • Something that allows converting numerical data into categorical by adding a suffix to the value.

The last two are quite difficult to do. If someone ever wants to try, I can suggest an argument format for these parameters.

III. Few notes on vw-hypersearch

Thanks to Ariel, vw-hypersearch became one of my favourite tools, and I wouldn't say there is much to improve here.

1.

I always use |-v| mode and find it very useful, especially when you've just started to work with this script and wonder how it executes commands in |-t| mode. Now we also have |-c| mode, and the |-v| param may become even more useful for users. Unfortunately, one won't know about its existence without looking inside the code, as it's not described in the standard help output. Could you please add it there?

2.

A few times I wished vw-hypersearch supported simple grid search. I'm not sure how useful that is, but it seems simple to implement. Let's just add a |-g X| param which switches to grid-search mode with step /X/. If the search interval is /[a,b]/ then the script could run vw for /a/, /a+x/, /a+2x/, ..., /a+nx/, /b/. If /a+x > b/ then just for /a/, /b/. If /x/ is negative (it seems such a value could be passed on the command line as |-g=-1| or with a few other tricks in Unix), it shall run the vw command for /b/, /b+x/, ..., /b+nx/, /a/.
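The grid-point rule described above can be sketched as a small function (hypothetical helper, just to pin down the endpoint behaviour):

```cpp
#include <cassert>
#include <vector>

// Grid points for interval [a, b] with step x: a, a+x, a+2x, ..., then b
// itself. If a + x already overshoots b, only {a, b} are tried; a negative
// x walks from b down towards a instead.
std::vector<double> grid_points(double a, double b, double x)
{
    std::vector<double> pts;
    if (x > 0)
        for (double v = a; v < b; v += x) pts.push_back(v);
    else
        for (double v = b; v > a; v += x) pts.push_back(v);
    pts.push_back(x > 0 ? b : a);  // the far endpoint is always tried
    return pts;
}
```

For example, |-g 3| on [0, 10] would try 0, 3, 6, 9, 10, and |-g=-2| on [1, 5] would try 5, 3, 1.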

3.

It would be great to print out execution time for every vw launch at the end of the line:

trying 47 ......................................... 0.463323 00:30:15
trying 33 ......................................... 0.469066 00:28:56
trying 56 ......................................... 0.462171 00:34:15

The usecase for that is searching over parameters which directly affect vw performance: |--ngram|, |--lrq|, |--rank|, etc. Sometimes you may find that you're getting better results on the validation step with an increase of rank (for example), but that increase is just not worth the performance penalty you take. This may be triggered by the |--time| param (off by default).

4.

The complicated one: vw-hypersearch could accept non-numeric params. It's technically not so difficult to iterate over characters in a string instead of numbers in a grid search, and there are vw params which accept characters, like |--ngram|, |-q|, |--cubic|. For example, passing |vw-hypersearch -g abcd vw test.vw -q %f| shall launch

vw test.vw -q af
vw test.vw -q bf
vw test.vw -q cf
vw test.vw -q df
5.

And the most complicated: support of multiple /%/ in grid mode. Each /%/ may be treated as an inner iteration loop. It makes the most practical sense in the case of characters:

vw-hypersearch -g abc vw test.vw -q %%

vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q ba
...
vw test.vw -q cb
vw test.vw -q cc

It would be great if it weren't limited in the number of inner loops and supported |--cubic %%%|, for example.

6.

If 5 is ever implemented, I would request a |--combinations| key for it to switch on this mode and generate combinations instead of permutations. That's needed as '-q ab' and '-q ba' are equal. Imagine you have N characters or numbers and 3 inner loops: /i1/, /i2/, /i3/. In the default mode the loops are /i1:{1..N}/, /i2:{1..N}/, /i3:{1..N}/, and with |--combinations| they are /i1:{1..N}/, /i2:{i1..N}/, /i3:{i2..N}/. Thus |vw-hypersearch -g abc vw test.vw -q %% --combinations| will be just:

vw test.vw -q aa
vw test.vw -q ab
vw test.vw -q ac
vw test.vw -q bb
vw test.vw -q bc
vw test.vw -q cc
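The inner-loop rule from III.6 can be sketched as a recursive generator (hypothetical helper names; only the start-index trick is the point here):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build each argument one character at a time. In combinations mode every
// inner loop starts at the index chosen by the enclosing loop, so only
// order-insensitive tuples like "ab" (and not also "ba") are produced.
static void fill(const std::string& chars, std::size_t slots,
                 std::size_t start, std::string& cur, bool combinations,
                 std::vector<std::string>& out)
{
    if (cur.size() == slots)
    {
        out.push_back(cur);
        return;
    }
    for (std::size_t i = combinations ? start : 0; i < chars.size(); i++)
    {
        cur.push_back(chars[i]);
        fill(chars, slots, i, cur, combinations, out);
        cur.pop_back();
    }
}

// Arguments a character grid with `slots` inner loops would try.
std::vector<std::string> grid_args(const std::string& chars,
                                   std::size_t slots, bool combinations)
{
    std::vector<std::string> out;
    std::string cur;
    fill(chars, slots, 0, cur, combinations, out);
    return out;
}
```

For "abc" with two slots, combinations mode yields exactly the six arguments listed above, while the default mode yields all nine permutations.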

IV. utl\logistic

I was quite surprised to get this help output by launching

$ ../vowpal_wabbit/utl/logistic --help

logistic version [unknown] calling Getopt::Std::getopts (version 1.10 [paranoid]),
running under Perl version 5.20.1.

Usage: logistic [-OPTIONS [-MORE_OPTIONS]] [--] [PROGRAM_ARG1 ...]

The following single-character options are accepted:
        Boolean (without arguments): -0 -1

Options may be merged together. -- stops processing of options.
[Now continuing due to backward compatibility and excessive paranoia.
See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]

It seems to be some automatically generated Perl summary which isn't informative enough, as it doesn't print the meaning of the arguments. The real help page can be obtained by launching the script without parameters:

$ ../vowpal_wabbit/utl/logistic
Usage: logistic [Options] [files...]
    Maps label x (1st number on line) to logistic(x)
    Options:
        -0      Use logistic(x) -> [0 .. 1] range
        -1      Use logistic(x) -> [-1 .. 1] range (default)

Could we make |./logistic --help| work like |./logistic| to be more "unix-style"?
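For reference, the two mappings the help text describes can be written out as follows. This is a sketch assuming the [-1, 1] variant is the standard logistic rescaled (2*sigma(x) - 1, i.e. tanh(x/2)); the script itself was not checked for this sketch.

```cpp
#include <cassert>
#include <cmath>

// -0 flag: standard logistic, maps x into [0, 1].
double logistic01(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// -1 flag (default): rescaled logistic, maps x into [-1, 1].
double logistic11(double x) { return 2.0 * logistic01(x) - 1.0; }
```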

APPENDIX

The bits of |--redefine| implementation

  1. Declare

bool redefine_some;
unsigned char redefine[256];

in |struct vw| in |global_data.h|.

  2. Add a redefine option in |parse_args.cc| somewhere near the keep/ignore params declaration. In my fork it would be:

("redefine", po::value< vector<string> >(), "redefine namespaces beginning with characters from string n... as namespace k. <arg> shall be in form 'n:=k'. Use ':=k' to redefine all namespaces.")

  3. Add the following implementation after the param declaration:

all.redefine_some = false; // false by default

if (vm.count("redefine"))
{
    // all.redefine[i] will keep the new namespace literal for the i-th old
    // namespace; by default it's i itself.
    for (size_t i = 0; i < 256; i++)
        all.redefine[i] = i;

    // note that declaration order matters here, so
    // --redefine :=L --redefine ab:=M --ignore L
    // will ignore all except a and b under M
    vector<string> redefinitions = vm["redefine"].as< vector<string> >();
    for (vector<string>::iterator par = redefinitions.begin(); par != redefinitions.end(); par++)
    {
        string arg = *par;
        int operator_pos = -100;
        bool operator_found = false;
        unsigned char new_namespace = ' ';

        // let's find operator ':=' and the new namespace after it
        for (size_t i = 0; i < arg.length(); i++)
        {
            if (arg[i] == ':')
                operator_pos = i;
            else if ((arg[i] == '=') && (operator_pos == (int)i - 1))
                operator_found = true;
            else if (operator_found)
            {
                new_namespace = arg[i];
                break;
            }
        }

        if (!operator_found || new_namespace == ' ')
        {
            cerr << "argument of --redefine is malformed. Valid format is n:=k" << endl;
            throw exception();
        }

        all.redefine_some = true; // at least one valid redefinition rule was found

        if (operator_pos > 0)
            // namespaces from the substring before the operator are redefined
            for (size_t i = 0; i < (size_t)operator_pos; i++)
                all.redefine[(size_t)arg[i]] = new_namespace;
        else
            // substring before ':=' is empty: all namespaces are
            // redefined (case of ':=L')
            for (size_t i = 0; i < 256; i++)
                all.redefine[i] = new_namespace;
    }
}

  4. In |nameSpaceInfo()| in |parse_example.cc|, after

index = (unsigned char)(*reading_head);

add this line:

if (all->redefine_some) index = all->redefine[index];

  5. There is no need to do anything for the case of reading from a cache file, as the namespaces stored there will already be redefined. On the other hand, this makes it impossible to use a cache file in an automatic feature search with |--redefine|.

— Reply to this email directly or view it on GitHub: https://github.com/JohnLangford/vowpal_wabbit/issues/510.

JohnLangford commented 9 years ago

For (1), using ":=" is a bit unfortunate because that is the define operator in some branches of mathematics, and it goes right-to-left rather than left-to-right. Instead of changing operators though, can you change the order of specification? So it's "a:=efg"?

For (2), perhaps we should discuss (3),(4),(5) first.

For (8) Grep for 'reductions' in parse_args.cc, and you'll see how we first create a stack of reductions, then call setup through them to create a compiled stack of learning reductions. By manipulating the location in the order, you can manipulate which reductions call which.
I think we really need to shift around where it is to achieve a good solution---even (b) below is rather antimodular.

For (9), Every model file has a version number embedded. We can check that and bail/warn appropriately. In general, I much prefer using string parameters for arguments passed in because that achieves forward compatibility. I need to spend some time making a pass removing header parameters and converting them into string parameters where that is possible. I regard this as not possible for save_resume though, because this is really about VW communicating with itself.

(More later)

-John

On 02/14/2015 03:51 AM, Alexander Trufanov wrote:

Ok, prioritised ones first: (1) [redefine tag] Are you ok with the param name and operator? It could be --rename_ns or whatever instead of redefine. As for the operator (":="), it could be any 2 chars except for '>', '|', '<', '#'.

(2) [Simple combinations in -q and --cubic] I would add a new bool for this mode in the vw class, turn it on by default and add a parameter (--combinations 0, --simple_combinations 0 or a flag --permutations) to switch it off if someone ever needs that. If you are ok, pls choose the param name. Also, I agree that implementation matters. This point is linked with (3), (4) and (5), and I'll try to advocate for them first.

(8) [bootstrap problem]

I've tried to fix this today along with the 3 pull requests I sent, but realised its complexity. I can't refactor the bootstrap usage - I have no comprehensive vision of the vw architecture, plus the current reduction setup wasn't implemented in my old fork. But I can think of 2 simple ways to fix it. a. Bad one - detect usage of a link function and reverse its effect before averaging predictions in the bootstrap code. b. Better one - keep a reference to the link function somewhere in shared data and store the original prediction in the example class just before the link is applied to it. In bs we can use these unprocessed predictions and apply the right link function at the end. I think the second one might be a temporary solution.

(9) save_resume problem.

The tough place here is model file compatibility. There are three ways to fix it:

  1. Worst: partial fix - assume g.total_weight is equal to all.sd->weighted_examples and fix it for datasets without weights only.
  2. Right: add a few new fields to the model file header and address compatibility by checking its version or whatever (I haven't checked yet what mechanisms are implemented for that).
  3. Cautious: pass any new required data as a string, like the `-q` and `--ngram` parameters stored in the model. Declare a new param just to accept these values, which will be undocumented in vw's help. That will keep full binary backward compatibility with old model files for sure. But an old vw version still won't be forward compatible with new models, and vw doesn't tolerate unknown parameters.

I would prefer 2. Perhaps you'll find a reason to prefer 3 or 1. I actually don't know how critical keeping compatibility with old model files is for vw. Perhaps someone uses them for exchange?

(5) [regressor audit]

Can't recall how I named the param for that. How about `--audit_regressor arg`, where arg is a filename to store the audit output?

(6) [libsvm] I'll think about it after (5).

NOT PRIORITIZED ONES:

(4) [higher orders interactions]

It explodes, but not so fast in the case of simple combinations or when each feature has its own namespace. Working with the avazu dataset (35mln records) I found it bearable to have up to 15k features per example. Of course it's not what you want for the exploration stage, but for final models it's ok. It's common for kaggle that winning models take a day or two to train. In the case of simple combinations with 30 features in the dataset we'll get the following growth:


pow: growth by
1: 30 (total feature count)
2: 30!/((30-2)!*2!) = 29*30/2 = 435
3: 30!/((30-3)!*3!) = 28*29*30/3! = 4060
4: 30!/((30-4)!*4!) = 27*(28*29*30)/(3!*4) = 27405   # 27/4 times bigger
5: 30!/((30-5)!*5!) = 26*27*(28*29*30)/(3!*4*5) = 142506   # 26/5 times bigger
...

Quite big numbers if you sum them up. (But still I used pow 3 + pow 4-5 for some selected features.) It grows at step k by (N-k+1)/k = (N+1)/k - 1, where N is the number of original features. So as N is fixed, the growth rate is decreasing and eventually becomes less than 1. Let's check for a smaller feature space:

for 10 features
pow: growth by
1: 10
2: 45
3: 120
4: 210
5: 252
6: 210
7: 120
8: 45
9: 10
10: 1
sum: 1023

1023 features is something I can survive easily. So for a dataset with 10 features I can work with all possible simple combinations. I checked how big the number of features could be while still fitting all possible simple combinations in 10-15k. It's 13 (8191 features total) - 14 (16383 ftrs). And if we fix the max power of simple combinations to 5, it will be 17 (9401 ftrs) - 18 (12615 ftrs).

In practice, if you have 30 features you can split them into two namespaces (let them be a and b, 15 features each) and use `--cubic aab` for all of them (~4000 features in simple-combinations mode, I suppose) + `-q aaaa -q aaaaa` (+4363 features with order-5 interactions).

Of course, all my evaluations are about the technical possibility of working with such feature spaces on my laptop, and that doesn't mean your model's generalisation will improve. In practice I found convergence drops after order-3 interactions, but it could be improved with higher-order interactions applied to a selected subset of features. This makes the trick described in the previous paragraph (splitting a namespace into a part for higher degrees and a part for cubic only) not just "only technically feasible" but also the most useful usecase.

(3). [Caching generated features]

I think you've missed my point on this one. By caching I meant caching generated features in memory for one example, not caching them for all examples and storing them in a cache file for further use. As I have already shown, `-q aa --cubic aaa` will generate ~4500 additional features for the 30 features in namespace a of every example ec. And they'll be generated in the FOREACH_FEATURE template function in a loop. I know that this template function is called many times for each example during the training phase (to calculate the prediction, to calculate the update per feature, to update the feature, etc.), and every time the loop in FOREACH_FEATURE makes ~4000 iterations. The idea was to cache these generated features in a new namespace in the ec object during the first FOREACH_FEATURE call and use them in further calls. They'll be deleted from memory together with example ec, so it won't consume much more memory than usual. They shouldn't be written to a cache file.

(7) [--time]

Let's come back to this one later as its priority is low.

(13) [k-means]

It should work as a usual clustering algorithm, perhaps outputting the closest cluster and the distance to it. My intention was to use its model on both train and test datasets to get new features for them and try to push these new features into the existing gd model.

(12) [constant feature weight output]

If you mean my --audit_regressor, then it saves too much data to a file. Of course, the constant will be somewhere close to the head of the file, as it's present in every example. But it will still require some additional effort to get it out. I believe it should be something like the best constant output - a value which is nice to know once and forget right after. But it's always printed somewhere, and could eventually catch your attention if it changed dramatically, indicating that something is not going as you expected. I understand that VW's output to stdout is already quite big. Perhaps we can make it optional (a --print_constant flag, like the --no_constant we already have) and disabled by default? Btw, the sample output for it in my fork would be the following:

```
truf@truf-laptop:~/ml/$ ./vw -t -i temp.model -q qq --cubic qqq --print_constant
creating quadratic features for pairs: qq
creating cubic features for triples: qqq
ignoring namespaces beginning with: L
only testing
Num weight bits = 28
learning rate = 0.02
initial_t = 1
power_t = 0.5
constant feature weight = 0.000208689
using no cache
Reading datafile =
num sources = 1
average    since        example  example    current  current  current
loss       last         counter  weight     label    predict  features
0.088176   0.088176         1       1.0    -1.0000  -2.3840     9164
0.481576   0.874977         2       2.0     1.0000  -0.3356     9164
0.836905   1.192233         4       4.0    -1.0000  -0.4495     9164
0.685445   0.533984         8       8.0    -1.0000  -5.1782     9164
0.600394   0.515343        16      16.0    -1.0000  -0.0491     9164
0.529447   0.458501        32      32.0    -1.0000  -3.3484     9164
0.474734   0.420021        64      64.0    -1.0000  -0.7701     9164
0.440379   0.406023       128     128.0    -1.0000  -1.1180     9164
0.469065   0.497752       256     256.0    -1.0000  -0.9130     9164
0.478631   0.488196       512     512.0    -1.0000  -1.8638     9164
0.468965   0.459299      1024    1024.0    -1.0000  -0.4092     9164
0.464723   0.460481      2048    2048.0     1.0000  -2.1313     9164
0.446528   0.428333      4096    4096.0    -1.0000  -0.6822     9164
0.436118   0.425708      8192    8192.0    -1.0000  -1.0585     9164
0.434248   0.432377     16384   16384.0    -1.0000  -2.6964     9164
0.414527   0.394807     32768   32768.0    -1.0000  -1.0862     9164
0.363563   0.312599     65536   65536.0    -1.0000  -2.6378     9164
0.366156   0.368748    131072  131072.0    -1.0000  -3.1954     9164
0.382256   0.398357    262144  262144.0    -1.0000  -1.6503     9164

finished run
number of examples per pass = 422516
passes used = 1
weighted example sum = 422516
weighted label sum = -278750
average loss = 0.39556
best constant = -1.5847
best constant's loss = 0.456094
total feature number = 3871936624
constant feature weight = 0.000208689
```

Anyway, it's low prioritised so don't hesitate to reject this proposal.

— Reply to this email directly or view it on GitHub https://github.com/JohnLangford/vowpal_wabbit/issues/510#issuecomment-74367539.

trufanov-nok commented 9 years ago

as for (1) --redefine Good point. Then the param arg will be in the form k:=s, where s is a string of namespaces and k is the new namespace. I'm thinking about its special cases now. This one is simple: --redefine :=s - all s namespaces are redefined to the ' ' (space) character and thus technically removed. The one that confuses me is --redefine k:=. Initially I used an empty s to redefine all namespaces. That was because I hadn't noticed that there already is a special character ':' which works as a wildcard for namespaces, implemented in -q. So now I'm going to process the parameter like this:
--redefine k:= - all features which don't belong to any namespace (i.e. belong to ' ') are redefined to k
--redefine := - nothing is done
--redefine k:=: - all namespaces are redefined to k
--redefine :=: - all namespaces are ignored (redefined to ' ', the default namespace)

But I wonder why : was chosen as a wildcard instead of *? Shall I stick with this char or there are plans to change it?

(8) [bootstrap problem] I leave it on someone else then.

(9) save_resume problem.

Ok, I'll add new fields to the model file, increment its format version and make sure VW is still compatible with old model files too. A warning will be printed if the file version is old and --save_resume is used.

JohnLangford commented 9 years ago

For 1,9, sounds good.

For (5), --audit_regressor sounds like a good starting point.

For (4), I think the primary question is how to implement things. The options I see are: (a) hack the foreach_feature template in gd.h. (b) create a custom reduction which explodes the set of features manually. (c) modify setup_example() in parser.cc along the lines of (b).

Approach (b) is the most modular.

Approach (a) was the most efficient for -q a long time ago. But the advantage was small (~10%), and the code base has changed significantly since I evaluated that.

For (3), I think I understand what you are saying. The way to do that is along the lines of (b). However, it's not clear to me that it will be a win given my previous experience. The foreach_feature is the inner loop, but since it's templated and the hash is simple, it's very fast to compute.

For (13), creating a custom base learner that does k-means seems quite feasible and plausibly interesting. Making it be useful as a source of features for another learning algorithm is somewhat trickier, but plausibly doable. In essence, you could make the k-means operate as a reduction which takes as input an example then augments it with cluster-center features.

-John


JohnLangford commented 9 years ago

For (12), I regard the best constant's loss as much more interesting than the best constant's value. The first answers an important natural question: Are you doing better than a simplistic baseline?

What question does the best constant's value answer? If it changed significantly, what would that mean?

-John


trufanov-nok commented 9 years ago

Ok, after the recent commits I'm going to take a few months' break to spend more time on my own project. But I'll keep an eye on VW to make sure there won't be any problems caused by the newly implemented interactions generator, and I'll jump in if needed.

After that I would like to return to the discussion of suggestion 5.

arielf commented 9 years ago

Alex, Thanks so much for all your effort on interactions. Highly appreciated!

JohnLangford commented 9 years ago

Yes, thanks much. That was a long haul, and I was impressed with your tenacity.

-John
