VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

vw core CLI: --nn data-leak: SGD update is incorrect #4614

Open arielf opened 1 year ago

arielf commented 1 year ago

Describe the bug

Using --nn causes an unexpected data-leak between separate name-spaces. The update goes the wrong way (against the desired gradient direction towards minimum loss).

This is especially important for correctness of learning from time-series data-sets.

Examples coincident in time should not be able to interact. The current vw update violates this principle when --nn is used; i.e., learning results are not generalizable when --nn is used with a time series containing coincident examples (separated into distinct name-spaces) after sorting by time.
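For illustration (hypothetical data, in vw input format), here are two coincident examples kept in separate name-spaces; by the principle above they should not be able to influence each other:

```
4.0 t42-seriesA|seriesA f1:1
2.0 t42-seriesB|seriesB f1:1
```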

How to reproduce

Full code and data to reproduce is here:

https://github.com/arielf/vw-bugs/tree/main/nn-data-leak

Version

9.7.0 (git commit: fc9ab2549)

OS

Linux, Ubuntu 20.04

Language

CLI/ C++

Additional context

I tried my best to explain the bug here.

https://github.com/arielf/vw-bugs/tree/main/nn-data-leak

My guess is that the leak happens through the full connectivity of the features via the hidden layer.

Since the full connectivity is a done deal (imposed at the start of the run by the fact that we want a fully-connected NN), it seems to me that the SGD update should somehow skip updates to weights that have nothing to do with the features in the example. In other words, the skips should happen at run time (rather than at initialization time), and only those target feature-nodes present in the current example (and/or name-space) should be updated.
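To make the suspected mechanism concrete, here is a minimal pure-Python sketch (a toy model, not VW's actual update code) of a single-hidden-unit regressor in the spirit of `--nn 1`. It illustrates why skipping per-feature updates alone cannot prevent the leak: the hidden bias, hidden-to-output weight, and output bias are shared, so they move on every example and shift predictions even for name-spaces whose features were not present.

```python
import math

# Toy single-hidden-unit regressor, loosely modeled on `--nn 1`
# (a sketch with made-up initial values, NOT VW's implementation).
class TinyNN:
    def __init__(self):
        self.w = {}        # per-feature input->hidden weights (sparse)
        self.hb = 0.0      # hidden-unit bias (shared across all examples)
        self.v = 0.1       # hidden->output weight (shared)
        self.ob = 0.0      # output-layer constant (shared)

    def predict(self, x):
        h = math.tanh(sum(self.w.get(f, 0.0) * v for f, v in x.items()) + self.hb)
        return self.v * h + self.ob

    def learn(self, x, y, lr=0.5):
        h = math.tanh(sum(self.w.get(f, 0.0) * v for f, v in x.items()) + self.hb)
        p = self.v * h + self.ob
        g = p - y                        # d(squared loss)/d(prediction), up to a factor of 2
        dh = g * self.v * (1.0 - h * h)  # gradient reaching the hidden unit
        # Per-feature weights: only features present in x are touched,
        # so w['b'] stays untouched while learning on {'a': 1}.
        for f, v in x.items():
            self.w[f] = self.w.get(f, 0.0) - lr * dh * v
        # Shared parameters: updated on EVERY example, regardless of name-space.
        self.hb -= lr * dh
        self.v -= lr * g * h
        self.ob -= lr * g

nn = TinyNN()
before = nn.predict({'b': 1.0})   # features 'a' and 'b' never co-occur
nn.learn({'a': 1.0}, 1.0)
after = nn.predict({'b': 1.0})
print(before, after)              # differ: the leak flows through hb, v, ob
```

Even though `w['b']` is never touched, the prediction for `{'b': 1}` changes after training on `{'a': 1}`, which is the same pattern as the OutputWeight/OutputLayerConst movement shown in the weight dump further down this thread.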

Ideally, this skip vs. non-skip (the current default) behavior would be controlled by a CLI switch: --respect_namespaces, --restricted_update, or something like this.

Would appreciate taking a look, thanks!

ataymano commented 1 year ago

Can be reproduced with a simpler configuration. Data:

4.0 always4|always4  f1:1
2.0 always2|always2  f1:1
4.0 always4|always4  f1:1
2.0 always2|always2  f1:1
4.0 always4|always4  f1:1
2.0 always2|always2  f1:1
4.0 always4|always4  f1:1
2.0 always2|always2  f1:1
4.0 always4|always4  f1:1
2.0 always2|always2  f1:1

command line:

--noconstant --learning_rate 2 --nn 1

Looking.

ataymano commented 1 year ago

Can you please elaborate on your suggestion? Updates to weights that are associated with features are already isolated, but the hidden and last layers have no notion of a mapping to a certain feature:

1) They get updated on every example and affect every prediction.
2) They are not affected by the --noconstant flag.

Here is what happens with all weights of single unit network after seeing the sequence of a/x, b/x, a/x features.

from vowpalwabbit import pyvw
import json
import pandas as pd

def weights_pd(weights_str):
    # Flatten VW's JSON weight dump into a (name, value) DataFrame.
    return pd.DataFrame([
        {
            'name': f'{w["terms"][0]["namespace"]}/{w["terms"][0]["name"]}',
            'value': w['value']}
            for w in json.loads(weights_str)['weights']])

# Single hidden unit, no constant feature, named weights dumped as JSON.
vw = pyvw.Workspace('--noconstant --nn 1 --dump_json_weights_include_feature_names_experimental --invert_hash inv.txt')

vw.learn('1 |a x:1')
print('-----------AFTER SEEING a/x-------------------')
print(weights_pd(vw.json_weights()))

vw.learn('1 |b x:1')
print('-----------AFTER SEEING b/x-------------------')
print(weights_pd(vw.json_weights()))

vw.learn('1 |a x:1')
print('-----------AFTER SEEING a/x again-------------------')
print(weights_pd(vw.json_weights()))

Output (note that the a/x weight stays the same after seeing an example from a different name-space, OutputWeight changed only after the second a/x example, and OutputLayerConst changes on every example):

-----------AFTER SEEING a/x-------------------
                name     value
0        /HiddenBias -0.316100
1                a/x -0.244499
2      /OutputWeight -0.302343
3  /OutputLayerConst  0.393395
-----------AFTER SEEING b/x-------------------
                name     value
0                b/x -0.171394
1        /HiddenBias -0.316100
2                a/x -0.244499
3      /OutputWeight -0.302343
4  /OutputLayerConst  0.604431
-----------AFTER SEEING a/x again-------------------
                name     value
0                b/x -0.171394
1        /HiddenBias -0.316100
2                a/x -0.333082
3      /OutputWeight -0.369246
4  /OutputLayerConst  0.709840
arielf commented 1 year ago

Hi Alexey,

First, thanks so much for looking into this.

I understand that the hidden layer currently has no notion of name-space isolation.

I'm not sure if this is too involved (won't-fix, or "it's a feature, not a bug"), but ideally, assuming anything is possible, there would be a new switch (--restricted_updates) when --nn is in effect.

This switch would modify the current behavior to skip (not update) any hidden-layer pairing of features where either of the two crossed features is not in the present example's name-space.

Is this clearer? Too hard/involved to implement?

This is not so critical for me; I figured out a way around the issue. I just thought it would be valuable to report anyway, since this behavior is unexpected in the ways demonstrated: (1) non-monotonic convergence; (2) behavior unlike other reductions.

Thanks again!

ataymano commented 1 year ago

It sounds like that would mean the hidden layer is never updated? But maybe I am missing something. We have an open meeting every other Thursday (including one this week) - do you want to join and discuss this proposal with the team? Also, as a (possible) workaround: --nn can pass inputs directly through to the last layer (--inpass). It does not guarantee the separation you need, but it increases the chance of it happening (at least simple examples like the one in this issue become monotonic).

arielf commented 1 year ago

Fixed the link, which wasn't working. Thanks so much for the invite. Sorry, I can't make it.

So from what you say, I understand each node in the hidden layer is fully connected (to all possible inputs). If so, then you're correct, and my previous "idea" wouldn't work because any name-space interacts with all other name-spaces as a necessary side-effect of the full connectivity in the hidden-layer (shared weights in all nodes).

--inpass already exists as a sub-option to --nn, but it does not disable passing via the hidden layer (it is in addition), correct? Maybe a mode doing only the --inpass part would be a useful option. I do want crossing of features, but only if they are in the same name-space (the way other vw algos work).
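The "direct connections only" idea can be sketched as follows (a toy model, not VW code, assuming squared loss and no constant term): with only per-feature weights and no shared hidden or output parameters, an update on one name-space's features cannot move predictions that involve only another name-space's features.

```python
# Sketch of an inpass-only model: prediction is a sum of per-feature
# weights, with no hidden layer and no constant (toy code, not VW's).
class DirectOnly:
    def __init__(self):
        self.w = {}   # sparse per-feature weights

    def predict(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items())

    def learn(self, x, y, lr=0.5):
        g = self.predict(x) - y          # squared-loss gradient (up to a factor of 2)
        for f, v in x.items():           # only features present in x are updated
            self.w[f] = self.w.get(f, 0.0) - lr * g * v

m = DirectOnly()
before = m.predict({'b': 1.0})
m.learn({'a': 1.0}, 1.0)
after = m.predict({'b': 1.0})
print(before, after)   # identical: only weights of present features move
```

This is only the isolation half of the story; it gives up the hidden-layer feature crossing entirely, which is why a per-name-space restriction on the crossing (rather than its removal) is what the proposal above asks for.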

I realize this may not be trivial.