Hello @FabianKaiser, thank you for reporting the issue and providing the code to replicate it. I will look into this by the end of this week.
@jackgerrits Ok, so I checked this one and I have a few comments. The problem is not present when predicting with the VW CLI using the same data and a model trained with the Python script, so the bug appears to be specific to the Python bindings. I don't know what the best way to debug Vowpal Wabbit's Python bindings is, so I haven't yet tried to find the source of the issue. However, when running the code provided by @FabianKaiser (this is the first time I have used these bindings since the recent update of PLT with --probabilities), I noticed that the --probabilities option for the PLT reduction is not correctly implemented in Python. When the option is off, the predict method returns a list of indices/ids of the predicted labels (ints) as expected, e.g., [153, 614, 151, 606, 156]. But when the --probabilities option is turned on, the method returns only the list of scores (floats), e.g., [0.753, 0.542, 0.32, 0.111, 0.05], without any information about the label indices/ids, which makes the option useless in Python. With the --probabilities option, the predict method in Python should return something like a list of tuples containing both the id and the score, e.g., [(153, 0.753), (614, 0.542), (151, 0.32), (606, 0.111), (156, 0.05)], or some similar format.
Hello @mwydmuch , Thanks! I have a comment about the output. This is actually similar to the --probabilities option of OAA. The output is simply sorted by class. This works with the threshold 0.0 option, because that means we output every probability and thus the index maps to the class. It is broken without this option though.
@FabianKaiser Yes, I noticed the "threshold": 0.0 you used, and yes, it works in this case. But this isn't really the intended way of using PLT. If you need all the probabilities, then it is probably better to just use OAA. The main feature of PLT is that it allows quickly predicting only the top-k labels or those above a given threshold. So the --probabilities option is basically broken for the Python bindings (it works correctly in the CLI VW, where it returns the output in <label>:<score> format).
Okay, so regarding the original memory issue: using 30 bits is quite high. It is going to result in a model of 32 bits in this instance (since GD by default adds two bits), so the model itself is 4GB of memory. Is it possible you are running out of memory? There certainly could be a memory issue in Python, but I wanted to explain that part first.
I didn't consider that when we used "action_scores" to expose the probabilities here. Interesting. I am not sure of the original reason, but instead of being exposed as a list of tuples, the action_scores list gets flattened according to the indices when it reaches Python. This is clearly an issue for sparse results.
I am not sure what the best path is for fixing this in the main Python bindings. However, I have been working on a new set of Python bindings for some time to fix many of these sorts of legacy issues. In the new bindings, the PLT output is a list of (class, probability) tuples, so the threshold 0 trick is not needed. It would also be interesting to know whether your memory issue is fixed when using these.
The biggest difference for you is probably that the new bindings intentionally do not expose VW's progressive loss and leave it up to the user to evaluate the metrics they care about for their problem. (This is the plan for now at least; I am very happy to discuss approaches and concerns.) The flow-on effect of this is that when doing multipass learning with the new bindings, the control is entirely up to the user: passing over the data multiple times, calling end_pass, and managing a holdout set themselves.
So, I converted your problem to the new bindings. (Make sure to use version 0.3.0, which I am in the process of releasing now, for the end_pass function.)
pip install vowpal-wabbit-next==0.3.0
import vowpal_wabbit_next as vw
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
def to_vw_format(text: str, label=None) -> str:
    if label is None:
        label = ''
    return f'{label} |text {text}'

def evaluate(labels: pd.DataFrame):
    print(f"precision: {precision_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"recall: {recall_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"f1: {f1_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    true_positives = labels[labels.columns.difference(['target'])].apply(lambda x: x == labels['target']).sum()
    print(f"accuracy: {true_positives['pred'] / len(labels)}")
data = pd.read_csv('reddit_data_sample.csv')
target_var = 'target'
training_var = 'text'
cleaned_targets = data[target_var].dropna()
unique_labels = cleaned_targets.unique().tolist()
num_classes = len(unique_labels)
numbers = list(range(num_classes))
mapping = dict(zip(unique_labels, numbers))
training_data = data.sample(frac=0.9, random_state=25)
testing_data = data.drop(training_data.index)
training_data = training_data.dropna(subset=[training_var]).sample(frac=1).reset_index(drop=True)
params = {
    'loss_function': 'logistic',
    'plt': num_classes,
    'bit_precision': 30,
    'learning_rate': 0.5,
}
cmdline = [f"--{k}" if isinstance(v, bool) else f"--{k}={v}" for k, v in params.items()]
model = vw.Workspace(cmdline)
text_parser = vw.TextFormatParser(model)
num_passes = 3
for i in range(num_passes):
    print(f"pass {i}")
    for text, label in zip(training_data[training_var], training_data[target_var]):
        vw_label = mapping[label]
        model.learn_one(text_parser.parse_line(to_vw_format(text, vw_label)))
    model.end_pass()
learned_model = model.serialize()
params = {
    'loss_function': 'logistic',
    'probabilities': True,
    'top_k': 1,
}
cmdline = [f"--{k}" if isinstance(v, bool) else f"--{k}={v}" for k, v in params.items()]
model = vw.Workspace(cmdline, model_data=learned_model)
predictions = testing_data.text.apply(lambda x: model.predict_one(text_parser.parse_line(to_vw_format(x)))[0][0]).rename('pred')
evaluate(pd.concat([predictions, testing_data.target.apply(lambda x: mapping[x])], axis=1))
Hello @jackgerrits, Thanks a lot! I will try out the new bindings and see if they solve the problem. Can you give an estimate of when this will be released? Also, how does it work together with issue #4511?
It's already released (0.3.0), so you can try it out.
Since orchestrating the multi-pass is up to the user, the same issue is not present (it was the lack of a loss calculation that caused the early exit). Also, since the loss is not exposed, you need to evaluate the result yourself to determine if and when you can exit multipass early.
The reason the loss is not exposed is that when doing multipass you need a holdout set to calculate it correctly, and it is too easy (imo) to accidentally use progressive validation to "cheat" unintentionally. This all comes about because the user has more control over connecting the pieces together instead of just letting VW run its driver as it would in a traditional CLI use case. So, ideally the user uses a holdout set and knows the metric they care about in order to stop training early.
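For example, a user-managed holdout loop could look roughly like this (a sketch only: the data, the accuracy metric, and the stopping rule are made up for illustration; the API calls are the same ones used in the snippet above):
import vowpal_wabbit_next as vw

# Sketch only: toy data and a made-up stopping rule, using Workspace,
# TextFormatParser, learn_one, predict_one and end_pass as shown above.
model = vw.Workspace(["--loss_function=logistic", "--plt=4"])
parser = vw.TextFormatParser(model)

train_rows = [(1, "red green"), (2, "blue yellow")]  # (class id, text)
holdout_rows = [(1, "red green")]

def holdout_accuracy() -> float:
    correct = 0
    for label, text in holdout_rows:
        pred = model.predict_one(parser.parse_line(f"|text {text}"))
        # PLT predictions may be label ids or (id, score) pairs depending on options
        top = None
        if pred:
            top = pred[0][0] if isinstance(pred[0], tuple) else pred[0]
        correct += int(top == label)
    return correct / len(holdout_rows)

best_acc, stale_passes = 0.0, 0
for i in range(20):
    for label, text in train_rows:
        model.learn_one(parser.parse_line(f"{label} |text {text}"))
    model.end_pass()
    acc = holdout_accuracy()
    if acc > best_acc:
        best_acc, stale_passes = acc, 0
    else:
        stale_passes += 1
        if stale_passes >= 3:  # stop once the holdout metric stops improving
            break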
I am sorry, I meant: can you estimate when it will be part of an official release? I assume the vowpal-wabbit-next package is more like a preview version?
I did try it out, and the error does not occur. I will, however, have to test creating a holdout set. Is it possible to still expose the loss via a parameter?
The vowpal-wabbit-next package is an official release. It's a pre-1.0 package, so there will be some API fluctuations (which are mentioned in the release notes), but in my opinion it is already more robust than the vowpalwabbit package.
How would you propose integrating a holdout set and loss calculation into the package? I left it out because I think it is too easy for a user to accidentally get it wrong, so I would love to hear your ideas.
Hello @jackgerrits, @mwydmuch,
Ah, good to know. I assumed the whole vowpal-wabbit-next package would be a preview package. As it solves the probability and memory issues, the new package could be an option for us. The problem we see is that there is no "--data" option like in the old bindings. Our use case has up to 500,000 documents of considerable text length. Manually running several passes in Python loops with "learn_one" over that many large examples will definitely increase our training time severalfold, which is not an option. A big plus of the old bindings for us was that they simply called the VW CLI and used the much faster C++ code to work on such large datasets.
Maybe there was a misunderstanding regarding the loss. We do not use the loss in any way - we are very happy to use the VW CLI and let it work its holdout and loss magic.
I do not have an idea of how to integrate the holdout and loss calculation into the new package - I was hoping it would be possible to call the CLI as in the current package and skip learn_one by using a multipass training file.
I also have another question regarding the original issue. Is it possible that the memory error is caused by the threshold 0.0 option (which seems to be an unintended way of using PLT) and our large test sets (we are doing k-fold cross-validation after training, which results in test sets of up to 50,000 documents)? I did not really understand where the problem in the current bindings lies - is it not possible to add a tuple return format so that the top_k option works in the current bindings and we can safely ignore the threshold option?
Yeah, so the new Python bindings are built with interactivity in mind, which is why driver functionality is not exposed in the same way it was in the old bindings.
However, the ability to run the CLI driver in a way very similar to how it worked in the old bindings is available, which seems to be what you require for your training. The function you are looking for is run_cli_driver.
Below is an update to the previous code snippet that relies on the C++ driver instead of a driver implemented in Python; you should see CLI-level performance from this call.
This is the previous snippet updated to use the CLI API:
import vowpal_wabbit_next as vw
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
import os
def to_vw_format(text: str, label=None) -> str:
    if label is None:
        label = ''
    return f'{label} |text {text}'

def evaluate(labels: pd.DataFrame):
    print(f"precision: {precision_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"recall: {recall_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    print(f"f1: {f1_score(labels['target'], labels['pred'], average='weighted', zero_division=0)}")
    true_positives = labels[labels.columns.difference(['target'])].apply(lambda x: x == labels['target']).sum()
    print(f"accuracy: {true_positives['pred'] / len(labels)}")
data = pd.read_csv('reddit_data_sample.csv')
data.text = data.text.apply(lambda x: x * 20)  # repeat each text 20x so the examples are long
target_var = 'target'
training_var = 'text'
cleaned_targets = data[target_var].dropna()
unique_labels = cleaned_targets.unique().tolist()
num_classes = len(unique_labels)
numbers = list(range(num_classes))
mapping = dict(zip(unique_labels, numbers))
training_data = data.sample(frac=0.9, random_state=25)
testing_data = data.drop(training_data.index)
training_data = training_data.dropna(subset=[training_var]).sample(frac=1).reset_index(drop=True)
vw_training_file_name = 'train.vw'
with open(vw_training_file_name, "wb") as f:
    for text, label in zip(training_data[training_var], training_data[target_var]):
        vw_label = mapping[label]
        example = to_vw_format(text, vw_label) + ' \n'
        f.write(example.encode())
os.makedirs('model', exist_ok=True)
params = {
    'loss_function': 'logistic',
    'data': vw_training_file_name,
    'cache': True,
    'kill_cache': True,
    'final_regressor': 'model/model.vw',
    'compressed': True,
    'plt': num_classes,
    'bit_precision': 30,
    'passes': 50,
    'example_queue_limit': 256,
    'learning_rate': 0.5,
}
cmdline = [f"--{k}" if isinstance(v, bool) else f"--{k}={v}" for k, v in params.items()]
print(vw.run_cli_driver(cmdline))
learned_model = open('model/model.vw', 'rb').read()
params = {
    'loss_function': 'logistic',
    'probabilities': True,
    'top_k': 1,
}
cmdline = [f"--{k}" if isinstance(v, bool) else f"--{k}={v}" for k, v in params.items()]
model = vw.Workspace(cmdline, model_data=learned_model)
text_parser = vw.TextFormatParser(model)
predictions = testing_data.text.apply(lambda x: model.predict_one(text_parser.parse_line(to_vw_format(x)))[0][0]).rename('pred')
evaluate(pd.concat([predictions, testing_data.target.apply(lambda x: mapping[x])], axis=1))
I also have another question regarding the original issue. Is it possible that the memory error is caused by the threshold 0.0 option (which seems to be an unintended way of using PLT) and our large test sets (we are doing k-fold cross-validation after training, which results in test sets of up to 50,000 documents)?
It is certainly possible, as it would further increase memory usage. It's still unclear to me whether you are hitting a leak or just large memory usage due to the scale (number of classes and number of bits used).
I did not really understand where the problem in the current bindings lies - is it not possible to add a tuple return format so that the top_k option works in the current bindings and we can safely ignore the threshold option?
It's certainly fixable, but changing it could be considered a breaking change, which is why I wanted to explore other avenues too. There is a fixed mapping between the "type" of prediction and how it makes it into Python, and in the old bindings that mapping has the issue of trying to flatten the action_scores type, which PLT uses.
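To make that concrete, here is a plain-Python illustration (not the actual binding code) of why the flattening is harmless for dense results but loses the class ids for sparse PLT results:
# Illustration only: flattening drops the class ids that sparse PLT output needs.
dense = [(0, 0.10), (1, 0.70), (2, 0.20)]  # one (class, score) pair per class
sparse = [(153, 0.753), (614, 0.542)]      # thresholded / top-k PLT output

def flatten(scores):
    return [score for _, score in scores]

print(flatten(dense))   # [0.1, 0.7, 0.2]  -> position still identifies the class
print(flatten(sparse))  # [0.753, 0.542]   -> the class ids 153 and 614 are lost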
@FabianKaiser @jackgerrits The error seems to be unrelated to the number of bits used, since much smaller bit values cause the same error at the same point. However, it is certainly related to some memory management around action_scores. Changing the amount of data per instance in the provided example from data.text = data.text.apply(lambda x: x * 20) to, e.g., data.text = data.text.apply(lambda x: x * 10) causes the error to occur not after the first data point but after a few. The number of predicted scores (controlled by the threshold or top_k parameter) also affects whether the memory error occurs earlier or later (sometimes it occurs only when the model object is deleted).
@FabianKaiser Because of the above, I wouldn't recommend using the old bindings. vowpal-wabbit-next seems to be a good replacement and workaround for the moment. Please note, however, that the current version of vowpal-wabbit-next builds against ~VW 9.7.0, so it doesn't contain the fix from #4511. You may want to manually update/build against the VW master branch.
@jackgerrits Just to let you know, at the moment I'm not planning to try to fix it, as it seems to be not a problem with the PLT reduction implementation in C++ directly, but rather with the Python bindings. Btw, this vowpal-wabbit-next is super cool, I was always missing this kind of flexibility with VW!
Not a problem at all! And I am very glad you like it!
As this will not be adapted here and already works in vowpal-wabbit-next, I will close the issue. Thanks a lot!
Describe the bug
When using the new PLT --probabilities option with long input texts, the model simply stops after some predictions. Azure ML shows an error (e.g. malloc(): invalid size (unsorted)), while the shell just stops while predicting.
This does not occur with all data; the relevant requirement seems to be long input data.
reddit_data_sample.csv
How to reproduce
Version
9.7.0
OS
Linux
Language
Python
Additional context
No response