dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

TypeError when saving a model with `numpy.bool_` types #424

Closed nishaq503 closed 1 year ago

nishaq503 commented 1 year ago

numpy.bool_ types are not being correctly serialized to JSON.

What is the current behavior? The ComplexEncoder class does not handle numpy.bool_, which is not JSON serializable. This raises a TypeError when saving certain models.

If the current behavior is a bug, please provide the steps to reproduce.

model = TabNetClassifier(...)
model.fit(...)  # training data and model parameters contain values of type numpy.bool_
model.save_model('path/to/model')

Expected behavior: numpy.bool_ should be cast to Python's bool before being serialized to JSON. Here is my suggested fix. Please let me know if this is acceptable for a PR:

class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        if isinstance(obj, np.bool_):
            return bool(obj)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)
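
A quick way to check the suggested fix outside of tabnet is sketched below. It is only an illustration under stated assumptions: the encoder is a standalone copy of the class above (not imported from pytorch_tabnet), and the `params` dictionary is a made-up stand-in for `saved_params`.

```python
import json

import numpy as np


class ComplexEncoder(json.JSONEncoder):
    """Standalone copy of the suggested encoder, for illustration only."""

    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        if isinstance(obj, np.bool_):
            return bool(obj)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)


# Hypothetical stand-in for the saved_params dictionary.
params = {"flag": np.bool_(True), "count": np.int64(3)}

# The stock encoder rejects numpy.bool_ ...
try:
    json.dumps(params)
    stock_encoder_failed = False
except TypeError:
    stock_encoder_failed = True

# ... while the patched encoder casts it to a plain bool first.
encoded = json.dumps(params, cls=ComplexEncoder)
print(stock_encoder_failed, encoded)
```

The same `cls=ComplexEncoder` argument is what `json.dump` receives in `save_model`, as the stack trace below shows.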

Other relevant information:
poetry version: "poetry-core>=1.0.0"
python version: "^3.9"
Operating System: "Linux Kernel 5.18.14-arch1-1"
Additional tools: CUDA Version: 11.7, Driver Version: 515.57

Additional context

Here's a stacktrace:

  File ".venv/lib/python3.10/site-packages/pytorch_tabnet/abstract_model.py", line 375, in save_model
    json.dump(saved_params, f, cls=ComplexEncoder)
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File ".venv/lib/python3.10/site-packages/pytorch_tabnet/utils.py", line 339, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bool_ is not JSON serializable

I ran into this when trying tabnet in a kaggle competition. If you need to, you can look here in my code where the error happens.

Optimox commented 1 year ago

Thanks for your contribution.

Could you please explain when the error occurs and when it does not? I don't understand how there can sometimes be a numpy.bool_ and sometimes a Python bool.

Training data won't change the model weights or architecture, so when does this occur?

nishaq503 commented 1 year ago

Thanks for the quick response. I hope the following answers your questions. I am happy to give more clarification and answer more questions that you might have.

When I save a model, at the point where the model_params.json file is written, it seems that some model parameters in the saved_params dictionary are of type numpy.bool_, which cannot be serialized to JSON.

I know that python's bool can be serialized so I tried adding that extra if statement in the ComplexEncoder class. It worked and so I suggested it as a fix.

I don't know how, or even if, the training data can cause model parameters to take on the numpy.bool_ type. The original data were a combination of string, floating-point, integer, boolean, and categorical types. I preprocessed and encoded all training data to be of numpy.float16 dtype before feeding it to the model. The choice of numpy.float16 was mostly for memory reasons, as the data are quite large.

Optimox commented 1 year ago

Ok thank you I'll look into it! @eduardocarvp any idea on when this could happen ?

eduardocarvp commented 1 year ago

I had a quick look, but I don't know where this can be coming from... I don't see how training data could change the weights either. Have you made any changes to the model/architecture at all?

Anyway, I agree with the fix, but it would be good to know why it's happening. I will have a deeper look later.

nishaq503 commented 1 year ago

I didn't change any of the internal structure of the model. If it helps, here is the set of input parameters I used:

model = TabNetClassifier(
    n_d=32,
    n_a=32,
    n_steps=3,
    gamma=1.3,
    n_independent=2,
    n_shared=2,
    momentum=0.02,
    lambda_sparse=1e-3,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3, weight_decay=1e-3),
    scheduler_fn=torch.optim.lr_scheduler.CosineAnnealingWarmRestarts,
    scheduler_params={
        'T_0': 5,
        'eta_min': 1e-4,
        'T_mult': 1,
        'last_epoch': -1,
    },
    mask_type='entmax',
    seed=cfg.seed,
)

model.fit(
    numpy.array(train_x),
    numpy.array(train_y.values.ravel()),
    eval_set=[(numpy.array(valid_x), numpy.array(valid_y.values.ravel()))],
    max_epochs=128,
    patience=10,
    batch_size=1024,
    eval_metric=['auc', 'accuracy', AmexMetric],
    from_unsupervised=unsupervised_model,
)

AmexMetric is a custom metric that, partially, relies on computing an AUC-ROC score. I added the unsupervised_model after making that change to save the model.

Optimox commented 1 year ago

@nishaq503,

Is there any chance you could share a Kaggle notebook that reproduces your error?

How come this notebook https://www.kaggle.com/code/medali1992/amex-tabnetclassifier-feature-eng-0-791 seems to be working just fine?

damvantai commented 1 year ago

You can use torch to save and load the model!

import torch
torch.save(clf_model, "./model_1")

To load:

clf_model = torch.load("./model_1")

Optimox commented 1 year ago

@damvantai no, it's better to use the built-in method and do clf.save_model("your/path")

andreas-wolf commented 1 year ago

I have the same problem with an int8:

Object of type int8 is not JSON serializable

It's also raised from the ComplexEncoder. It seems to come from {'preds_mapper': {'0': 0, '1': 1}} where the values 0 and 1 have the type np.int8 (apparently because my target variable is an int8 like the OP seems to use a bool for their target).

So as a workaround for the time being one could cast the target variable to np.int64 which seems to be the only np.intX ComplexEncoder can encode right now.
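
The workaround can be sketched as follows. This is an illustration rather than tabnet code: `y_train` is a made-up target array, and `np.unique` is used here only to show which scalar type the class labels carry before and after the cast.

```python
import numpy as np

# Made-up int8 target, standing in for the real training labels.
y_train = np.array([0, 1, 1, 0], dtype=np.int8)

# Cast to int64 before calling fit(), since np.int64 is the only numpy
# integer type the current ComplexEncoder handles.
y_train_cast = y_train.astype(np.int64)

# The unique labels keep the dtype of the array they come from.
label_type_before = type(np.unique(y_train)[0])
label_type_after = type(np.unique(y_train_cast)[0])
print(label_type_before.__name__, label_type_after.__name__)
```

The same `astype(np.int64)` call also works for a boolean target, at the cost of the labels becoming 0 and 1 instead of False and True.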

Optimox commented 1 year ago

thx @andreas-wolf, does this happen in the AMEX competition as well? What environment are you using?

andreas-wolf commented 1 year ago

@Optimox Hi. I don't know if that happens in the AMEX competition, but I guess so, since the json encoding is not working for dtypes other than np.int64.

Sorry for not being clear enough in my description of the problem. I've therefore attached a minimal working example that triggers the bug.

As I said, the problem is that y_train, i.e. the target variable, is of type bool (or np.int8 in my case), and you're only handling np.int64 in ComplexEncoder: https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L338

https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L336-L341

  import os
  import wget
  import pandas as pd
  import numpy as np
  from pathlib import Path
  from pytorch_tabnet.tab_model import TabNetClassifier
  url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
  dataset_name = 'census-income'
  out = Path(os.getcwd()+'/data/'+dataset_name+'.csv')
  out.parent.mkdir(parents=True, exist_ok=True)
  if out.exists():
      print("File already exists.")
  else:
      print("Downloading file...")
      wget.download(url, out.as_posix())
  features = ['39', ' 77516', ' 13']
  train = pd.read_csv(out)
  train = train[features + [' <=50K']]
  train['target'] = train[' <=50K'] == ' <=50K'  # values keep the leading space, like the column names
  train = train.drop(columns=[' <=50K'])
  if "Set" not in train.columns:
      train["Set"] = np.random.choice(["train", "valid", "test"], p=[.8, .1, .1], size=(train.shape[0],))

  train_indices = train[train.Set=="train"].index
  valid_indices = train[train.Set=="valid"].index
  test_indices = train[train.Set=="test"].index

  X_train = train[features].values[train_indices]
  y_train = train['target'].values[train_indices]

  X_valid = train[features].values[valid_indices]
  y_valid = train['target'].values[valid_indices]

  X_test = train[features].values[test_indices]
  y_test = train['target'].values[test_indices]

  clf = TabNetClassifier()
  clf.fit(X_train=X_train, y_train=y_train, max_epochs=2)

  saving_path_name = "./tabnet_model_test_1"
  saved_filepath = clf.save_model(saving_path_name)

ShihHsuanChen commented 1 year ago

https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L336-L341

How about replacing lines 338-339 with:

         if isinstance(obj, (np.generic, np.ndarray)): 
             return obj.tolist()

It seems that only the TabNetClassifier object has this problem. The types of the values in {'preds_mapper': {'0': 0, '1': 1}} are determined by the labels the user passes when calling TabNetClassifier.fit. Using the numpy method tolist() solves all similar problems, not only for np.bool_ but also for np.int32 and other numpy generic types. Alternatively, it may be better to convert the train_labels values to JSON-compatible types before assigning them to preds_mapper.
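
As a rough sketch of the first option, the encoder below applies the `tolist()` idea to a few numpy scalar types. The class name `NumpyEncoder` and the sample dictionary are made up for illustration; this is not the code in pytorch_tabnet.

```python
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """Hypothetical encoder name; mirrors the two-line proposal above."""

    def default(self, obj):
        # np.generic is the base class of all numpy scalar types
        # (bool_, int8, uint8, int32, ...); ndarray covers whole arrays.
        if isinstance(obj, (np.generic, np.ndarray)):
            return obj.tolist()
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)


# Made-up mapping with the kinds of values reported in this thread.
preds_mapper = {"0": np.int8(0), "1": np.uint8(1), "2": np.bool_(True)}
payload = {"preds_mapper": preds_mapper, "arr": np.arange(3)}

encoded = json.dumps(payload, cls=NumpyEncoder)
decoded = json.loads(encoded)
print(decoded)
```

One `isinstance` check replaces a growing list of per-type branches, which is the main appeal of this variant.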

https://github.com/dreamquark-ai/tabnet/blob/cab643b156fdecfded51d70d29072fc43f397bbb/pytorch_tabnet/tab_model.py#L45-L64
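
The second option, converting the labels before they reach the mapping, might look roughly like this. It is a sketch under assumptions: the dictionary layout copies the `preds_mapper` example shown in this thread, `y_train` is made up, and how tabnet builds the mapping internally is not reproduced here.

```python
import json

import numpy as np

# Made-up boolean target, like the one in the original report.
y_train = np.array([True, False, True])

# np.unique returns numpy scalars; .item() turns each one into the
# matching built-in Python type (bool, int, float, ...).
classes = np.unique(y_train)
preds_mapper = {str(index): label.item() for index, label in enumerate(classes)}

# With only built-in types left, the stock JSON encoder is enough.
encoded = json.dumps(preds_mapper)
print(encoded)
```

With this approach the encoder needs no numpy-specific branches at all, because nothing numpy-typed is ever stored.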

rafamarquesi commented 1 year ago

I had the same problem with uint8. I followed @ShihHsuanChen's tip, changed lines 338~339, as he mentioned, and it worked for me.

Optimox commented 1 year ago

thanks I'll fix this soon

gauravbrills commented 1 year ago

When will this be solved? :| @Optimox, any timeline? Also, is there any workaround for this?

Optimox commented 1 year ago

I don't have a timeline to share. I think making sure during training that the target column has type int instead of a numpy integer type (np.intX) should solve the problem. I never had this problem, to be honest.

gauravbrills commented 1 year ago

Ahh, I did try that [I think I did that based on the discussion thread], but yes, we do have some conversions in between. For now I had tried a joblib dump as a workaround.

gauravbrills commented 1 year ago

Thanks @Optimox, the above comment solved my issue. I re-converted the label types, which I had been shrinking to save space.