MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Input contains NaN, infinity or a value too large for dtype('float64') #105

Closed. paulomann closed this issue 1 year ago

paulomann commented 1 year ago

Description

I am trying to run the Google Colab example provided in the repo README. The only change I made was the dataset: I load a custom dataset in .tsv format with load_custom_dataset_from_folder(). With a small vocabulary (39 words) the algorithm ran without problems, but with a "big" vocabulary (7894 words) I got an error from sklearn/utils/validation.py, shown under "What I Did" below.

Also, note that my dataset is split into train (70%), val (10%) and test (20%).

What I Did

Current call:  0
Current call:  1
Current call:  2
Current call:  3
Current call:  4
Current call:  5
Current call:  6
Current call:  7
Current call:  8
Current call:  9
Current call:  10
Current call:  11
Current call:  12
Current call:  13
Current call:  14
/usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-31-2dff32b1aedb>] in <cell line: 2>()
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

10 frames
[/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py] in _assert_all_finite(X, allow_nan, msg_dtype)
    101                 not allow_nan and not np.isfinite(X).all()):
    102             type_err = 'infinity' if allow_nan else 'NaN, infinity'
--> 103             raise ValueError(
    104                     msg_err.format
    105                     (type_err,

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
paulomann commented 1 year ago

I also tried running locally, with a different version and environment:

Octis: 1.10.2
Python: 3.7.3
OS: Linux

This gave the full traceback below. On inspection, f_val had a value of -inf.

Traceback (most recent call last):
  File "/home/paulomann/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/paulomann/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/paulomann/.vscode-server/extensions/ms-python.python-2023.12.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/paulomann/workspace/reddit-topic-modelling/octis_training/training_and_optimization.py", line 102, in <module>
    model_runs=5, plot_best_seen=True) # number of runs of the topic model
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 160, in optimize
    results = self._optimization_loop(opt)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 288, in _optimization_loop
    res = opt.tell(next_x, f_val)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/skopt/optimizer/optimizer.py", line 493, in tell
    return self._tell(x, y, fit=fit)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/skopt/optimizer/optimizer.py", line 536, in _tell
    est.fit(self.space.transform(self.Xi), self.yi)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 304, in fit
    accept_sparse="csc", dtype=DTYPE)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 805, in check_X_y
    ensure_2d=False, dtype=None)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 645, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home/paulomann/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
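
For anyone hitting the same error: the traceback shows the exception being raised when f_val is passed through opt.tell() to the surrogate model's fit(), where sklearn rejects any non-finite target. A rough sketch of a check that could be run on the objective values before they reach that point is given below. The helper name first_non_finite is hypothetical and not part of OCTIS, skopt, or sklearn; it uses only the standard library.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN or +/-inf value, or None if all are finite.

    Hypothetical debugging helper: `values` stands in for the objective
    values (such as f_val) collected across optimization calls.
    """
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


# Illustrative values only: the second call produced -inf, as f_val did here.
scores = [0.12, float("-inf"), 0.08]
bad_call = first_non_finite(scores)
if bad_call is not None:
    print(f"Non-finite objective value at call {bad_call}: {scores[bad_call]}")
```

Knowing which call first produced a non-finite value makes it easier to inspect the hyperparameters and topics of that specific run.
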
paulomann commented 1 year ago

It was related to a topic that was absent in the dataset. Due to a bug, my vocabulary contained words that did not occur in the primary dataset.
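
A minimal sketch of a sanity check for this kind of mismatch, assuming the documents and the vocabulary are already loaded as Python lists and that documents are tokenized on whitespace. The function name find_unused_vocabulary is hypothetical and not part of OCTIS; the sample data is made up for illustration.

```python
def find_unused_vocabulary(documents, vocabulary):
    """Return the vocabulary words that occur in none of the documents.

    Assumes whitespace tokenization; adapt the split if the corpus was
    preprocessed differently.
    """
    seen = set()
    for document in documents:
        seen.update(document.split())
    return sorted(set(vocabulary) - seen)


# Illustrative data: "bird" is in the vocabulary but in no document.
documents = ["the cat sat", "the dog ran"]
vocabulary = ["cat", "dog", "bird"]
unused = find_unused_vocabulary(documents, vocabulary)
print(unused)
```

An empty result means every vocabulary word appears at least once in the corpus; anything else is worth fixing before running the optimizer.
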