crate / cratedb-examples

A collection of clear and concise examples of how to work with CrateDB.

AutoML: CI trips with `CellTimeoutError` / `ValueError: Input contains NaN.` #170

Closed. amotl closed this issue 6 months ago.

amotl commented 9 months ago

Dear @andnig,

The CI caught an error from automl_timeseries_forecasting_with_pycaret.py ^1.

FAILED test.py::test_file[automl_timeseries_forecasting_with_pycaret.py] - ValueError: Input contains NaN.

Apparently, it started tripping like this only yesterday ^2, so it is likely the error is related to changed input data.

However, the result of debugging this error may well converge into a corresponding issue at PyCaret, given how much it promises to handle automatically. On the other hand, the code may just need an additional data cleansing step to accommodate the situation. May I ask you to have a look?

With kind regards, Andreas.

amotl commented 9 months ago

It is likely the error is related to changed input data.

Thinking about it once more, it is more likely that some dependency library of PyCaret was not pinned correctly, and that something changed in this area.

andnig commented 9 months ago

@amotl All dependencies are pinned except the crate sqlalchemy one, so we can assume it's not related to PyCaret itself. PyCaret automatically interpolates NaN values, except if there are ONLY NaN values (or no values at all), which might indicate an issue with the testing infrastructure, the connection, or the database. Before I dig out my debug-rod: are there any changes to either the test runner or crate sqlalchemy that come to your mind which might prevent reading data via pandas?
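
For illustration, a minimal sanity check along these lines might look as follows. It is only a sketch: the connection string and table name are placeholders, and the goal is merely to distinguish "data not there" from a genuine modelling problem before handing the frame to PyCaret.

```python
import pandas as pd
import sqlalchemy as sa

# Placeholder connection string and table name; adjust to the test environment.
engine = sa.create_engine("crate://localhost:4200")
df = pd.read_sql("SELECT * FROM sales_data", engine)

# Fail early with a clear message when the query returned nothing, or when a
# column is entirely NaN -- the two situations PyCaret's interpolation cannot fix.
if df.empty:
    raise ValueError("No rows returned from CrateDB; check connectivity and test fixtures")
all_nan_columns = [col for col in df.columns if df[col].isna().all()]
if all_nan_columns:
    raise ValueError(f"Columns containing only NaN values: {all_nan_columns}")
```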

amotl commented 9 months ago

All dependencies are pinned except the crate sqlalchemy one. We can assume it's not related to pycaret itself.

That's true, but I am talking about transitive dependencies of PyCaret. I think that is the most likely reason, but of course it could also be something else.

andnig commented 9 months ago

Hours and hours of debugging into the dependencies of PyCaret, and googling the term "transitive dependencies" - just to find that the test was still running on Python 3.10. The life of a developer is fun 😄 https://github.com/crate/cratedb-examples/actions/runs/7036786822/job/19150177672?pr=171

Can you confirm that the test is green?

To be honest, I'm not sure if this issue is really resolved yet, as the PyCaret timeseries notebook test was always green, but the script version of it failed. Smells like a flaky test or environment. Let's monitor the situation - but as the PR's test is green for now, I will not invest more time at this point. Is that good for you?

amotl commented 9 months ago

Thank you very much for your efforts. Sure, let's merge the PR, close this issue, and keep monitoring the situation for similar events in the future.

amotl commented 9 months ago

I am just re-running the most recently failed run https://github.com/crate/cratedb-examples/actions/runs/7027445018, in order to rule out that it is related to the time of day when the test is executed.

If it fails again, it is likely that the upgrade to Python 3.11 resolved the situation in one way or another, and that your debugging efforts had a positive outcome.

amotl commented 9 months ago

Aha, it is green again, so it was actually just a fluke. However, it is an interesting one which can also easily hit production applications, depending on what the actual root cause was.

andnig commented 9 months ago

This is related to how the tests in our repo here are designed. The model training pipeline itself is not of concern - see some of the reasons why this error happens above. I know this error quite well from my own projects - it happens if the data is not available as expected.

amotl commented 9 months ago

Hi again.

I think the actual root cause for this is the venerable CellTimeoutError, i.e. the notebook simply creates too much system load; see, for example, ^1:

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

^^ Do you see any chance to make this spot more efficient on CI, @andnig?

With kind regards, Andreas.

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870
-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2872
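
For the record, if the heavy setup() call cannot be made cheaper, another option would be to raise the per-cell timeout of the notebook runner. Below is a minimal sketch using nbclient directly; the actual test harness in this repository may configure the timeout through different means, and the notebook path is only a placeholder.

```python
import nbformat
from nbclient import NotebookClient

# Execute the notebook with a more generous per-cell timeout than the
# 300 seconds currently tripping on CI.
nb = nbformat.read("automl_timeseries_forecasting_with_pycaret.ipynb", as_version=4)
client = NotebookClient(nb, timeout=900, kernel_name="python3")
client.execute()
```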

amotl commented 9 months ago

Another occurrence of the venerable CellTimeoutError. It also happens on a setup() call, but this time on a different one.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           from pycaret.classification import setup, compare_models, tune_model, ensemble_model, blend_models, automl, \
E               evaluate_model, finalize_model, save_model, predict_model
E           
E           s = setup(
E               data,
E               target="Churn",
E               ignore_features=["customerID"],
E               log_experiment=True,
E               fix_imbalance=True,
E           )
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2746

amotl commented 9 months ago

We found that the reason for this was mainly a misconfiguration of the MLFLOW_TRACKING_URL. It has been fixed as part of GH-174, until further notice. Thanks for your support, @andnig!

amotl commented 7 months ago

Hi again. This issue is still present, and is constantly haunting us, which is unfortunate.

The most recent occurrence, just about two hours ago, happened after we tried to re-schedule the corresponding job to run during daytime, as we figured that would work better. Turns out, it doesn't help.

Now, looking a bit closer at the error output, I am also spotting this warning:

  /opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
  STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

  Increase the number of iterations (max_iter) or scale the data as shown in:
      https://scikit-learn.org/stable/modules/preprocessing.html
  Please also refer to the documentation for alternative solver options:
      https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

-- https://github.com/crate/cratedb-examples/actions/runs/7854158803/job/21434611151#step:6:1155

Could that actually be related to the job occasionally (50/50) stalling/freezing/timing out?
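
For context, the warning itself points at two standard remedies: scale the input features, or give lbfgs more iterations. PyCaret constructs its models internally, so the following is only an illustration of what the warning refers to, on synthetic data, not a change to the example code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately unscaled data standing in for the real churn dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20)) * 1000
y = rng.integers(0, 2, size=1000)

# Remedy 1: scale the features; remedy 2: allow more lbfgs iterations.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
```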

andnig commented 7 months ago

Hey Andreas, happy to chime in. :wave:

  1. Please think about separating these two topics for more clarity. CellTimeout and input NaN are mostly two separate issues, if they both still occur. CellTimeout is more often than not related to the Jupyter test runner (or, well, simple timeouts), while input NaN can mean multiple things; the more common causes are data that is not there, infrastructure that is about to get killed, or training iterators running amok.

  2. As your test infrastructure is quite limited in terms of CPU power, we added the PYTEST_CURRENT_TEST env variable check, which only runs 3 models that are also rather fast to train. If I remember correctly, we used two ETS model variants and a naive one. From the logs you shared, it seems however that all the models are trained. (Also, the non-convergence error is related to a model which we excluded from test runs.)

I would suggest utilizing the PYTEST_CURRENT_TEST environment variable for both the ipynb and the py tests, to reduce training time and potentially solve both issues related to how you test these notebooks; a sketch of such a guard is shown below. Please just make sure that the env vars are "visible" to the Jupyter notebooks as well. The exact configuration depends on which Jupyter test runner you use.
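
A minimal, self-contained sketch of such a guard, with synthetic data standing in for the DataFrame the notebook reads from CrateDB; the lightweight model IDs are only examples, not necessarily the exact ones pinned in the repository:

```python
import os
import numpy as np
import pandas as pd
from pycaret.time_series import setup, compare_models

# Synthetic monthly sales data standing in for the frame read from CrateDB.
data = pd.DataFrame({
    "month": pd.period_range("2019-01", periods=120, freq="M"),
    "total_sales": np.random.default_rng(7).normal(1000, 100, 120).cumsum(),
})

s = setup(data, fh=15, target="total_sales", index="month")

if "PYTEST_CURRENT_TEST" in os.environ:
    # pytest sets this variable for the running test; restrict training to a
    # few lightweight estimators to keep CI runtime and memory in check.
    best = compare_models(include=["ets", "exp_smooth", "naive"])
else:
    # Outside of the test runner, compare the full PyCaret model zoo.
    best = compare_models()
```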

I hope this helps so far; let me know how it goes.


PS: You mentioned that the tests fail 50/50, but at a quick glance I was only able to find 2 failed tasks. Would you mind checking whether the input NaN failures always occur on notebook tests, or also on .py file tests?

amotl commented 7 months ago

Hi Andreas, thanks for your quick reply.

From the logs you shared it seems however that all the models(!) are trained, [while we intended to only run a few of them]. [I can] also [spot] a non-converge error, which is related to a model which we excluded for test runs. [Most probably, PYTEST_CURRENT_TEST is not getting evaluated properly.] Please just make sure that the env vars are "visible" for the jupyter notebooks as well.

That's to the point. I also had the suspicion that the measures we took last time to bring down the required compute resources did not work well, or had flaws, but I have not analyzed the log output regarding this topic yet. So, if you think this is the issue that is still tripping us, I now have something to hold on to and investigate. Thank you so much!

With kind regards, Andreas.

amotl commented 7 months ago

Hi again. We've explored the situation, and the outcome is that we can confirm that the call to compare_models works well, including its guard using a corresponding if "PYTEST_CURRENT_TEST" in os.environ clause.

I wouldn't know why it should be different on GHA. So, maybe the selected algorithms ["arima", "ets", "exp_smooth"] / ["ets", "et_cds_dt", "naive"] are still too heavy on CPU and/or memory?

andnig commented 7 months ago

Hi Andreas, if you look at the logs, it's not a timeout error, it's the NaN input error. As mentioned above, I'd suggest keeping these two issues separated. The timeout issue is most probably related to the Jupyter test runner. This input NaN error, however, is not related to Jupyter.

If I look at the failed run, I see that the esm model has an incredibly high MASE and RMSSE. This mostly indicates that the model is not very well suited for the data. I suggested it as it is very lightweight, but, well, too lightweight as it seems :sweat:

[Screenshot: PyCaret model comparison leaderboard from the failed run, showing the exceptionally high MASE and RMSSE values.]

To go forward, you could:

  1. Use a different model for the test run, one with a lower MASE. Run the whole PyCaret model suite locally and select one of the top 5 models instead of the exp_smooth one for your test run; see the sketch after this list.
  2. If this does not help, can you provide some local reproduction steps? If you can reproduce it locally, I'm better able to help.
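
For the first suggestion, the local comparison could look roughly like this; `data` is the sales DataFrame the notebook reads from CrateDB (or the synthetic stand-in from the sketch further above), and the leaderboard returned by pull() lists MASE and RMSSE per model:

```python
from pycaret.time_series import setup, compare_models, pull

s = setup(data, fh=15, target="total_sales", index="month")

# Cross-validate the full model suite locally, then inspect the leaderboard
# sorted by MASE to pick a light but well-fitting candidate for the CI include list.
best = compare_models(sort="MASE")
leaderboard = pull()
print(leaderboard.head(5))
```
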
amotl commented 7 months ago

Thanks, and sorry that I mixed up those two different errors again. I've split them out into separate issues now, so this one can be closed after carrying over the relevant information.

amotl commented 6 months ago

After splitting the issue up into different tickets, but without applying any other fixes, we are currently not facing any problems on nightly runs of the corresponding CI jobs.

Therefore, I am closing the issue now, for the time being. Thanks again, @andnig!