LSSTDESC / rail_tpz

RAIL-wrapped version of a "lite" version of Matias Carrasco-Kind's TPZ tree-based photo-z code
MIT License

RecursionError: maximum recursion depth exceeded #19

Open hdante opened 2 months ago

hdante commented 2 months ago

Hello, when executing the inform pass of rail_tpz with a certain input file, I've received the following error:

(base) [henrique.almeida@loginapl01 henrique.almeida]$ rail-train -a tpz train3.hdf5 estimator_tpz.new.pkl 
Start: 2024-04-24 19:02:14.130021
Estimator algorithm: tpz
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Starting setup.
Loading all program modules...
Configuring trainer...
Loading input file...
Setup done.
Starting training.
self._parallel is mpi, number of processors we will use is 1
using native TPZ decision trees
creating 8 random realizations...
making a total of 40 trees for 8 random realizations * 5 bootstraps
0 - 40 -------------> to core 0
making 1 of 40...
Traceback (most recent call last):
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-train", line 182, in <module>
    if __name__ == '__main__': main()
                               ^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-train", line 173, in main
    train(cfg, ctx)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-train", line 162, in train
    ctx.trainer.inform(ctx.input)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/informer.py", line 65, in inform
    self.run()
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/tpz_lite.py", line 261, in run
    T = TPZ.Rtree(traindata.X, traindata.Y, forest='yes',
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 429, in __init__
    self.root = build(X, Y, 0)
                ^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 426, in build
    right=build(XD[wR], YD[wR], depth + 1))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 426, in build
    right=build(XD[wR], YD[wR], depth + 1))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 426, in build
    right=build(XD[wR], YD[wR], depth + 1))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 982 more times]
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 416, in build
    td, sp, tvar = best_split(XD[:, myD], YD, minleaf)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 57, in best_split
    Ssplit, vsplit = split_point(values, y_split, minleaf)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/ml_codes/TPZ.py", line 29, in split_point
    stemp = np.var(yvals[0:nin]) * nin + np.var(yvals[nin:]) * (Ntot - nin)
            ^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 3787, in var
    return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/numpy/core/_methods.py", line 162, in _var
    with _no_nep50_warning():
         ^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/contextlib.py", line 289, in helper
    return _GeneratorContextManager(func, args, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/contextlib.py", line 105, in __init__
    self.gen = func(*args, **kwds)
               ^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

Test file train3.hdf5 is attached. train3.hdf5.gz

sschmidt23 commented 2 months ago

It looks like you are hitting the python recursion limit for function calls, which I think is 999, so that's a pretty deep tree. It also looks like you're running with the "native" TPZ, which uses the custom tree code originally written for TPZ. That code defines the tree recursively, is not very optimized, and can be extremely slow to set up the random forest model. I think the two options are: 1) raise the python recursion limit with something like

```python
import sys
sys.setrecursionlimit(2000)  # or some other number higher than 999
```

2) Switch to the sklearn tree strategy in TPZ that I added by setting the config param tree_strategy='sklearn', which will use sklearn's DecisionTreeRegressor rather than the custom recursive functions in TPZ. This is 5,000-10,000 times faster to train and should produce similar results (it averages over the objects in the terminal leaf node rather than returning all of them in the redshift histogram, so you get smoother PDFs with fewer bumps as a result). If you are trying to duplicate runs of the "original" TPZ code then you obviously can't do this, though, as it will produce slightly different results.
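For illustration, a hedged sketch of option 2 on synthetic data: this calls sklearn's DecisionTreeRegressor directly (the tree_strategy='sklearn' config knob itself lives in rail_tpz), and the magnitudes and redshifts below are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 6))    # stand-in for magnitudes/colors
y = rng.uniform(0, 3, size=2000)  # stand-in for redshifts

# sklearn builds its trees iteratively, so a deep tree never hits
# Python's recursion limit; min_samples_leaf plays the role of TPZ's
# minleaf parameter.
tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)
print(tree.get_depth())
```

Unlike the recursive native code, an arbitrarily deep sklearn tree is only bounded by memory, not by the interpreter's call-stack limit.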

hdante commented 2 months ago

Hello, Sam, no, I don't think that's it: I tried a run with the recursion limit set to 100,000 and it resulted in the same error: maximum recursion depth exceeded. Either the tree height has exploded, or there's an infinite loop.

More important is knowing the bound on the recursion given the input data: how does the height of the tree vary, so that I can choose a suitable recursion limit without having to keep increasing it later?

I can't switch the tree strategy algorithm. More specifically, I'm just writing the script; it's up to the researchers to choose their desired configuration, even if I provide a default one (unless the goal is for the two algorithms to converge to the same result, or there's an explicit goal of replacing the original TPZ algorithm with the sklearn one).

sschmidt23 commented 2 months ago

Hmm, it could be a bug in the code; the tree code in TPZ is custom-written and not very optimized, though I have run it quite a few times and never encountered an error like this before. There are several configuration parameters that partially control the building of the trees: minleaf sets the minimum number of galaxies in the terminal leaf of the tree, so increasing minleaf should make for a shallower tree (the default value is 5). natt sets the number of attributes that the tree splits on, so you can use this to adjust how the splits are made.

How big is your training set for the inform stage? I have noticed that the native version of the code is extremely inefficient at training the trees for larger datasets: I trained on 500,000 objects using ugrizy and the five colors u-g, g-r, etc., and it took something like 51 hours to train a forest of 100 trees, even when running in parallel on five processors (I do not remember checking the depth of the resulting trees, though). If your dataset is >500,000 objects, that might be contributing to this recursion limit error. The only other potential issue I can think of is unexpected values due to non-detections, negative fluxes, etc. Are there any np.inf or np.nan values in your dataset, or non-detections that are not properly handled?
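A quick way to check for those problem values (a hypothetical snippet: `mags` stands in for one magnitude column read from the HDF5 file, e.g. via h5py; the values here are made up):

```python
import numpy as np

# Toy stand-in for a magnitude column loaded from the training file.
mags = np.array([24.1, np.nan, 25.3, np.inf, np.nan])

# Count the two kinds of problem values separately.
n_nan = np.count_nonzero(np.isnan(mags))
n_inf = np.count_nonzero(np.isinf(mags))
print(n_nan, n_inf)  # 2 NaNs and 1 inf in this toy column
```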

Unfortunately the original TPZ did not include anything like a "max_depth" parameter to truncate the tree depth. I thought about adding one, but decided against it after discovering how much more efficient the sklearn trees are. If that would be useful I can look into adding it as a feature, but just like using a different tree code via sklearn, it would no longer be quite the same as the original TPZ code.

hdante commented 2 months ago

Hello, the training set used was small, with 1561 objects. It was attached when I opened the issue (train3.hdf5). It's a slice of a training set that comes from the LSST DP0.2 dataset. In principle, all the parameters you mentioned should already be OK.

There are many NaN values in the dataset. They are in the magnitude columns; I think they come from fluxes that were too low in the original dataset.

I'm not sure about changing the original algorithm; it depends more on the users. I can ask the LineA team about it. It makes sense to have some kind of bound that relates the size of the input data to a maximum depth. In any case, I don't think the problem with this small dataset has anything to do with it being too big.

hdante commented 1 month ago

Hello, any updates here?

sschmidt23 commented 1 month ago

No updates; I have not had time to look into this particular issue, and I am on vacation for the next week. I will not be able to work on it until next week, but if you can send me the dataset (maybe via Slack DM) that is causing the issue, then I can try to trace through and find the problem when I get back.

hdante commented 1 month ago

Hello, Sam, another person independently verified that the training input attached to this issue is causing infinite recursion. Do you have any updates ?

sschmidt23 commented 1 month ago

Hi Henrique, I grabbed the train3.hdf5 file and ran it in a quick notebook today, and the model trained just fine for me with 5 bootstraps * 3 random realizations (so 15 trees): no infinite recursion. I see that a very large fraction of the magnitudes and magnitude errors are np.nan, e.g. 1183 out of 1561 mag_g values are np.nan, with similar numbers in the other bands. This is not really an ideal dataset to train on, with such a large fraction of the data not detected. I did set nondetect_val to np.nan, and set the value that replaces the NaNs in each band to 30. The code runs, but gives terrible photo-z's for many galaxies (though most of those have many NaNs, so maybe that's to be expected).

I'll take a closer look tomorrow, write up a little notebook, and send it to you on Slack to see if you still get the recursion error with my notebook, as I cannot reproduce your error.

sschmidt23 commented 1 month ago

So, looking at the data file that was giving you the recursion error, it appears that in each band, if the magnitude error is larger than 0.2, then the magnitude and error both get assigned np.nan. This means that many of the galaxies have multiple np.nan values, and there are really only a few hundred bright galaxies with good detections in all bands. I'm still not sure whether the recursion error was caused by not setting nondetect_val=np.nan, but that is my suspicion. Can you try running the following notebook and see if you get the recursion error? I'm running python 3.10 and numpy v1.26.4 on my laptop, where I can produce (bad) photo-z estimates with no problems. Note that the train3.hdf5 file should be in the same directory as the notebook, or you can modify the path. TPZ_recursion_test.ipynb.gz

sschmidt23 commented 1 month ago

Actually, I forgot to try setting nondetect_val to something other than np.nan; I just did that and did reproduce your recursion error. So I'm pretty sure that is your problem: you can't have np.nan in the tree, as the tree logic gets confused by the NaNs and keeps splitting and splitting. If you replace the NaNs, either as a preprocessing step or by setting nondetect_val to np.nan (which will replace those values with the limiting magnitudes in the mag_limits dictionary), then the tree should behave properly.
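A minimal numpy sketch of this failure mode (the array values are made up, and the replacement value of 30 is a stand-in for a limiting magnitude, mirroring what the mag_limits replacement does):

```python
import numpy as np

# Every comparison against NaN is False, so NaN rows always land on
# the same side of any threshold split, and the variance used to
# score candidate splits is itself NaN.
x = np.array([1.0, 2.0, np.nan, np.nan])
print(x > 1.5)    # NaNs never satisfy the threshold
print(np.var(x))  # nan: split quality is undefined

# Replacing the NaNs with a finite sentinel restores well-defined
# splits and variances.
clean = np.where(np.isnan(x), 30.0, x)
print(np.var(clean))
```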

hdante commented 1 month ago

Sam, the test file has no scientific significance; it was randomly sliced from a larger dataset to debug this issue. I'll test again with nondetect_val configured soon and get back to you. Thanks,

hdante commented 2 weeks ago

Hello, Sam, so far I haven't been able to confirm that changing nondetect_val to NaN completely solves the problem. I'm seeing some executions that fully work and others that don't. I'll continue testing.

hdante commented 1 week ago

Hello, Sam, I've just submitted 5000 executions of the training script, and I'm seeing some executions finish successfully while others still hit the same recursion error: maximum recursion depth exceeded. The test file is the same one already attached, train3.hdf5, which is small. I'm setting nondetect_val=nan. Other than changing the column names, I'm running with default parameters, which means 40 trees for 8 random realizations * 5 bootstraps. I can add some debugging to the TPZ code to dump the received parameters if necessary. Right now, my evaluation is that adding the non-detect parameter improved the executions, but not all of them. Can you reproduce this problem? Thanks,

hdante commented 22 hours ago

Hello, Sam, could you run this large test to confirm the problem?

sschmidt23 commented 11 hours ago

Hi Henrique, I am at the DESC conference in Zurich this week. I'll try to set aside some time on Friday to look into this, but I won't have time before then.