LSSTDESC / rail_sklearn

RAIL algorithms that depend on scikit-learn.
MIT License

NZDir fails if data is larger than chunk_size #11

Closed sschmidt23 closed 6 months ago

sschmidt23 commented 6 months ago

Irene Moskowitz messaged me pointing out that NZDir was failing for a dataset she was attempting to run, returning a qp ensemble with NaN for every entry. I ran the demo notebook and it ran fine with the default data, but failed in the way Irene described when I used a larger dataset of my own. I noticed that the demo notebook's three samples are all smaller than the default chunk_size of 10,000; if I set chunk_size=1000, the demo notebook fails with the error:

```
/Users/sam/anaconda3/envs/xtpz/lib/python3.10/site-packages/qp/hist_pdf.py:80: RuntimeWarning: invalid value encountered in divide
  self._hpdfs = (pdfs_2d.T / sums).T
```
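For reference, here is a minimal numpy sketch (not the qp code itself) of why that line produces NaNs: if any row of the 2-D pdf array sums to zero, the per-row normalization is a 0/0 division, which emits exactly this RuntimeWarning and fills the row with NaN.

```python
import numpy as np

# Minimal reproduction of the symptom, mimicking qp's
# `self._hpdfs = (pdfs_2d.T / sums).T` normalization step.
pdfs_2d = np.array([
    [1.0, 2.0, 1.0],   # a normally filled histogram row
    [0.0, 0.0, 0.0],   # a row that never got data, as in the bad chunks
])
sums = pdfs_2d.sum(axis=1)      # second sum is 0
hpdfs = (pdfs_2d.T / sums).T    # RuntimeWarning: invalid value in divide
print(hpdfs)                    # second row is [nan, nan, nan]
```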

So, it appears that there is a bug somewhere in the code, likely in the join_histogram function that merges the chunks at the end.

sschmidt23 commented 6 months ago

It looks like the normalization is not being tracked properly across multiple chunks: the ancillary data overwrites the normalization each time, so only the first set of M values (for M chunks of chunk_data) is populated and the rest are all zeros. Not sure if this is the only problem, but it's at least one problem.
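To illustrate, here is a hypothetical sketch of that overwrite pattern (variable names like norm_ancil are illustrative, not the actual rail_sklearn ones): if each chunk's M normalization values are assigned to the same leading slice of the ancillary array instead of to that chunk's own slice, every write lands on top of the previous one and everything past the first chunk stays zero.

```python
import numpy as np

n_total, chunk_size = 6, 2
norm_ancil = np.zeros(n_total)  # full-size ancillary normalization array

for start in range(0, n_total, chunk_size):
    chunk_norms = np.ones(chunk_size)  # stand-in for this chunk's sums

    # Buggy pattern: always write the M chunk values at the front,
    # so only the first chunk_size entries are ever populated.
    norm_ancil[:chunk_size] = chunk_norms

    # A correct pattern would write into this chunk's own slice:
    # norm_ancil[start:start + chunk_size] = chunk_norms

print(norm_ancil)  # [1. 1. 0. 0. 0. 0.] -- zeros beyond the first chunk
```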

sschmidt23 commented 6 months ago

Actually, it may be a problem in how the ancillary data is being added to the partial ensembles: the normalization is being computed in each process_chunk call, and those ensemble data are being added together, but the normalization ancil only writes out the ancillary data for that chunk (which makes sense, since each chunk only knows about itself).
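If that diagnosis is right, one plausible shape of the fix (sketched with illustrative names, since I haven't confirmed the actual join_histogram internals) would be to collect each partial ensemble's ancillary normalization and concatenate them in chunk order when the full ensemble is assembled, rather than letting each chunk's ancil overwrite the table:

```python
import numpy as np

# Each partial ensemble carries normalizations only for its own objects.
chunk_ancils = [
    {"normalization": np.array([2.0, 3.0])},  # chunk 0 knows objects 0-1
    {"normalization": np.array([1.0, 4.0])},  # chunk 1 knows objects 2-3
]

# Concatenate in chunk order so every object keeps its own value.
merged_norm = np.concatenate([a["normalization"] for a in chunk_ancils])
print(merged_norm)  # [2. 3. 1. 4.]
```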