Open smrenna opened 1 year ago
If, in `prediction2YODA`, I make the following change, the histograms all line up:
```python
with open(fvals) as f:
    import json
    rd = json.load(f)
xmin = np.array(rd["__xmin"])
xmax = np.array(rd["__xmax"])
keys = list(rd.keys())
hids = np.array([b.split("#")[0] for b in keys])
```
However, the approximation file still does not give a good representation of the data, which makes me wonder whether there is also some mismatch in `app-build`. That said, it is not clear to me how the polynomial fit depends on the "x" values.
The problem is that `vals = app.AppSet(fvals)` sorts the approximations with `app.tools.sorted_nicely` after reading them from the fvals JSON, but the `__xmin`/`__xmax` arrays from the same JSON are loaded without applying that sorting.
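To illustrate the mismatch: a natural-sort helper in the spirit of `sorted_nicely` (this is a common recipe and only a sketch; the exact implementation in `app.tools` may differ) orders `#2` before `#10`, unlike the key order stored in the JSON file:

```python
import re

def sorted_nicely(seq):
    """Natural sort: embedded numbers compare numerically, not lexically."""
    convert = lambda s: int(s) if s.isdigit() else s.lower()
    key = lambda item: [convert(c) for c in re.split(r'(\d+)', item)]
    return sorted(seq, key=key)

# hypothetical bin ids in file order
keys = ["/ANA/hist#10", "/ANA/hist#2", "/ANA/hist#1"]
print(sorted_nicely(keys))  # ['/ANA/hist#1', '/ANA/hist#2', '/ANA/hist#10']
```

So the approximations end up in `#1, #2, #10` order while the bin edges stay in file order, and the two no longer line up index by index.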
The fix that works for version 1.0.7 from pip is to apply the same permutation to the bin edges:

```python
with open(fvals) as f:
    import json
    rd = json.load(f)
xmin = np.array(rd["__xmin"])
xmax = np.array(rd["__xmax"])
# permute the bin edges from file order into the sorted_nicely order used by vals
ids_nicesort = vals._binids
ids_likefile = [x for x in rd.keys() if not x.startswith("__")]
likefile2nicesort = [ids_likefile.index(x) for x in ids_nicesort]
xmin = xmin[likefile2nicesort]
xmax = xmax[likefile2nicesort]
```
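A toy version of that permutation (with hypothetical bin ids and edge values, not taken from the actual files) shows how the reindexing brings the edges into the sorted order:

```python
import numpy as np

ids_likefile = ["h#10", "h#2", "h#1"]   # order as read from the JSON
ids_nicesort = ["h#1", "h#2", "h#10"]   # order after natural sorting
xmin = np.array([10.0, 2.0, 1.0])       # edges stored in file order

# for each id in sorted order, find its position in the file order
likefile2nicesort = [ids_likefile.index(x) for x in ids_nicesort]
xmin_sorted = xmin[likefile2nicesort]
print(xmin_sorted)  # [ 1.  2. 10.]
```

Note that `list.index` is O(n) per lookup; for large bin counts a dict mapping id to position would be faster, but for typical histogram counts this is negligible.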
Hi @ojinoo, are you saying there is a version 1.0.7 (not in this repository, because I don't see that tag) that already has the fix in it? I had found a similar solution but had not committed anything yet. However, if there is another version out there, it would be good to synchronize it with the gitlab. Thanks.
@ojinoo This was my solution:

```diff
@@ -440,14 +441,21 @@ def prediction2YODA(fvals, Peval, fout="predictions.yoda", ferrs=None, wfile=Non
     hids=np.array([b.split("#")[0] for b in vals._binids])
-    hnames = sorted(set(hids))
+    # the following will remove duplicates but preserve the order
+    hnames=list(dict.fromkeys(hids))
     observables = sorted([x for x in set(app.io.readObs(wfile)) if x in hnames]) if wfile is not None else hnames
     with open(fvals) as f:
         rd = json.load(f)
     xmin = np.array(rd["__xmin"])
     xmax = np.array(rd["__xmax"])
+    # The order of the keys in the JSON read is not set
+    analysisIds=np.array([b.split("#")[0] for b in list(rd.keys())])
     DX = (xmax-xmin)*0.5
     X = xmin + DX
     Y2D = []
+    # X and Y are not guaranteed to be in the same order
     import yoda
+    start = 0
     for obs in observables:
-        idx = np.where(hids==obs)
-        P2D = [yoda.Point2D(x,y,dx,dy) for x,y,dx,dy in zip(X[idx], Y[idx], DX[idx], dY[idx])]
+        idx = np.where(analysisIds==obs)
+        strand = np.size(idx)
+        jdx = np.arange(start,start+strand)
+        start = start + strand
+        P2D = [yoda.Point2D(x,y,dx,dy) for x,y,dx,dy in zip(X[idx], Y[jdx], DX[idx], dY[jdx])]
         Y2D.append(yoda.Scatter2D(P2D, obs, obs))
     yoda.write(Y2D, fout)
```
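The key change in the first hunk is swapping `sorted(set(...))` for `dict.fromkeys(...)`: both deduplicate, but only the latter preserves the original encounter order (guaranteed for plain dicts since Python 3.7). A toy comparison with hypothetical histogram ids:

```python
hids = ["h2", "h2", "h1", "h1"]           # duplicates, in file order
print(sorted(set(hids)))                  # ['h1', 'h2']  - dedups but reorders
print(list(dict.fromkeys(hids)))          # ['h2', 'h1']  - dedups, keeps order
```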
But yours would be preferable if it does the same thing in less code.
Hi @smrenna, the fix is not in 1.0.7 (https://pypi.org/project/pyapprentice/1.0.7/; the newest version is https://pypi.org/project/pyapprentice/1.1.0/, but that is also buggy for me). It is my own solution.
I have now forked this github version and I'm also using it for my work, so that I can add bugfixes when I get inconsistencies. My fork is at https://github.com/ojinoo/apprentice.
Tuning/minimization appears to work, but the prediction YODA file has different properties than the reference data.
I am reproducing results from Julia Yarba, who is using root-extracted YODA files. I thought the issue was with the structure of these YODA files (the naming), but changing that did not make a difference.
I can share the data and MC runs (they are not large); just say where.
Here is an example of the comparisons.