FAST-HEP / fast-carpenter

Helping turn your trees into tables (ie. reads ROOT TTrees, writes summary Pandas DataFrames)
https://fast-hep.web.cern.ch

When `fast_carpenter` is installed from source, unable to interpret branch as Python type #54

Closed asnaylor closed 5 years ago

asnaylor commented 5 years ago

In a clean virtualenv, I was unable to run a simple test YAML file after installing fast_carpenter from source. The same YAML file works fine with fast_carpenter installed from PyPI in this clean virtualenv:

stages:
    - cuts: fast_carpenter.CutFlow
    - s1logs2: fast_carpenter.BinnedDataframe

cuts:
    selection:
        All:
            - singleScatters.nSingleScatters > 0

s1logs2:
    binning:
        - {in: singleScatters.s1Area_phd, out: s1}
        - {in: singleScatters.s2Area_phd, out: s2}

But when I install fast_carpenter from source via:

git clone https://github.com/FAST-HEP/fast-carpenter.git
cd fast-carpenter
python setup.py develop

I get this error message when trying to run it: ValueError: cannot interpret branch 'singleScatters.' as a Python type. The full output is:

<fast-carpenter-dev> anaylor@hep17 | fast-carpenter-fork ⑂master* $ fast_carpenter test_dataset.yml test.yml
/home/anaylor/.local/lib/python2.7/site-packages/fast_curator/read.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  datasets_dict = yaml.load(f)
2019-07-11 12:30:02,244 - alphatwirl.concurrently.CommunicationChannel0 - WARNING - alphatwirl.concurrently.CommunicationChannel0.CommunicationChannel0.__init__(): the option "progressbar" is deprecated. "progressbar=True" is given. use atpbar.disable() instead to turn off progress bars
WARNING:alphatwirl.concurrently.CommunicationChannel0:alphatwirl.concurrently.CommunicationChannel0.CommunicationChannel0.__init__(): the option "progressbar" is deprecated. "progressbar=True" is given. use atpbar.disable() instead to turn off progress bars
   0.00%                                          |        0 /        1 |:  test
Traceback (most recent call last):
  File "/home/anaylor/fast-carpenter-dev/bin/fast_carpenter", line 9, in <module>
    load_entry_point('fast-carpenter==0.12.0', 'console_scripts', 'fast_carpenter')()
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/__main__.py", line 72, in main
    _, ret_val = run_carpenter(sequence, datasets, args)
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/__main__.py", line 90, in run_carpenter
    ret_val = process.run(datasets, sequence)
  File "/home/anaylor/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 59, in run
    result = self._run(loop)
  File "/home/anaylor/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 121, in _run
    result = loop()
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/datasetloop/loop.py", line 30, in __call__
    self.reader.read(dataset)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/datasetloop/reader.py", line 27, in read
    reader.read(dataset)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/EventDatasetReader.py", line 66, in read
    runids = self.eventLoopRunner.run_multiple(eventLoops)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/MPEventLoopRunner.py", line 93, in run_multiple
    return self.communicationChannel.put_multiple(eventLoops)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/concurrently/CommunicationChannel0.py", line 50, in put_multiple
    task_idx = self.put(t)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/concurrently/CommunicationChannel0.py", line 37, in put
    result = task(*args, **kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/EventLoop.py", line 45, in __call__
    self.reader.event(event)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/ReaderComposite.py", line 43, in event
    if reader.event(event) is False:
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/summary/binned_dataframe.py", line 169, in event
    data = chunk.tree.pandas.df(all_inputs)
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/masked_tree.py", line 27, in df
    df = self._owner.tree.pandas.df(*args, **kwargs)
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/tree_wrapper.py", line 70, in df
    df = self._owner.tree.pandas.df(*args, **kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/_connect/_pandas.py", line 30, in df
    return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, namedecode=namedecode, entrystart=entrystart, entrystop=entrystop, flatten=flatten, flatname=flatname, awkwardlib=awkwardlib, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)
  File "/home/anaylor/lzsim/fast-carpenter/fast_carpenter/tree_wrapper.py", line 53, in arrays
    return self.tree.old_arrays(*args, **kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/tree.py", line 447, in arrays
    branches = list(self._normalize_branches(branches, awkward))
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/tree.py", line 763, in _normalize_branches
    raise ValueError("cannot interpret branch {0} as a Python type\n   in file: {1}".format(repr(branch.name), self._context.sourcepath))
ValueError: cannot interpret branch 'singleScatters.' as a Python type
   in file: /home/anaylor/lzsim/fast-carpenter-fork/lz_2017040100_lzap.root

These are my pip libraries:

alphatwirl==0.25.2
atomicwrites==1.3.0
atpbar==1.0.3
atsge==0.1.11
attrs==19.1.0
atuproot==0.1.13
awkward==0.10.3
Babel==0.9.6
cachetools==3.1.1
configparser==3.7.4
contextlib2==0.5.5
coverage==4.5.3
docutils==0.11
entrypoints==0.3
enum34==1.1.6
-e git+git@github.com:FAST-HEP/fast-carpenter.git@63c3792108429545e6a4c3e660088c7907bc9999#egg=fast_carpenter-master
fast-curator==0.2.1
fast-flow==0.2.1
fast-plotter==0.2.1
flake8==3.7.8
funcsigs==1.0.2
functools32==3.2.3.post2
importlib-metadata==0.18
Jinja2==2.6
llvmlite==0.28.0
mantichora==0.9.5
MarkupSafe==0.11
mccabe==0.6.1
more-itertools==5.0.0
nose==1.3.7
numba==0.43.1
numexpr==2.6.9
numpy==1.16.4
pandas==0.24.2
pathlib2==2.3.4
Pillow==6.0.0
pluggy==0.12.0
py==1.8.0
pycodestyle==2.5.0
pyflakes==2.1.1
Pygments==1.5
pytest==4.3.0
pytest-cov==2.6.1
pytest-runner==5.1
python-dateutil==2.7.0
pytz==2019.1
PyYAML==5.1.1
scandir==1.10.0
scipy==0.12.1
simplejson==3.2.0
singledispatch==3.4.0.3
six==1.12.0
Sphinx==1.1.3
SQLAlchemy==0.7.9
tables==3.5.2
tornado==5.0.1
typing==3.7.4
uproot==3.6.1
uproot-methods==0.6.1
virtualenv==13.1.0
Werkzeug==0.8.3
wheel==0.24.0
zipp==0.5.1

benkrikler commented 5 years ago

Can you try running the unit tests on your source install? You need to have pytest installed (you can get it from pip) and then you can run pytest -vv tests/ in the top directory of the repo. Hopefully that will tell us more clearly where the error is coming from.

asnaylor commented 5 years ago

Hmm, all 52 tests passed.

asnaylor commented 5 years ago

I tested the source build of fast_carpenter with the files and config from the fast_cms_tutorial and it works fine, so the problem must be specific to the file structure in the test_dataset.yml I am using. It's strange, because the PyPI version works fine with test_dataset.yml but the source build does not.
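
The traceback above loads modules from three places: the source checkout (/home/anaylor/lzsim/fast-carpenter), the virtualenv (/home/anaylor/fast-carpenter-dev) and the user site directory (/home/anaylor/.local/lib/python2.7/site-packages). When comparing a PyPI install with a source install, it can help to confirm which copy of each package Python actually picks up. The following is only a minimal diagnostic sketch, written for Python 3 (the thread itself uses Python 2.7); the helper name `module_origin` is hypothetical and not part of fast_carpenter.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec can raise if the module is in a broken state
        return None
    return spec.origin if spec is not None else None


# Package names taken from the traceback and pip listing in this thread.
for name in ("fast_carpenter", "uproot", "alphatwirl", "pandas"):
    print(name, "->", module_origin(name))
```

If the printed paths differ between the two environments, the two runs are not using the same dependency versions, which would be worth ruling out before looking for a bug in the source tree.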

benkrikler commented 5 years ago

Hi @asnaylor. Do you think you could test this again with the latest version of carpenter (v0.13.0) or update your source install? I wonder if this was a consequence of the bug fixed in PR #60, in which case that fix may have resolved this as well.

asnaylor commented 5 years ago

Hi @benkrikler, I tested the same files and config with the latest version from pip (v0.13.0), and the error has now changed to: pandas/core/groupby/groupby.py", line 3291, in _get_grouper raise KeyError(gpr) KeyError: 'singleScatters.s2Area_phd'.

To fix that, I added a fast_carpenter.Define stage to the config, where I define reduxS2: {reduce: 0, formula: singleScatters.s2Area_phd}. I then bin on reduxS2 instead of singleScatters.s2Area_phd, and that works fine.

What's bizarre is that just binning singleScatters.s2Area_phd (which is a vector of floats but always size 1) has worked previously but now doesn't.

benkrikler commented 5 years ago

Can you post the full traceback, so I can understand better where this comes from? I think the issue comes from a feature that was added in v0.12.0 which allows you to calculate variables directly in the binned dataframe stage without defining a variable first. My guess is that full stop in the branch name confuses the way the expression parsing is handled there.
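
To see why a full stop in a name can trip up expression handling, here is a small sketch using Python's standard `ast` module. It is not fast_carpenter's actual parser, and it assumes only that the expression is read with Python-like grammar. The helper names `top_level_node` and `names_used` are made up for this illustration.

```python
import ast


def top_level_node(expression):
    """Return the class name of the top-level AST node of a Python expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


def names_used(expression):
    """Collect the bare identifiers a Python-grammar parser sees in the expression."""
    tree = ast.parse(expression, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


dotted = "singleScatters.s2Area_phd"   # branch name from the config in this thread
alias = "singleScatters__s2Area_phd"   # dot-free alias

# The dotted name parses as attribute access on the identifier 'singleScatters'.
print(top_level_node(dotted), names_used(dotted))
# The alias parses as one plain identifier.
print(top_level_node(alias), names_used(alias))
```

Under Python grammar, the dotted branch name is an attribute lookup on `singleScatters`, not a single variable, so a parser that collects identifiers would ask for a branch that does not exist under that name. A name without a full stop avoids the ambiguity.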

Also, to be clear, defining reduxS2: {reduce: 0, formula: singleScatters.s2Area_phd} isn't strictly the same thing, since it only looks at the first single scatter in each event, whereas binning on singleScatters.s2Area_phd would (if it wasn't breaking) look at all single scatters in every event. However, if my suspicion is right, then producing a new variable without a full stop in its name but with the full contents of the original variable should be a valid workaround:

- singleScatters__s2Area_phd: singleScatters.s2Area_phd
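
The difference between taking the first entry per event and keeping every entry can be sketched in plain Python with nested lists standing in for a jagged branch. The values are made up, and the handling of empty events (returning `None`) is an assumption of this sketch, not a statement about how fast_carpenter's `reduce` option treats them.

```python
# Hypothetical jagged data: one inner list of s2 areas per event.
events = [[10.0], [12.5, 3.0], [], [7.0]]


def first_per_event(jagged, missing=None):
    """Keep only the first entry of each event (analogous to reduce: 0)."""
    return [entry[0] if entry else missing for entry in jagged]


def all_entries(jagged):
    """Flatten the jagged structure, keeping every entry of every event."""
    return [value for entry in jagged for value in entry]


print(first_per_event(events))  # one value per event
print(all_entries(events))      # one value per single scatter
```

When every event holds exactly one entry, both approaches produce the same values, which is why the reduce-based definition is an acceptable substitute in that special case.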

asnaylor commented 5 years ago

Sure, here's the full KeyError traceback:

Traceback (most recent call last):
  File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/worker.py", line 27, in run
    self._run_tasks()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/worker.py", line 47, in _run_tasks
    result = task_func()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/main.py", line 18, in __call__
    return self.task(*self.args, **self.kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/concurrently/CommunicationChannel.py", line 16, in __call__
    return self.task(*self.args, **self.kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/EventLoop.py", line 45, in __call__
    self.reader.event(event)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/ReaderComposite.py", line 43, in event
    if reader.event(event) is False:
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/summary/binned_dataframe.py", line 195, in event
    out_dimensions=self._out_bin_dims)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/summary/binned_dataframe.py", line 240, in _bin_values
    bins = data.groupby(final_bin_dims)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/pandas/core/generic.py", line 7632, in groupby
    observed=observed, **kwargs)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/pandas/core/groupby/groupby.py", line 2110, in groupby
    return klass(obj, by, **kwds)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/pandas/core/groupby/groupby.py", line 360, in __init__
    mutated=self.mutated)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/pandas/core/groupby/grouper.py", line 578, in _get_grouper
    raise KeyError(gpr)

Yeah, using the formula isn't the same, but as it's a vector of size one it's okay for this particular example. However, when I add a new definition without the formula, as you suggested above, I get an IndexError:

Traceback (most recent call last):
  File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/worker.py", line 27, in run
    self._run_tasks()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/worker.py", line 47, in _run_tasks
    result = task_func()
  File "/home/anaylor/.local/lib/python2.7/site-packages/mantichora/main.py", line 18, in __call__
    return self.task(*self.args, **self.kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/concurrently/CommunicationChannel.py", line 16, in __call__
    return self.task(*self.args, **self.kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/EventLoop.py", line 45, in __call__
    self.reader.event(event)
  File "/home/anaylor/.local/lib/python2.7/site-packages/alphatwirl/loop/ReaderComposite.py", line 43, in event
    if reader.event(event) is False:
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/summary/binned_dataframe.py", line 189, in event
    data = chunk.tree.pandas.df(all_inputs)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/masked_tree.py", line 27, in df
    df = self._owner.tree.pandas.df(*args, **kwargs)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/tree_wrapper.py", line 70, in df
    df = self._owner.tree.pandas.df(*args, **kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/_connect/_pandas.py", line 30, in df
    return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, namedecode=namedecode, entrystart=entrystart, entrystop=entrystop, flatten=flatten, flatname=flatname, awkwardlib=awkwardlib, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)
  File "/home/anaylor/fast-carpenter-dev/lib/python2.7/site-packages/fast_carpenter/tree_wrapper.py", line 53, in arrays
    return self.tree.old_arrays(*args, **kwargs)
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/tree.py", line 484, in arrays
    return wait()
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/tree.py", line 468, in wait
    return uproot._connect._pandas.futures2df(futures, outputtype, entrystart, entrystop, flatten, flatname, awkward)
  File "/home/anaylor/.local/lib/python2.7/site-packages/uproot/_connect/_pandas.py", line 192, in futures2df
    indexes = awkward.JaggedArray(starts, stops, awkward.numpy.empty(stops[-1], dtype=object)).tojagged(indexes).content
  File "/home/anaylor/.local/lib/python2.7/site-packages/awkward/array/jagged.py", line 780, in tojagged
    content[good] = data[self.parents[good]]
IndexError: index 1690 is out of bounds for axis 0 with size 1676