CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License

Processing dataset with zero events #1141

Open alexander-held opened 1 month ago

alexander-held commented 1 month ago

Describe the bug: When a dataset (as returned by dataset_tools.preprocess) has zero chunks to run over, dataset_tools.apply_to_fileset raises FileNotFoundError on {}. I think this usually happens when something more serious has already gone wrong, but I occasionally see people run into it, and it is not immediately obvious what happened.

To Reproduce: The following requires the fix for #1140 to run:

import uproot
import awkward as ak
from coffea import dataset_tools
from coffea.nanoevents import BaseSchema
import dask

# f1.root holds a tree with zero entries, f2.root a tree with one entry
with uproot.recreate("f1.root") as f:
    f["tree"] = {"arr": ak.Array([])}
with uproot.recreate("f2.root") as f:
    f["tree"] = {"arr": ak.Array([1])}

# a dataset containing only the empty file triggers the error
fileset = {"dummy": {"files": {"f1.root": "tree"}}}
# fileset = {"dummy": {"files": {"f1.root": "tree", "f2.root": "tree"}}}  # this works
samples, _ = dataset_tools.preprocess(fileset)
tasks = dataset_tools.apply_to_fileset(lambda evts: None, samples, schemaclass=BaseSchema)
_ = dask.compute(tasks)

Expected behavior: A warning along the lines of "no usable files found for dataset xyz", and no exception raised by default.
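As a rough illustration of the expected behavior, a user-side workaround could filter the preprocessed fileset before handing it to apply_to_fileset. The helper name drop_empty_datasets is hypothetical and not part of coffea, and the sketch assumes each dataset entry is a dict with a "files" mapping whose values may carry a "steps" list of chunks; the actual layout returned by dataset_tools.preprocess may differ.

```python
import warnings


def drop_empty_datasets(fileset):
    """Return a copy of `fileset` without datasets that have nothing to process.

    Hypothetical helper, not a coffea API. Assumes the layout
    {dataset: {"files": {path: {"steps": [[start, stop], ...], ...}}}}.
    """
    kept = {}
    for name, dataset in fileset.items():
        files = dataset.get("files") or {}
        # A file counts as usable if it carries no chunk information at all
        # (plain tree name, or "steps" missing/None) or has at least one chunk.
        usable = {
            path: info
            for path, info in files.items()
            if not isinstance(info, dict)
            or info.get("steps") is None
            or len(info["steps"]) > 0
        }
        if not usable:
            warnings.warn(f"no usable files found for dataset {name!r}; skipping it")
            continue
        kept[name] = {**dataset, "files": usable}
    return kept
```

Under these assumptions, calling samples = drop_empty_datasets(samples) between preprocess and apply_to_fileset would emit a warning for "dummy" and leave an empty fileset instead of raising.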

Output

Traceback (most recent call last):
  File "[...]/test.py", line 17, in <module>
    tasks = dataset_tools.apply_to_fileset(lambda evts: None, samples, schemaclass=BaseSchema)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/coffea/dataset_tools/apply_processor.py", line 125, in apply_to_fileset
    dataset_out = apply_to_dataset(
                  ^^^^^^^^^^^^^^^^^
  File "[...]/coffea/dataset_tools/apply_processor.py", line 73, in apply_to_dataset
    ).events()
      ^^^^^^^^
  File "[...]/coffea/nanoevents/factory.py", line 684, in events
    events = self._mapping(form_mapping=self._schema)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/uproot/_dask.py", line 183, in dask
    files = uproot._util.regularize_files(files, steps_allowed=True, **options)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/uproot/_util.py", line 946, in regularize_files
    raise _file_not_found(files)
FileNotFoundError: file not found

    {}

Files may be specified as:
   * str/bytes: relative or absolute filesystem path or URL, without any colons
         other than Windows drive letter or URL schema.
         Examples: "rel/file.root", "C:\abs\file.root", "http://where/what.root"
   * str/bytes: same with an object-within-ROOT path, separated by a colon.
         Example: "rel/file.root:tdirectory/ttree"
   * pathlib.Path: always interpreted as a filesystem path or URL only (no
         object-within-ROOT path), regardless of whether there are any colons.
         Examples: Path("rel:/file.root"), Path("/abs/path:stuff.root")

Functions that accept many files (uproot.iterate, etc.) also allow:
   * glob syntax in str/bytes and pathlib.Path.
         Examples: Path("rel/*.root"), "/abs/*.root:tdirectory/ttree"
   * dict: keys are filesystem paths, values are objects-within-ROOT paths.
         Example: {"/data_v1/*.root": "ttree_v1", "/data_v2/*.root": "ttree_v2"}
   * already-open TTree objects.
   * iterables of the above.

Desktop (please complete the following information): n/a

Additional context n/a

lgray commented 1 month ago

Technically the error is correct: if the list is empty, it does not conform to that spec!

We can raise a clearer error that includes some suggestions for resolving the situation.
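A clearer error could look something like the sketch below. This is only an illustration: EmptyDatasetError and require_files are hypothetical names, not existing coffea code, and the suggestions in the message are guesses at what would help a user rather than wording agreed on in this thread.

```python
class EmptyDatasetError(ValueError):
    """Hypothetical exception for a dataset with no files left to process."""


def require_files(name, files):
    """Return `files` unchanged, or raise a descriptive error if it is empty."""
    if files:
        return files
    raise EmptyDatasetError(
        f"dataset {name!r} has no files to process. "
        "Check that the input files exist, that the requested tree is present, "
        "and that at least one file contains events."
    )
```

Such a check would run per dataset before the files mapping is passed on to uproot, so the user sees the dataset name instead of a bare FileNotFoundError on {}.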