CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffea-hep.readthedocs.io
BSD 3-Clause "New" or "Revised" License

`bytesread` in metrics varies depending on file source and disagrees with pure `uproot` #717

Open alexander-held opened 2 years ago

alexander-held commented 2 years ago

Describe the bug

The `bytesread` metric differs depending on whether the file is processed locally or read through https. Both values also differ from what pure `uproot` reports.

To Reproduce

import urllib.request
from coffea import processor
import uproot

file_local = "data.root"
file_remote = "https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/"\
    "RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM/"\
    "PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/"\
    "00DF0A73-17C2-E511-B086-E41D2D08DE30.root"

# download file
urllib.request.urlretrieve(file_remote, file_local)

class TtbarAnalysis(processor.ProcessorABC):
    def process(self, events):
        events["jet_pt"]**2
        events["jet_eta"]**2
        events["jet_phi"]**2
        return {}

    def postprocess(self, accumulator):
        return accumulator

# coffea with local file + https
for fileset, method in zip([{"ttbar": [file_local]}, {"ttbar": [file_remote]}], ["local", "https"]):
    executor = processor.IterativeExecutor()
    run = processor.Runner(executor=executor, savemetrics=True)
    _, metrics = run(fileset, "events", processor_instance=TtbarAnalysis())
    print(f"data read (coffea {method}): {metrics['bytesread']/1000**2} MB")

# uproot with local file + https
for filename, method in zip([file_local, file_remote], ["local", "https"]):
    f = uproot.open(filename)
    f['events'].arrays(["jet_pt", "jet_eta", "jet_phi"])
    print(f"data read (uproot {method}): {f.file.source.num_requested_bytes/1000**2} MB")

Expected behavior

All four numbers should match.

Output

Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:00 < 0:00:00 | ? file/s ]
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:00 < 0:00:00 | ? chunk/s ]
data read (coffea local): 9.887013 MB
Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:03 < 0:00:00 | ? file/s ]
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:11 < 0:00:00 | ? chunk/s ]
data read (coffea https): 1.704775 MB
data read (uproot local): 5.001088 MB
data read (uproot https): 5.001088 MB

Desktop

coffea 0.7.16, uproot 4.3.3

Additional context

n/a

nsmith- commented 1 year ago

Coffea's `bytesread` is the same as uproot's (https://github.com/CoffeaTeam/coffea/blob/b14672ef969fa0ee0e4ad150936bc6465dbde7bd/coffea/processor/executor.py#L1649), so there must be some issue with how and when we are accessing this information. Shared source object?
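To make the "shared source object" idea concrete, here is a toy sketch. It is not coffea or uproot code: `ToySource` is a hypothetical stand-in for a source that keeps a cumulative `num_requested_bytes` counter, and the chunk sizes are made up. It only shows how the timing of the counter reads can change the reported total. It is not a claim about what the executor actually does.

```python
class ToySource:
    """Hypothetical stand-in for a file source with a cumulative byte counter."""

    def __init__(self):
        self.num_requested_bytes = 0

    def read(self, nbytes):
        # Every read adds to the same running counter.
        self.num_requested_bytes += nbytes


def summed_readings(source, chunk_sizes):
    """Add up the raw counter value after each chunk (over-counts if the source is shared)."""
    total = 0
    for n in chunk_sizes:
        source.read(n)
        total += source.num_requested_bytes
    return total


def summed_deltas(source, chunk_sizes):
    """Add up only the bytes requested during each chunk."""
    total = 0
    for n in chunk_sizes:
        before = source.num_requested_bytes
        source.read(n)
        total += source.num_requested_bytes - before
    return total


chunks = [100, 200, 300]  # made-up chunk sizes in bytes
naive = summed_readings(ToySource(), chunks)
exact = summed_deltas(ToySource(), chunks)
print(naive, exact)  # prints "1000 600"
```

If something like this were happening, the discrepancy would depend on how many chunks share one source, which would be one thing to check in the reproducer.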

lgray commented 10 months ago

Do we still care about this, @alexander-held @nsmith-? I think we narrowed this down to retries / re-requesting data when reading over xrootd/http, so the amount of data read is not deterministic.
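As a rough illustration of the retry explanation, the toy model below counts requested bytes when a failed request is re-issued in full. Everything in it is assumed for illustration: `bytes_requested`, the failure rate, and the chunk sizes are hypothetical, and no real xrootd or http traffic is involved.

```python
import random


def bytes_requested(chunk_sizes, failure_rate, rng):
    """Count bytes requested when each failed request is re-issued in full.

    failure_rate must be below 1.0 so that every request eventually succeeds.
    """
    requested = 0
    for n in chunk_sizes:
        while True:
            requested += n  # the request is counted whether or not it succeeds
            if rng.random() >= failure_rate:
                break  # success, move on to the next chunk
    return requested


chunks = [1_000_000] * 5  # made-up: five requests of 1 MB each

# With no failures the count equals the payload size exactly.
no_retries = bytes_requested(chunks, 0.0, random.Random(0))

# With a flaky connection the count depends on which requests happen to fail.
flaky_a = bytes_requested(chunks, 0.3, random.Random(1))
flaky_b = bytes_requested(chunks, 0.3, random.Random(2))

print(no_retries / 1000**2, flaky_a / 1000**2, flaky_b / 1000**2)
```

In this model retries can only push the count above the payload size, so by itself it would not account for a remote number that is lower than the local one.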