CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License

Preprocessing a file at FNAL leads to an unclear exception #1138

Closed: bockjoo closed this issue 2 weeks ago

bockjoo commented 1 month ago

I am trying to preprocess a file at FNAL with coffea 2024.6.1, but I got this exception:

Traceback (most recent call last):
  File "/cmsuf/t2/operations/opt/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/submitFullDataset.py", line 1066, in <module>
    dataset_runnable, dataset_updated = preprocess(
                                        ^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/coffea/dataset_tools/preprocess.py", line 381, in preprocess
    processed_files_without_forms = processed_files[
                                    ^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/highlevel.py", line 1066, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 669, in _getitem
    return self._getitem_fields(list(where))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/indexedoptionarray.py", line 346, in _getitem_fields
    self._content._getitem_fields(where, only_fields),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/emptyarray.py", line 193, in _getitem_fields
    raise ak._errors.index_error(self, where, "not an array of records")
IndexError: cannot slice EmptyArray (of length 0) with ['file', 'object_path', 'steps', 'num_entries', 'uuid']: not an array of records

This error occurred while attempting to slice

    <Array [None, None] type='2 * ?unknown'>

with

    ['file', 'object_path', 'steps', 'num_entries', 'uuid']

From this message it was unclear what went wrong.

lgray commented 1 month ago

@bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

NJManganelli commented 1 month ago

I'll note that I've been seeing this with the occasional file opened via xrootd. One specific example: the /DoubleMuon/Run2016F*NanoAODv9-v1/NANOAOD file (there's just one, about 2.1 GB). I'm still investigating it, because it seems to open fine in uproot from Wisconsin, but whichever replica the datadiscoverycli picks with round-robin replica choice triggers this error (and the first option is the T1_US_FNAL disks, which are under maintenance today).

NJManganelli commented 1 month ago

Here's a single-file CMS dataset for which many replicas fail:

Sites availability for dataset: /DoubleMuon/Run2016F-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD
                Available replicas                
┏━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Index ┃ Site            ┃ Files ┃ Availability ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│   0   │ T1_US_FNAL_Disk │ 1 / 1 │    100.0%    │
│   1   │ T2_DE_DESY      │ 1 / 1 │    100.0%    │
│   2   │ T2_CH_CSCS      │ 1 / 1 │    100.0%    │
│   3   │ T1_DE_KIT_Disk  │ 1 / 1 │    100.0%    │
│   4   │ T3_KR_KISTI     │ 1 / 1 │    100.0%    │
│   5   │ T2_IT_Legnaro   │ 1 / 1 │    100.0%    │
│   6   │ T2_US_Wisconsin │ 1 / 1 │    100.0%    │
│   7   │ T2_BE_IIHE      │ 1 / 1 │    100.0%    │
│   8   │ T1_RU_JINR_Disk │ 1 / 1 │    100.0%    │
│   9   │ T3_US_NotreDame │ 1 / 1 │    100.0%    │
│  10   │ T3_IT_Trieste   │ 1 / 1 │    100.0%    │
│  11   │ T2_DE_RWTH      │ 1 / 1 │    100.0%    │
│  12   │ T3_KR_UOS       │ 1 / 1 │    100.0%    │
└───────┴─────────────────┴───────┴──────────────┘

This code should show the failure in action:

from coffea.dataset_tools import preprocess
run2016f = {
    "0": {"files": {
                "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "1": {"files": {        
                "root://dcache-cms-xrootd.desy.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "2": {"files": {          
                "root://storage01.lcg.cscs.ch:1096//pnfs/lcg.cscs.ch/cms/trivcat/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "3": {"files": {  
                "root://cmsdcache-kit-disk.gridka.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "4": {"files": {  
                "root://cms-xrdr.sdfarm.kr:1094//xrd/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "5": {"files": {  
                 "root://t2-xrdcms.lnl.infn.it:7070//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "6": {"files": {  
                "root://cmsxrootd.hep.wisc.edu:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "7": {"files": {  
                "root://maite.iihe.ac.be:1095//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "8": {"files": {  
                "root://xrootd01.jinr-t1.ru:1094//pnfs/jinr-t1.ru/data/cms/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "9": {"files": {  
                "root://deepthought.crc.nd.edu//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "10": {"files": {  
                "root://cmsxrd.ts.infn.it:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "11": {"files": {  
                "root://grid-cms-xrootd.physik.rwth-aachen.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "12": {"files": {  
                "root://cms.sscc.uos.ac.kr:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
}
for key in run2016f:
    try:
        preprocess({key: run2016f[key]}, recalculate_steps=True, files_per_batch=10, save_form=True)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(key, "FAILED")

Output for me right now:

0 FAILED [Disk downtime at FNAL today, though]
2 FAILED [T2_CH_CSCS]
5 FAILED [T2_IT_Legnaro]
12 FAILED [T3_KR_UOS]

JoyYTZhou commented 2 weeks ago

Hi,

I also encountered the same issue for two datasets with many files. I have tried adding IndexError to the file_exceptions option in ddc.do_preprocess. Unfortunately, the error is still not caught. I am guessing that it's because the error is raised by awkward. Has there been any new fix to skip the problematic files?
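
What I tried looks roughly like this (a sketch; file_exceptions, load_dataset_definition, and do_preprocess are the names referenced in this thread, while output_file and dataset_definition are placeholders):

from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

# Widen the per-file exception types that do_preprocess tolerates;
# dataset_definition stands in for my actual query definition.
ddc = DataDiscoveryCLI()
ddc.load_dataset_definition(dataset_definition)
ddc.do_preprocess(
    output_file="preprocessed",             # assumed argument name
    file_exceptions=(OSError, IndexError),  # IndexError added here
)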

lgray commented 2 weeks ago

It means that no sites returned a valid list of files when trying to establish their existence.

bockjoo commented 2 weeks ago

@bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

I think I was reading a file via the root:// protocol, not https, with skip_bad_files=True, and preprocess failed to open/read the file from FNAL.

bockjoo commented 2 weeks ago

When uproot raises an exception, it does not include the file name or the reason for the error; these should be added to make the error clearer, e.g., when raising OSError in fsspec_xrootd/xrootd.py.

JoyYTZhou commented 2 weeks ago

It means that no sites returned a valid list of files when trying to establish their existence.

How could that happen when the error does not appear during ddc.load_dataset_definition? I know for a fact that these files exist because I was using the generic root://cmsxrootd.fnal.gov/ redirector and was able to preprocess them in an older version of my code.

I thought the error was raised as long as there was one bad file.

bockjoo commented 2 weeks ago

At the moment, this fails:

xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root /dev/null

as reported here; I have reported it to a FNAL admin. Another option for opening the file is

root://cms-xrd-global.cern.ch:1094//store/test/xrootd/T1_US_FNAL/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

instead of

root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

Normally, it's supposed to be accessed using

root://cms-xrd-global.cern.ch:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

which will open the file from one of these sites:

T1_DE_KIT_Disk
T1_IT_CNAF_Disk
T1_RU_JINR_Disk
T1_US_FNAL_Disk
T2_BE_IIHE
T2_BE_UCL
T2_CH_CSCS
T2_DE_DESY
T2_DE_RWTH
T2_EE_Estonia
T2_FR_GRIF
T2_IT_Legnaro
T2_UK_London_IC
T2_US_Vanderbilt
T2_US_Wisconsin
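
A quick way to check which of these endpoints will actually serve the file is to try opening it with uproot directly (a sketch; the LFN is the one from this thread and the endpoint list is abbreviated):

import uproot

lfn = "/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root"
endpoints = [
    "root://cms-xrd-global.cern.ch:1094/",
    "root://cmsxrootd.fnal.gov/",
    "root://cmsxrootd.hep.wisc.edu:1094/",
]
for prefix in endpoints:
    try:
        # uproot accepts the "file:object" form for remote paths as well
        with uproot.open(prefix + lfn + ":Events") as events:
            print(prefix, "OK,", events.num_entries, "entries")
    except Exception as err:
        print(prefix, "FAILED:", err)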

lgray commented 2 weeks ago

Redirectors are known to be flaky for accessing files consistently; prior success unfortunately means you were merely lucky. You should find where this file is located and use a concrete xrootd endpoint instead of a redirector.

This particular error happens when you try to slice an array that consists entirely of None, which only happens when every single file you passed resulted in failure to access. Otherwise the fields that it is complaining about are all present and slicing will work as expected.
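
For illustration, you can reproduce the failure in awkward alone (a minimal sketch, not code from coffea itself):

import awkward as ak

# An all-None array has element type "?unknown", so slicing it with a list of
# field names raises the IndexError quoted at the top of this issue.
files = ak.Array([None, None])  # what preprocess is left with when every file fails
files[["file", "object_path", "steps", "num_entries", "uuid"]]  # raises IndexError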

I'll make a PR that should at least report this outcome more clearly. I'll @ you and you can try it.

lgray commented 2 weeks ago

Could one of you please try #1168?

JoyYTZhou commented 2 weeks ago

This fix produces the updated error message:

Exception: There was no populated list of files returned from querying your input dataset.
Please check your xrootd endpoints, and avoid redirectors.
Input dataset: /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22EENanoAODv12-130X_mcRun3_2022_realistic_postEE_v6-v2/NANOAODSIM
As parsed for querying: [{file: ..., ...}, {file: ..., ...}, ..., {file: ..., ...}, {file: ..., ...}]

If my dataset_definition contains several datasets and only one of them is failing like this, would it be possible to save at least the successfully preprocessed results?

lgray commented 2 weeks ago

@JoyYTZhou I have added something to the PR that should give this functionality. Have a look in the PR and give it a try.

JoyYTZhou commented 2 weeks ago

@lgray Now I do get the preprocessed result dumped to my terminal (I would've preferred it to be a json.gz), but it still only dumps the result for the dataset that failed (which shows None in every field).

Based on the printed table index, the failed dataset was not the first to be processed, yet none of the previous results were dumped. If the failed dataset always happened to be run first, the dumped result wouldn't be useful anyway. I could also just delete the failed dataset from my query; that always works.

lgray commented 2 weeks ago

@JoyYTZhou

All of the passed results are returned as two dictionaries (dataset_runnable and dataset_updated).

What gets dumped to the screen are only the failed runs, as a standard Python user warning. They are not meant for manipulation by the user, only to tell you what went wrong. This is why it is not dumped to a json file: it would serve no purpose, and copy/pasting is a user-interface design choice that does not scale well. You can also find out which datasets failed by finding the dataset keys in your input fileset that are not in the output dictionary of successfully parsed results.

You may save or further process the returned dictionaries however you wish.
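
For example, you can recover the failed keys and save the successful results from the two returned dictionaries (a sketch; fileset and the output file name are placeholders, and allow_empty_datasets is the option added in this PR):

import gzip
import json

from coffea.dataset_tools import preprocess

# preprocess returns the runnable and the updated filesets as two dicts.
dataset_runnable, dataset_updated = preprocess(fileset, allow_empty_datasets=True)

# Input dataset keys missing from the output are the ones that failed entirely.
failed = set(fileset) - set(dataset_runnable)
print("failed datasets:", sorted(failed))

# Save the successfully parsed results however you wish, e.g. as json.gz.
with gzip.open("dataset_runnable.json.gz", "wt") as fout:
    json.dump(dataset_runnable, fout, indent=2)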

lgray commented 2 weeks ago

@JoyYTZhou have you been able to 1) pass allow_empty_datasets=True to preprocess and then 2) access the successfully parsed datasets from what is returned by that function?

If you don't want to see the printout when the dataset fails you can use the control mechanisms available to you via https://docs.python.org/3/library/warnings.html.
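
For instance (a minimal sketch with the standard warnings module; UserWarning is an assumption about the category used):

import warnings

# Suppress the warning printed for datasets that failed preprocessing.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    dataset_runnable, dataset_updated = preprocess(fileset, allow_empty_datasets=True)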

JoyYTZhou commented 2 weeks ago

@lgray Yes, there's such an option; however, since preprocess is called by ddc.do_preprocess, and that is what users are really recommended to use, there needs to be **kwargs in do_preprocess in DataDiscoveryCLI so that I don't have to constantly go into the source code to turn options on/off.

If the successfully parsed datasets are returned by preprocess, then I should be able to see a json.gz produced by do_preprocess. I am not seeing that. I might use preprocess directly to check, but that rather defeats the purpose of using DataDiscoveryCLI.
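
What I'm asking for amounts to forwarding extra keyword arguments, along these lines (a hypothetical sketch, not the actual coffea code; the attribute holding the fileset is made up):

from coffea.dataset_tools import preprocess

class DataDiscoveryCLI:
    ...

    def do_preprocess(self, output_file=None, **kwargs):
        # Forward any extra preprocess options (e.g. allow_empty_datasets=True)
        # so users don't have to edit the source to toggle them.
        return preprocess(self.replica_fileset, **kwargs)  # attribute name is a guess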

ikrommyd commented 2 weeks ago

https://github.com/CoffeaTeam/coffea/pull/1137/files needs to be updated to add the extra arg.

JoyYTZhou commented 2 weeks ago

@lgray Actually, never mind: yes, the results do get dumped when allow_empty_datasets=True is passed to preprocess. I would still appreciate it if do_preprocess got **kwargs.

lgray commented 2 weeks ago

Composability does not defeat the purpose of a shortcut.

I'll add allow_empty_datasets to do_preprocess.

My bad for missing that you were using that as opposed to preprocess directly.

lgray commented 2 weeks ago

OK added to the rucio utils. Please give it a try.

JoyYTZhou commented 2 weeks ago

OK added to the rucio utils. Please give it a try.

Yes, I get the successful outputs now. Thank you. I think this fix closes the issue.