Closed: bockjoo closed this issue 2 weeks ago.
@bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.
I'll note that I've been seeing this with the occasional file opened via xrootd. One specific example: the /DoubleMuon/Run2016F*NanoAODv9-v1/NANOAOD file (there's just one, about 2.1 GB). I'm still investigating, because it seems to open fine in uproot from Wisconsin, but whichever replica is picked up by the DataDiscoveryCLI with round-robin replica choice triggers this error. (Also, the first option is the T1_US_FNAL disks, which are under maintenance today.)
Here's a single-file CMS dataset for which many replicas fail:
Sites availability for dataset: /DoubleMuon/Run2016F-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD
Available replicas
┏━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Index ┃ Site ┃ Files ┃ Availability ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ 0 │ T1_US_FNAL_Disk │ 1 / 1 │ 100.0% │
│ 1 │ T2_DE_DESY │ 1 / 1 │ 100.0% │
│ 2 │ T2_CH_CSCS │ 1 / 1 │ 100.0% │
│ 3 │ T1_DE_KIT_Disk │ 1 / 1 │ 100.0% │
│ 4 │ T3_KR_KISTI │ 1 / 1 │ 100.0% │
│ 5 │ T2_IT_Legnaro │ 1 / 1 │ 100.0% │
│ 6 │ T2_US_Wisconsin │ 1 / 1 │ 100.0% │
│ 7 │ T2_BE_IIHE │ 1 / 1 │ 100.0% │
│ 8 │ T1_RU_JINR_Disk │ 1 / 1 │ 100.0% │
│ 9 │ T3_US_NotreDame │ 1 / 1 │ 100.0% │
│ 10 │ T3_IT_Trieste │ 1 / 1 │ 100.0% │
│ 11 │ T2_DE_RWTH │ 1 / 1 │ 100.0% │
│ 12 │ T3_KR_UOS │ 1 / 1 │ 100.0% │
└───────┴─────────────────┴───────┴──────────────┘
This code should permit seeing the failure in action:
```python
from coffea.dataset_tools import preprocess

run2016f = {
    "0": {"files": {
        "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "1": {"files": {
        "root://dcache-cms-xrootd.desy.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "2": {"files": {
        "root://storage01.lcg.cscs.ch:1096//pnfs/lcg.cscs.ch/cms/trivcat/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "3": {"files": {
        "root://cmsdcache-kit-disk.gridka.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "4": {"files": {
        "root://cms-xrdr.sdfarm.kr:1094//xrd/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "5": {"files": {
        "root://t2-xrdcms.lnl.infn.it:7070//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "6": {"files": {
        "root://cmsxrootd.hep.wisc.edu:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "7": {"files": {
        "root://maite.iihe.ac.be:1095//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "8": {"files": {
        "root://xrootd01.jinr-t1.ru:1094//pnfs/jinr-t1.ru/data/cms/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "9": {"files": {
        "root://deepthought.crc.nd.edu//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "10": {"files": {
        "root://cmsxrd.ts.infn.it:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "11": {"files": {
        "root://grid-cms-xrootd.physik.rwth-aachen.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "12": {"files": {
        "root://cms.sscc.uos.ac.kr:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
}

for key in run2016f:
    try:
        preprocess({key: run2016f[key]}, recalculate_steps=True, files_per_batch=10, save_form=True)
    except Exception:
        print(key, "FAILED")
```
Output for me right now:
0 FAILED [Disk downtime at FNAL today, though]
2 FAILED [T2_CH_CSCS]
5 FAILED [T2_IT_Legnaro]
12 FAILED [T3_KR_UOS]
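Since only some replicas fail, a workaround while this is unresolved is to fall back across replicas rather than failing on the first bad one. This is a sketch only: `first_working_replica` and `open_replica` are hypothetical helpers, standing in for whatever opener you actually use (uproot, `preprocess`, ...).

```python
# Sketch: try each replica in turn and return the first that opens.
def first_working_replica(replicas, open_replica):
    errors = {}
    for key, url in replicas.items():
        try:
            return key, open_replica(url)
        except Exception as err:  # collect failures instead of aborting
            errors[key] = err
    raise RuntimeError(f"all replicas failed: {errors}")


# Illustration with a fake opener that only accepts the Wisconsin URL.
def fake_open(url):
    if "wisc" not in url:
        raise OSError(f"cannot open {url}")
    return url


key, _ = first_working_replica(
    {"0": "root://cmsdcadisk.fnal.gov//...", "6": "root://cmsxrootd.hep.wisc.edu//..."},
    fake_open,
)
print(key)  # -> 6
```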
Hi,
I also encountered the same issue for two datasets with many files. I have tried adding `IndexError` to the `file_exceptions` option in `ddc.do_preprocess`. Unfortunately, the error is still not caught; I am guessing that's because the error is raised by `awkward`. Has there been any new fix to skip the problematic files?
It means that no sites returned a valid list of files when trying to establish their existence.
> @bockjoo I thought you described this as an uproot problem in the slack where you saw something was deserializing incorrectly when using https.
I think I was reading a file via the root:// protocol, not https, with `skip_bad_files=True`, and the preprocess failed to open/read the file from FNAL.
When uproot raises an exception, it does not include the file name or the reason for the error; these should be added to make the error clearer, e.g., when raising `OSError` in fsspec_xrootd/xrootd.py.
> It means that no sites returned a valid list of files when trying to establish their existence.
How could that happen, when the error does not appear during `ddc.load_dataset_definition`? I know for a fact that these files exist, because I was using the generic root://cmsxrootd.fnal.gov/ redirector and was able to preprocess them in an older version of my code. I thought the error was raised as long as there was one bad file.
At the moment, this fails:
```
xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root /dev/null
```
as reported here; I have reported it to a FNAL admin. Another option is to open the file via

```
root://cms-xrd-global.cern.ch:1094//store/test/xrootd/T1_US_FNAL/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root
```

instead of

```
root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root
```
Normally, it's supposed to be accessed using

```
root://cms-xrd-global.cern.ch:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root
```
which will open the file from one of these sites:
T1_DE_KIT_Disk
T1_IT_CNAF_Disk
T1_RU_JINR_Disk
T1_US_FNAL_Disk
T2_BE_IIHE
T2_BE_UCL
T2_CH_CSCS
T2_DE_DESY
T2_DE_RWTH
T2_EE_Estonia
T2_FR_GRIF
T2_IT_Legnaro
T2_UK_London_IC
T2_US_Vanderbilt
T2_US_Wisconsin
Redirectors are known to be flaky for accessing files consistently; prior success unfortunately means you were only lucky. You should find out where this file is located and use a concrete xrootd endpoint instead of a redirector.
This particular error happens when you try to slice an array that consists entirely of `None`, which only occurs when every single file you passed failed to be accessed. Otherwise, the fields it is complaining about are all present and slicing works as expected.
I'll make a PR that should at least report this outcome more clearly. I'll @ you and you can try it.
Could one of you please try #1168?
This fix produces the updated error message:

```
Exception: There was no populated list of files returned from querying your input dataset.
Please check your xrootd endpoints, and avoid redirectors.
Input dataset: /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22EENanoAODv12-130X_mcRun3_2022_realistic_postEE_v6-v2/NANOAODSIM
As parsed for querying: [{file: ..., ...}, {file: ..., ...}, ..., {file: ..., ...}, {file: ..., ...}]
```
If my `dataset_definition` contains several datasets and only one of them fails like this, would it be possible to save at least the successfully preprocessed results?
@JoyYTZhou I have added something to the PR that should give this functionality. Have a look in the PR and give it a try.
@lgray Now I do get the preprocessed result dumped to my terminal (I would have preferred a json.gz), but it still only dumps the result for the dataset that failed (which shows `None` in every field).
Based on the printed table index, that failed dataset was not the first to be processed, yet none of the previous results were dumped. If the failed dataset somehow were always picked to run first, the preprocessed result wouldn't be useful. I could also delete the failed dataset from my query; that always works.
@JoyYTZhou
All of the passed results are returned as two dictionaries.
What gets dumped to the screen is only the failed runs, as a standard Python user warning. They are not meant for manipulation by the user, only to tell you what went wrong. This is why it is not dumped to a JSON file: that would serve no purpose, and copy/pasting is a user-interface design choice that does not scale well. You can also find out which datasets failed by finding the dataset keys that are in your input fileset but not in the output dictionary of successfully parsed results.
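That last suggestion is a one-liner. A minimal sketch with stand-in dictionaries (the dataset names and dict contents are made up for illustration):

```python
# Stand-ins: in practice, input_fileset is what you passed to preprocess
# and parsed_ok is its returned dict of successfully parsed datasets.
input_fileset = {"dyjets": {"files": {}}, "zz2l2nu": {"files": {}}}
parsed_ok = {"dyjets": {"files": {}}}

# The failed datasets are the keys present in the input but not the output.
failed = sorted(set(input_fileset) - set(parsed_ok))
print(failed)  # -> ['zz2l2nu']
```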
You may save or further process the returned dictionaries however you wish.
@JoyYTZhou have you been able to 1) pass `allow_empty_datasets=True` to preprocess and then 2) access the successfully parsed datasets from what that function returns?
If you don't want to see the printout when a dataset fails, you can use the control mechanisms available to you via https://docs.python.org/3/library/warnings.html.
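For example, the standard warnings machinery can silence these reports within a scope. The warning message below is a stand-in; the exact category and text coffea emits are an assumption here, so adjust the filter to what you actually see.

```python
import warnings

# Suppress UserWarning inside this scope; the failed-dataset report is a
# plain Python user warning, so a filter like this should hide it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    # Stand-in for the warning that preprocess would emit:
    warnings.warn("dataset X returned no files", UserWarning)
# Nothing was printed above, because the warning was filtered out.
```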
@lgray
Yes, there is such an option; however, since `preprocess` is called by `ddc.do_preprocess`, and that is what the user is really recommended to use, there needs to be `**kwargs` in `do_preprocess` in `DataDiscoveryCLI` so that I don't have to constantly go to the source code to turn options on/off.
If the successfully parsed datasets are returned by `preprocess`, then I should be able to see a `json.gz` produced by `do_preprocess`. I am not seeing that. I might use `preprocess` directly to check, but that rather defeats the purpose of using `DataDiscoveryCLI`.
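The requested change amounts to forwarding extra keyword arguments from the CLI wrapper down to the library call. A sketch under that assumption, not coffea's actual code (`preprocess_stub` and the signatures are illustrative):

```python
def preprocess_stub(fileset, **kwargs):
    # Stand-in for coffea.dataset_tools.preprocess; just echoes its options.
    return kwargs


def do_preprocess(fileset, step_size=100_000, **preprocess_kwargs):
    # Forward any extra options (e.g. allow_empty_datasets) so users don't
    # have to edit the source to toggle them.
    return preprocess_stub(fileset, step_size=step_size, **preprocess_kwargs)


print(do_preprocess({}, allow_empty_datasets=True))
# -> {'step_size': 100000, 'allow_empty_datasets': True}
```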
https://github.com/CoffeaTeam/coffea/pull/1137/files this needs to be updated to add the extra arg
@lgray Actually, never mind: the results do get dumped when `allow_empty_datasets=True` in `preprocess`. I would still appreciate it if `do_preprocess` got a `kwargs`.
Composability does not defeat the purpose of a shortcut.
I'll add allow_empty_datasets in do_preprocess.
My bad for missing that you were using that as opposed to `preprocess` directly.
OK added to the rucio utils. Please give it a try.
Yes, I get the successful outputs now. Thank you. I think this fix closes the issue.
I am trying to preprocess a file at FNAL with coffea 2024.6.1, but got an exception; it was unclear what went wrong.