Closed: ericvaandering closed this issue 5 months ago.
@ericvaandering, can you give more details on this issue?
@yuyiguo @ericvaandering I guess I am the one who started this. Shall we try to have a quick chat on Zoom? I am generally/usually available in your early morning (8-10), but if you prefer that I expand here, sure, let me know.
Yes, I am available today. Let me know when we can chat.
Well... today I had things to do. Let's plan it a little bit so that Eric may also join. 10 min should suffice.
What about tomorrow in your 8-10am window?
Adding some description after a chat with Eric and Yuyi (Katy was also there) @KatyEllis FYI
In lib/rucio/client/ there is a method by which a replica (PFN) can be declared suspicious, and a (hopefully matching) declare_suspicious_file_replicas(self, pfns, reason) in lib/rucio/api.
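For concreteness, a minimal sketch of what the client-side call might look like. This assumes the ReplicaClient from the standard Rucio client library and a working rucio.cfg / authentication setup; the PFN and the reason string are just placeholders.

```python
# Minimal sketch, assuming the ReplicaClient from the standard Rucio client
# library and a working rucio.cfg / authentication setup.  The PFN and the
# reason string below are placeholders.
from rucio.client.replicaclient import ReplicaClient

replica_client = ReplicaClient()

pfns = [
    # hypothetical PFN of the replica a job failed to read
    "davs://storage.example.site:1094/store/mc/SomeCampaign/SomeDataset/FILE.root",
]

# Declare the PFNs SUSPICIOUS (not BAD): the replicas are not invalidated,
# a record is simply added for the suspicious replica recoverer to act on later.
replica_client.declare_suspicious_file_replicas(pfns, reason="checksum mismatch seen by a CRAB job")
```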
Other points to clarify:
- AVAILABLE (the replica state?)
- which identity should make the call: crab_server, or crab_tape_recall, or other daemons/processes which currently write to AMQ? What would be the risk in granting the CRAB server the needed privilege?
- api: ...all proceeds via the same code (in core?). In the end, what exactly happens?

Did this issue get followed up somewhere else? Or is it just stale, and do we still need to validate @belforte's proposal with the various tasks?
This is still on my todo list and I am not tracking it elsewhere; I should. Then this can be put on hold until I have a proposal.
currently this breaks up as
This is on my to-do list too, but at low priority.
First thing is to flag jobs which hit corrupted files, and monitor that, so we can quantify the problem.
What is the definition of "suspicious replicas"? If a transfer failed, FTS will try to retransfer it. If the failure is permanent, how can the Replica Recoverer daemon fix it? And why would CMSSW or processing jobs read a suspicious replica?
Hi @yuyiguo . Let me try to list here what I (think that I) understand. Hopefully it answers some of your questions.
- ATLAS jobs do a rucio download of each file to the WN before reading it, so Rucio will verify the checksum there and, IIUC, automatically mark the replica as suspicious. I have tested this myself [1] and the error is duly detected, but I think something else is needed to actually mark the replica as suspicious. Most likely there is something else going on when ATLAS jobs try to read files, and we need to ask Dimitrios to be more explicit than in his talk mentioned in the comment above. Anyhow, we do not download files from storage to WNs.
- I am not sure what "metadata" means here.

Hope this helps!
[1]
belforte@lxplus701/belforte> rucio download cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root --rses T2_US_UCSD
2023-03-27 17:57:51,728 INFO Processing 1 item(s) for input
2023-03-27 17:57:51,847 INFO No preferred protocol impl in rucio.cfg: No section: 'download'
2023-03-27 17:57:51,848 INFO Using main thread to download 1 file(s)
2023-03-27 17:57:51,848 INFO Preparing download of cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:57:51,866 INFO Trying to download with davs and timeout of 4713s from T2_US_UCSD: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:57:51,936 INFO Using PFN: davs://redirector.t2.ucsd.edu:1095/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,713 WARNING Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,714 WARNING Download attempt failed. Try 1/2
2023-03-27 18:02:02,656 WARNING Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,657 WARNING Download attempt failed. Try 2/2
2023-03-27 18:02:02,670 ERROR Failed to download file cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,672 ERROR None of the requested files have been downloaded.
belforte@lxplus701/belforte>
BTW, the file in my example above has been fixed by Felipe and the download is now OK. https://mattermost.web.cern.ch/cms-o-and-c/pl/uw4mauamr3dxmf1aebpfsybcjo https://ggus.eu/?mode=ticket_info&ticket_id=160902
Some time ago I wrote this document with a proposal for how we can handle this. Has anybody had a chance to read it? Do we need to revisit that proposal? I am a bit confused about its current state.
Basically here is my proposal:
I would like to get some feedback on this proposal. Perhaps we can add more details to it. Once we agree on the plan, we can work out the details.
Thanks Igor, and apologies for not having replied earlier. I had indeed read your document and fully agree with it and with the plan outlined above. Doing 1. above in CRAB is on my list, reasonably close to the top: https://github.com/dmwm/CRABServer/issues/7548
I would like to see this at work with automatic/automated tools for a while before we think about enabling users; at that point we may have to introduce some way to "trust machines more than humans", IMHO.
One thing that I expect we can talk about later, but let me mention now: we can detect both missing and possibly corrupted files, and tell one from the other. Should we also flag the missing ones (i.e. clean open failures without a zip error or wrong checksum) as suspicious and sort of try to shortcut the CE? I am a bit worried, looking at the CE page, by how many sites simply fail week after week to give any useful result; it looks like only half the sites or so have a "done". I am not saying we should abandon that effort, simply complement it.
We can surely resume this once I have code which parses the CMSSW stderr!
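For the record, the kind of parsing I have in mind is roughly the sketch below. It is illustrative only: the error patterns are guesses at typical cmsRun/ROOT messages and would need to be validated against real job logs, and the two buckets (corrupted vs. missing) are just the distinction described above.

```python
# Rough sketch of classifying cmsRun file-read failures from the job stdout/stderr.
# Illustrative only: the exact strings and exit codes to match still need to be
# validated against real job logs.
import re

# Hypothetical patterns for the two cases we want to tell apart.
CORRUPTION_PATTERNS = [
    re.compile(r"Fatal Root Error: @SUB=TStorageFactoryFile::Init"),
    re.compile(r"is truncated at \d+ bytes"),
    re.compile(r"[Cc]hecksum (mismatch|validation failed)"),
]
MISSING_PATTERNS = [
    re.compile(r"[Nn]o such file or directory"),
    re.compile(r"[Ff]ile does not exist"),
]

def classify_failure(log_text):
    """Return 'corrupted', 'missing' or 'other' for a failed file open/read."""
    if any(p.search(log_text) for p in CORRUPTION_PATTERNS):
        return "corrupted"
    if any(p.search(log_text) for p in MISSING_PATTERNS):
        return "missing"
    return "other"
```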
Stefano @belforte , thanks for the reply and the feedback. I appreciate it.
I was thinking that it would make sense to have another meeting (I think we already had one some time ago) among the involved people to re-sync, discuss use cases, and maybe come up with an action plan.
I think we need to get at least @yuyiguo @dynamic-entropy (Rahul) @klannon there. I would invite @ericvaandering too but he is on vacation. Who am I missing ?
@belforte said: "we can detect both missing and possibly corrupted files, and tell one from the other."
I think it is important to differentiate between several types of failures.
I would add another dimension to this:
My understanding of the problem is that we want to use the "suspicious" replica state in the first case for a while before we declare the replica "bad" if things do not improve, whereas if we believe the error is not recoverable, we go straight to the "bad" replica state.
I do not think we need a (longish) meeting now. I'd like to have code which parses stdout/stderr and does a few "mark as suspicious" calls first. Questions arising during that work can be addressed as needed. In a way, I have my action plan. Maybe simply a 5-min slot in the usual CMS-Rucio dev meeting to make sure that everybody agrees with the plan which you outlined? I think you missed @dciangot; anyhow this is not urgent IMHO.
As to recoverable vs. non-recoverable: yes, I know, we already discussed this in the meeting where you first presented it. The problem here is how to be sure that the specific error is really a bad file and not a transient problem in the storage server: is the file really truncated, or was a connection dropped somewhere? So again, I'd like to get experience with the simpler path first. All in all, CRAB already retries file read failures 3 times (and WMA 10, IIUC), so if we e.g. say "3 times suspicious = bad", it may be good enough.
Just as FYI, here is the suspicious replica recoverer config for ATLAS:
[
{
"action": "declare bad",
"datatype": ["HITS"],
"scope": [],
"scope_wildcard": []
},
{
"action": "ignore",
"datatype": ["RAW"],
"scope": [],
"scope_wildcard": []
}
]
Igor, can you translate that into English?
I am afraid not yet
I looked at the code of the replica recoverer and here is what I understand:
It is parametrized by 2 parameters (there are some others but I think these are most relevant):
:param younger_than: The number of days since which bad_replicas table will be searched
for finding replicas declared 'SUSPICIOUS' at a specific RSE ('rse_expression'),
but 'AVAILABLE' on other RSE(s).
:param nattempts: The minimum number of appearances in the bad_replica DB table
in order to appear in the resulting list of replicas for recovery.
My understanding of how it works:
It runs through all the RSEs with "enable_suspicious_file_recovery=true" flag.
It finds all suspicious replicas matching the (younger_than, nattempts) criteria.
For each such replica:
I do not see how the "scope" and "scope_wildcard" fields from the JSON file are used.
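To check that I read it right, here is a schematic, simplified re-statement of the selection and policy steps. This is not the actual daemon code: the record layout is invented just to illustrate the counting, and only younger_than / nattempts and the JSON "action"/"datatype" entries correspond to the real parameters.

```python
# Schematic re-statement of how I read the recoverer's selection step.
# NOT the actual Rucio code: the records are plain dicts invented for the
# illustration; only younger_than / nattempts and the JSON policy entries
# correspond to the real parameters.
from datetime import datetime, timedelta

def select_for_recovery(suspicious_records, younger_than, nattempts):
    """suspicious_records: list of dicts like
       {'scope': ..., 'name': ..., 'rse': ..., 'declared_at': datetime}.
       Returns the (scope, name, rse) keys declared SUSPICIOUS at least
       `nattempts` times within the last `younger_than` days."""
    cutoff = datetime.utcnow() - timedelta(days=younger_than)
    counts = {}
    for rec in suspicious_records:
        if rec["declared_at"] < cutoff:
            continue
        key = (rec["scope"], rec["name"], rec["rse"])
        counts[key] = counts.get(key, 0) + 1
    return [key for key, n in counts.items() if n >= nattempts]

def action_for(datatype, policy):
    """policy: the JSON entries above ({'action': ..., 'datatype': [...]}).
       Returns the configured action ('declare bad', 'ignore', ...) or None."""
    for entry in policy:
        if datatype in entry.get("datatype", []):
            return entry["action"]
    return None
```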
Thinking about this, I wonder how we can handle the case where the replica fails to read, say, 3 times and succeeds on the 4th attempt.
The database will have 3 "suspicious" records, and if nattempts = 3, the replica recoverer will flag the replica as bad because it does not know that the 4th attempt was actually successful.
Would it make sense to start collecting data on bad replicas by filling the database, but wait to decide how to act on it until we have accumulated enough data to understand the patterns? I don't think it's so bad if we occasionally "recover" a replica that's not really bad, as long as most of the time the recoveries really target corrupted files. But it would be nice to have some quantification of what "occasionally" and "most of the time" actually look like in practice.
That sounds good to me @klannon
In order to quantify the difference between "occasionally" and "most of the time", we will have to record not only failures (which we know how to do, using the "suspicious replica" mechanism) but also successes. Is there a mechanism to do that?
Could Rucio detect that the file is OK when it goes to fix the bad replica? I do not see a way for the client, when it correctly opens a file, to check whether by any chance someone flagged it as suspicious and to clear that flag. But what is really the concern here? That there was some intermittent error which got resolved by itself, so we waste one re-transfer? Let's see how many re-transfers we do first!
I guess that brings us back to the more fundamental question: why not declare the replica as bad right away and risk an unneeded re-transfer?
Is that not why we are even discussing this, to avoid reacting to transient, self-recoverable errors?
If we make all the clients, on a successful opening of a file, interact with Rucio to see whether there is a suspicious-replica flag in the database, then, because the vast majority of all file access attempts are successful, would that not create an unneeded inefficiency?
why not declare the replica as bad right away and risk an unneeded re-transfer ?
Yeah... we need to pick a number of failures which gives us some confidence that the replica is really bad. 1 is a good choice. 3 is also a good choice. 10 seems over-conservative to me. We need experience. The question is how to collect that experience without writing so much stuff that we spend all our time debugging the reporting of successful opens and its (over)load on the system.
Lacking that info, we can simply count how many replicas we end up marking as bad. If the number is low when waiting for 3 suspicious reports, we can lower the threshold. Or start with 1 and possibly increase it.
Here are 2 suggestions:
Aggressive: set the threshold to 1 (declare the replica as bad right away) and see if we run into problems. If we do, increase the threshold.
Conservative: set the threshold to 5 and the time window to ... a week (?), use the suspicious_replica_recoverer to declare the replica as bad, and see if we have problems. If we do, adjust the threshold.
I think this is pretty much what Stefano is proposing
And to answer the question Stefano asked earlier: No, I do not think we can ask Rucio to check the replica, which was already declared as bad, before initiating the transfer to recreate it.
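Just to write the two options down in one place, expressed with the recoverer parameters quoted earlier (the numbers are the ones proposed above; how exactly they end up in the deployed configuration still needs to be checked):

```python
# The two proposed starting points, expressed with the recoverer parameters
# quoted earlier.  The numbers are the ones suggested above; the week-long
# window for the aggressive option is an arbitrary placeholder.
AGGRESSIVE   = {"nattempts": 1, "younger_than": 7}  # declare bad on the first suspicious report
CONSERVATIVE = {"nattempts": 5, "younger_than": 7}  # require 5 reports within a week
```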
Here are 2 suggestions:
Let's wait until 1) I manage to mark suspicious replicas, 2) we agree that it is done correctly, 3) we have an idea of how often it happens.
I plan to resubmit jobs which failed because of this, so 3 "tags" are somehow guaranteed to arrive in O(1 day) if the problem is real.
I understand that the ball is on my side atm. Will do my best to kick it back to you!
As a site operator, here is my two cents:
The signature for "missing file" is "not exist" (in the namespace). The response is pretty quick. The remedy is to copy the file, assuming the file should be there.
If the file is in the namespace but not available, it usually indicates a hardware or system issue. An automatic replacement might NOT solve the problem and might, very likely, create a new one. The best way is to notify the site; they might already be working on it.
If the transfer timed out, there are many possibilities and it would be very hard to fix it right. We might "flag" it but not fix it.
Keep in mind that we are dealing with program bugs from time to time (guess who put the bugs in there? developers!). Whatever we do must have a safeguard, and any "automatic fix" should have an option for the site to opt out. Whatever we do, automatically or not, the site should be notified.
@huangch-fnal Chih-Hao, thanks for your input.
Let me try to summarize your feedback in terms of condition-action pairs:
Did I get it right ?
Close. 3. and 4. may or may not be recoverable. To sum up, a transfer failure due to an actually corrupted file is rare. In all cases, the site should be notified. At FNAL, we are not in favor of automatic recovery unless it is proven safe.
Here's a fresh example of a corrupted file (at T2_US_Florida): https://cms-talk.web.cern.ch/t/corrupted-doublemuon-miniaod-file/29163
I am happy to know that this will never happen at FNAL, but I do not see why there should be any question in such cases about automatically declaring the replica bad and triggering a re-transfer from tape. What do you think @huangch-fnal? Would any harm be done in such a case? One aim we have here is also to avoid bugging site admins unless absolutely needed, i.e. after everything that could be done by machines has failed.
@ivmfnal In this example the user retried the job 14 times! Can we assume that once CRAB declares the replica suspicious maybe "only" 6 times, the automatic part can kick in? I'd rather not wait a week, though.
@belforte my understanding is that the replica recoverer can be configured like that
@ivmfnal @ericvaandering @dciangot I have started on the code to report bad files in CRAB. But I have found that the cmsRun output does not provide (clear) information on whether the file open attempt was local or on a remote site via xrootd. So I do not have a good way to say which RSE it is (or the PFN required by Rucio).
Is there a way to report only a suspicious DID to Rucio and have it figure out which of the possibly multiple disk replicas is corrupted? Or do we need an additional daemon on our side, somewhere?
See e.g. the example in https://github.com/dmwm/CRABServer/issues/7548#issuecomment-1713880031: even if the bad file is at CERN, cmsRun reports
[c] Fatal Root Error: @SUB=TStorageFactoryFile::Init
file root://cmsxrootd.fnal.gov//store/user/belforte/truncatedFile.root is truncated at 52428800 bytes
No, Rucio doesn’t have any innate checking abilities like this.
The “daemon” you suppose would try reading every file directly over xrootd to find the corrupted one(s)?
Or do a rucio get, which checks the checksum.
Would it not be easier to have CRAB be specific about which replica failed to read?
CRAB needs CMSSW to tell whether the error happened opening a local file or a fallback one. See my example: the corrupted file is at CERN, yet cmsRun only mentions FNAL :-(
I know. Would it not be easier to have CRAB tell which replica was corrupt, the remote or the local one?
Which means: I can run that daemon! Yes, I know. I may have to do something of that sort anyhow. Mostly I am realizing that we do not have a good classification/reporting of the file open/read errors.
I wasn't suggesting that you need to write or run it. Just the process you are suggesting. WMAgent will probably run into the same issue.
So I understand the problem:
CRAB gets this error from CMSSW but CMSSW does not give enough information to know if the file is read locally or remotely?
Then, even if we knew "remotely" I could imagine problems knowing which remote file was read. CMSSW may not have any way of knowing that.
Correct. Usually when xrootd succeeds in opening a remote file, the new PFN is printed in the cmsRun log, but much to my disappointment, when the file open fails, nothing gets printed. In a way, since the file is remote to begin with, CMSSW handles it like a "first time open" and fails with 8020, not as a "fallback open", which would result in 8028.
Of course many times the file will be opened locally, and then the site where the job runs is sufficient information. But I am not sure how to reliably tell.
I think we need some "exploratory work", so I am going to simply report "suspected corruptions" as files on CERN EOS, to be consumed by some script (a crontab, e.g.) which can cross-check and possibly make a Rucio call; that script can check the multiple replicas, if any.
IIUC Dima's plans, NANO* aside, we are going to have a single disk copy of files.
(Somehow) later on we can think of incorporating all that code into something that parses the cmsRun stdout and makes a call to Rucio on the fly. But I am not sure that we want to run the Rucio client on worker nodes; it should be available since it is on cvmfs. I guess I will not post more for a bit, while I try to map out what the situation is out there.
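To make the crontab idea a bit more concrete, here is a sketch of the consumer script I have in mind. Everything about the drop box (the path, the one-LFN-per-line report format) is an assumption and the cross-check is left as a stub; only the two ReplicaClient calls (list_replicas, declare_suspicious_file_replicas) are standard Rucio client API.

```python
# Sketch of the cron-style consumer: read LFNs reported by jobs as "suspected
# corrupted", look up their replicas in Rucio, and declare the ones failing a
# cross-check as suspicious.  The drop-box path and report format (one LFN per
# line) are assumptions; the cross-check is a stub.
import glob

from rucio.client.replicaclient import ReplicaClient

REPORT_GLOB = "/eos/cms/store/somewhere/suspected_corruptions/*.txt"  # hypothetical drop box

def replica_seems_corrupted(pfn):
    """Placeholder cross-check: a real implementation might re-read the file or
       compare the stored checksum.  Conservatively answers 'no' for now."""
    return False

def main():
    client = ReplicaClient()
    for report in glob.glob(REPORT_GLOB):
        with open(report) as fh:
            lfns = [line.strip() for line in fh if line.strip()]
        for lfn in lfns:
            # Each replica entry carries the per-RSE PFNs, which is what
            # declare_suspicious_file_replicas expects.
            for rep in client.list_replicas([{"scope": "cms", "name": lfn}]):
                for rse, rse_pfns in rep.get("rses", {}).items():
                    bad = [p for p in rse_pfns if replica_seems_corrupted(p)]
                    if bad:
                        client.declare_suspicious_file_replicas(
                            bad, reason="suspected corruption reported by a CRAB job")

if __name__ == "__main__":
    main()
```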
Would need something done with traces (declaring things suspicious, maybe some logic to deal with xrootd and specific exit codes?)
Then need to run the replica recoverer daemon.