Closed: ericvaandering closed this issue 5 months ago.
@ericvaandering, can you give more details on this issue?
@yuyiguo @ericvaandering I guess I am the one who started this. Shall we try to have a quick chat on Zoom? I am generally/usually available in your early morning (8-10), but if you prefer that I expand here, sure, let me know.
Yes, I am available today. Let me know when we can chat.
Well... today I had things to do. Let's plan it a little bit so that Eric may also join. 10 min should suffice.
What about tomorrow in your 8-10am window?
Adding some description after a chat with Eric and Yuyi (Katy was also there) @KatyEllis FYI
In lib/rucio/client/ there is a method by which a replica (PFN) can be declared suspicious, and a (hopefully matching) declare_suspicious_file_replicas(self, pfns, reason) in lib/rucio/api.
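For concreteness, a minimal sketch of what the client-side call might look like. This assumes the ReplicaClient from the standard Rucio client library and a working rucio.cfg / authentication setup; the PFN and the reason string are just placeholders.

```python
# Minimal sketch, assuming the ReplicaClient from the standard Rucio client
# library and a working rucio.cfg / authentication setup.  The PFN and the
# reason string below are placeholders.
from rucio.client.replicaclient import ReplicaClient

replica_client = ReplicaClient()

pfns = [
    # hypothetical PFN of the replica a job failed to read
    "davs://storage.example.site:1094/store/mc/SomeCampaign/SomeDataset/FILE.root",
]

# Declare the PFNs SUSPICIOUS (not BAD): the replicas are not invalidated,
# a record is simply added for the suspicious replica recoverer to act on later.
replica_client.declare_suspicious_file_replicas(pfns, reason="checksum mismatch seen by a CRAB job")
```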
Other points to clarify:
- AVAILABLE (the replica state?)
- which identity should make the call: crab_server, or crab_tape_recall, or other daemons/processes which currently write to AMQ? What would be the risk in granting the CRAB server the needed privilege?
- api: ...all proceeds via the same code (in core?). In the end, what exactly happens?

Did this issue get followed up somewhere else? Or is it just stale, and do we still need to validate @belforte's proposal with the various tasks?
This is still on my todo list and I am not tracking it elsewhere; I should. Then this can be put on hold until I have a proposal.
currently this breaks up as
This is on my to-do list too, but at low priority.
First thing is to flag jobs which hit corrupted files, and monitor that, so we can quantify the problem.
What is the definition of "suspicious replicas"? If a transfer failed, FTS will try to retransfer it. If the failure is permanent, how can the Replica Recoverer daemon fix it? And why would CMSSW or processing jobs read a suspicious replica?
Hi @yuyiguo . Let me try to list here what I (think that I) understand. Hopefully it answers some of your questions.
- ATLAS jobs do a rucio download of each file to the WN before reading it, so Rucio will verify the checksum there and, IIUC, automatically mark the replica as suspicious. I have tested this myself [1] and the error is duly detected, but I think something else is needed to actually mark the replica as suspicious. Most likely there is something else going on when ATLAS jobs try to read files, and we need to ask Dimitrios to be more explicit than in his talk mentioned in the comment above. Anyhow, we do not download files from storage to WNs.
- I am not sure what "metadata" means here.

Hope this helps!
[1]
belforte@lxplus701/belforte> rucio download cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root --rses T2_US_UCSD
2023-03-27 17:57:51,728 INFO Processing 1 item(s) for input
2023-03-27 17:57:51,847 INFO No preferred protocol impl in rucio.cfg: No section: 'download'
2023-03-27 17:57:51,848 INFO Using main thread to download 1 file(s)
2023-03-27 17:57:51,848 INFO Preparing download of cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:57:51,866 INFO Trying to download with davs and timeout of 4713s from T2_US_UCSD: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:57:51,936 INFO Using PFN: davs://redirector.t2.ucsd.edu:1095/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,713 WARNING Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,714 WARNING Download attempt failed. Try 1/2
2023-03-27 18:02:02,656 WARNING Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,657 WARNING Download attempt failed. Try 2/2
2023-03-27 18:02:02,670 ERROR Failed to download file cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,672 ERROR None of the requested files have been downloaded.
belforte@lxplus701/belforte>
BTW, the file in my example above has been fixed by Felipe and the download is now OK. https://mattermost.web.cern.ch/cms-o-and-c/pl/uw4mauamr3dxmf1aebpfsybcjo https://ggus.eu/?mode=ticket_info&ticket_id=160902
Some time ago I wrote this document with a proposal for how we can handle this. Has anybody had a chance to read it? Do we need to revisit that proposal? I am a bit confused about its current state.
Basically here is my proposal:
I would like to get some feedback on this proposal. Perhaps we can add more details to it. Once we agree on the plan, we can work out the details.
Thanks Igor, and apologies for not having replied earlier. I had indeed read your document and fully agree with it and with the plan outlined above. Doing 1. above in CRAB is on my list, reasonably close to the top: https://github.com/dmwm/CRABServer/issues/7548
I would like to see this at work with automatic/automated tools for a while before we think about enabling users; at that point we may have to introduce some way to "trust machines more than humans", IMHO.
One thing that I expect we can talk about later, but let me mention now: we can detect both missing and possibly corrupted files, and tell one from the other. Should we also flag the missing ones (i.e. clean open failures without a zip error or wrong checksum) as suspicious and sort of try to shortcut the CE? I am a bit worried, looking at the CE page, by how many sites simply fail week after week to give any useful result; it looks like only half the sites or so have a "done". I am not saying we should abandon that effort, simply complement it.
We can surely resume this once I have code which parses the CMSSW stderr!
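For the record, the kind of parsing I have in mind is roughly the sketch below. It is illustrative only: the error patterns are guesses at typical cmsRun/ROOT messages and would need to be validated against real job logs, and the two buckets (corrupted vs. missing) are just the distinction described above.

```python
# Rough sketch of classifying cmsRun file-read failures from the job stdout/stderr.
# Illustrative only: the exact strings and exit codes to match still need to be
# validated against real job logs.
import re

# Hypothetical patterns for the two cases we want to tell apart.
CORRUPTION_PATTERNS = [
    re.compile(r"Fatal Root Error: @SUB=TStorageFactoryFile::Init"),
    re.compile(r"is truncated at \d+ bytes"),
    re.compile(r"[Cc]hecksum (mismatch|validation failed)"),
]
MISSING_PATTERNS = [
    re.compile(r"[Nn]o such file or directory"),
    re.compile(r"[Ff]ile does not exist"),
]

def classify_failure(log_text):
    """Return 'corrupted', 'missing' or 'other' for a failed file open/read."""
    if any(p.search(log_text) for p in CORRUPTION_PATTERNS):
        return "corrupted"
    if any(p.search(log_text) for p in MISSING_PATTERNS):
        return "missing"
    return "other"
```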
Stefano @belforte , thanks for the reply and the feedback. I appreciate it.
I was thinking that it would make sense to have another meeting (I think we already had one some time ago) among the involved people to re-sync, discuss use cases, and maybe come up with an action plan.
I think we need to get at least @yuyiguo @dynamic-entropy (Rahul) @klannon there. I would invite @ericvaandering too but he is on vacation. Who am I missing ?
@belforte said: "we can detect both missing and possibly corrupted files, and tell one from the other."
I think it is important to differentiate between several types of failures.
I would add another dimension to this:
My understanding of the problem is that we want to use the "suspicious" replica state in the first case for a while before we declare the replica "bad" if things do not improve, whereas if we believe the error is not recoverable, we go straight to the "bad" replica state.
I do not think we need a (longish) meeting now. I'd like to have code which parses stdout/stderr and does a few "mark as suspicious" calls first. Questions arising during that work can be addressed as needed. In a way, I have my action plan. Maybe simply a 5-min slot in the usual CMS-Rucio dev meeting to make sure that everybody agrees with the plan which you outlined? I think you missed @dciangot; anyhow this is not urgent IMHO.
As to recoverable vs. non-recoverable: yes, I know, we already discussed this in the meeting where you first presented it. The problem here is how to be sure that the specific error is really a bad file and not a transient problem in the storage server: is the file really truncated, or was a connection dropped somewhere? So again, I'd like to get experience with the simpler path first. All in all, CRAB already retries file read failures 3 times (and WMA 10, IIUC), so if we e.g. say "3 times suspicious = bad", it may be good enough.
Just as FYI, here is the suspicious replica recoverer config for ATLAS:
[
{
"action": "declare bad",
"datatype": ["HITS"],
"scope": [],
"scope_wildcard": []
},
{
"action": "ignore",
"datatype": ["RAW"],
"scope": [],
"scope_wildcard": []
}
]
Igor, can you translate that into English?
I am afraid not yet
I looked at the code of the replica recoverer and here is what I understand:
It is parametrized by 2 parameters (there are some others but I think these are most relevant):
:param younger_than: The number of days since which bad_replicas table will be searched
for finding replicas declared 'SUSPICIOUS' at a specific RSE ('rse_expression'),
but 'AVAILABLE' on other RSE(s).
:param nattempts: The minimum number of appearances in the bad_replica DB table
in order to appear in the resulting list of replicas for recovery.
My understanding of how it works:
It runs through all the RSEs with "enable_suspicious_file_recovery=true" flag.
It finds all suspicious replicas matching the (younger_than, nattempts) criteria.
For each such replica:
I do not see how the "scope" and "scope_wildcard" fields from the JSON file are used.
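To check that I read it right, here is a schematic, simplified re-statement of the selection and policy steps. This is not the actual daemon code: the record layout is invented just to illustrate the counting, and only younger_than / nattempts and the JSON "action"/"datatype" entries correspond to the real parameters.

```python
# Schematic re-statement of how I read the recoverer's selection step.
# NOT the actual Rucio code: the records are plain dicts invented for the
# illustration; only younger_than / nattempts and the JSON policy entries
# correspond to the real parameters.
from datetime import datetime, timedelta

def select_for_recovery(suspicious_records, younger_than, nattempts):
    """suspicious_records: list of dicts like
       {'scope': ..., 'name': ..., 'rse': ..., 'declared_at': datetime}.
       Returns the (scope, name, rse) keys declared SUSPICIOUS at least
       `nattempts` times within the last `younger_than` days."""
    cutoff = datetime.utcnow() - timedelta(days=younger_than)
    counts = {}
    for rec in suspicious_records:
        if rec["declared_at"] < cutoff:
            continue
        key = (rec["scope"], rec["name"], rec["rse"])
        counts[key] = counts.get(key, 0) + 1
    return [key for key, n in counts.items() if n >= nattempts]

def action_for(datatype, policy):
    """policy: the JSON entries above ({'action': ..., 'datatype': [...]}).
       Returns the configured action ('declare bad', 'ignore', ...) or None."""
    for entry in policy:
        if datatype in entry.get("datatype", []):
            return entry["action"]
    return None
```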
Thinking about this, I wonder how we can handle the case where the replica fails to read, say, 3 times and succeeds on the 4th attempt.
The database will have 3 "suspicious" records, and if nattempts = 3, the replica recoverer will flag the replica as bad because it does not know that the 4th attempt was actually successful.
Would it make sense to start collecting data on bad replicas by filling the database, but wait to decide how to act on it until we have accumulated enough data to understand the patterns? I don't think it's so bad if we occasionally "recover" a replica that's not really bad, as long as most of the time the recoveries really target corrupted files. But it would be nice to have some quantification of what "occasionally" and "most of the time" actually look like in practice.
That sounds good to me @klannon
In order to quantify the difference between "occasionally" and "most of the time", we will have to record not only failures (which we know how to do, using the "suspicious replica" mechanism) but also successes. Is there a mechanism to do that?
Could Rucio detect that the file is OK when it goes to fix the bad replica? I do not see a way for the client, when it correctly opens a file, to check whether by any chance someone flagged it as suspicious and to clear that flag. But what is really the concern here? That there was some intermittent error which got resolved by itself, so we waste one re-transfer? Let's see how many re-transfers we do first!
I guess that brings us back to the more fundamental question: why not declare the replica as bad right away and risk an unneeded re-transfer?
Is that not why we are even discussing this, to avoid reacting to transient, self-recoverable errors?
If we make all the clients, on a successful opening of a file, interact with Rucio to see whether there is a suspicious-replica flag in the database, then, because the vast majority of all file access attempts are successful, would that not create an unneeded inefficiency?
why not declare the replica as bad right away and risk an unneeded re-transfer ?
Yeah... we need to pick a number of failures which gives us some confidence that the replica is really bad. 1 is a good choice. 3 is also a good choice. 10 seems over-conservative to me. We need experience. The question is how to collect that experience without writing so much stuff that we spend all our time debugging the reporting of successful opens and its (over)load on the system.
Lacking that info, we can simply count how many replicas we end up marking as bad. If the number is low when waiting for 3 suspicious reports, we can lower the threshold. Or start with 1 and possibly increase it.
Here are 2 suggestions:
Aggressive: set the threshold to 1 (declare the replica as bad right away) and see if we run into problems. If we do, increase the threshold.
Conservative: set the threshold to 5 and the time window to ... a week (?), use the suspicious_replica_recoverer to declare the replica as bad, and see if we have problems. If we do, adjust the threshold.
I think this is pretty much what Stefano is proposing
And to answer the question Stefano asked earlier: No, I do not think we can ask Rucio to check the replica, which was already declared as bad, before initiating the transfer to recreate it.
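Just to write the two options down in one place, expressed with the recoverer parameters quoted earlier (the numbers are the ones proposed above; how exactly they end up in the deployed configuration still needs to be checked):

```python
# The two proposed starting points, expressed with the recoverer parameters
# quoted earlier.  The numbers are the ones suggested above; the week-long
# window for the aggressive option is an arbitrary placeholder.
AGGRESSIVE   = {"nattempts": 1, "younger_than": 7}  # declare bad on the first suspicious report
CONSERVATIVE = {"nattempts": 5, "younger_than": 7}  # require 5 reports within a week
```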
Here are 2 suggestions:
Let's wait until 1) I manage to mark suspicious replicas, 2) we agree that it is done correctly, 3) we have an idea of how often it happens.
I plan to resubmit jobs which failed because of this, so 3 "tags" are somehow guaranteed to arrive in O(1 day) if the problem is real.
I understand that the ball is on my side atm. Will do my best to kick it back to you!
As a site operator, here is my two cents:
The signature for "missing file" is "not exist" (in the namespace). The response is pretty quick. The remedy is to copy the file, assuming the file should be there.
If the file is in the namespace but not available, it usually indicates a hardware or system issue. An automatic replacement might NOT solve the problem and might, very likely, create a new one. The best way is to notify the site; they might already be working on it.
If the transfer timed out, there are many possibilities and it would be very hard to fix it right. We might "flag" it but not fix it.
Keep in mind that we are dealing with program bugs from time to time (guess who put the bugs in there? developers!). Whatever we do must have a safeguard, and any "automatic fix" should have an option for the site to opt out. Whatever we do, automatically or not, the site should be notified.
@huangch-fnal Chih-Hao, thanks for your input.
Let me try to summarize your feedback in terms of condition-action pairs:
Did I get it right ?
Close. 3. and 4. may or may not be recoverable. To sum up, a transfer failure due to an actually corrupted file is rare. In all cases, the site should be notified. At FNAL, we are not in favor of automatic recovery unless it is proven safe.
Here's a fresh example of a corrupted file (at T2_US_Florida): https://cms-talk.web.cern.ch/t/corrupted-doublemuon-miniaod-file/29163
I am happy to know that this will never happen at FNAL, but I do not see why there should be any question in such cases about automatically declaring the replica bad and triggering a re-transfer from tape. What do you think @huangch-fnal? Would any harm be done in such a case? One aim we have here is also to avoid bugging site admins unless absolutely needed, i.e. after everything that could be done by machines has failed.
@ivmfnal In this example the user retried the job 14 times! Can we assume that once CRAB declares the replica suspicious maybe "only" 6 times, the automatic part can kick in? I'd rather not wait a week, though.
@belforte my understanding is that the replica recoverer can be configured like that
@ivmfnal @ericvaandering @dciangot I have started on the code to report bad files in CRAB. But I have found that the cmsRun output does not provide (clear) information on whether the file open attempt was local or on a remote site via xrootd. So I do not have a good way to say which RSE it is (or the PFN required by Rucio).
Is there a way to report only a suspicious DID to Rucio and have it figure out which of the possibly multiple disk replicas is corrupted? Or do we need an additional daemon on our side, somewhere?
See e.g. the example in https://github.com/dmwm/CRABServer/issues/7548#issuecomment-1713880031: even if the bad file is at CERN, cmsRun reports
[c] Fatal Root Error: @SUB=TStorageFactoryFile::Init
file root://cmsxrootd.fnal.gov//store/user/belforte/truncatedFile.root is truncated at 52428800 bytes
No, Rucio doesn’t have any innate checking abilities like this.
The “daemon” you suppose would try reading every file directly over xrootd to find the corrupted one(s)?
Or do a rucio get, which checks the checksum.
Would it not be easier to have CRAB be specific about which replica failed to read?
CRAB needs CMSSW to tell whether the error happened opening a local file or a fallback one. See my example: the corrupted file is at CERN, yet cmsRun only mentions FNAL :-(
I know. Would it not be easier to have CRAB tell which replica was corrupt, the remote or the local one?
Which means: I can run that daemon! Yes, I know. I may have to do something of that sort anyhow. Mostly I am realizing that we do not have a good classification/reporting of the file open/read errors.
I wasn't suggesting that you need to write or run it. Just the process you are suggesting. WMAgent will probably run into the same issue.
So I understand the problem:
CRAB gets this error from CMSSW but CMSSW does not give enough information to know if the file is read locally or remotely?
Then, even if we knew "remotely" I could imagine problems knowing which remote file was read. CMSSW may not have any way of knowing that.
Correct. Usually when xrootd succeeds in opening a remote file, the new PFN is printed in the cmsRun log, but much to my disappointment, when the file open fails, nothing gets printed. In a way, since the file is remote to begin with, CMSSW handles it like a "first time open" and fails with 8020, not as a "fallback open", which would result in 8028.
Of course many times the file will be opened locally, and then the site where the job runs is sufficient information. But I am not sure how to reliably tell.
I think we need some "exploratory work", so I am going to simply report "suspected corruptions" as files on CERN EOS, to be consumed by some script (a crontab, e.g.) which can cross-check and possibly make a Rucio call; that script can check the multiple replicas, if any.
IIUC Dima's plans, NANO* aside, we are going to have a single disk copy of files.
(Somehow) later on we can think of incorporating all that code into something that parses the cmsRun stdout and makes a call to Rucio on the fly. But I am not sure that we want to run the Rucio client on worker nodes; it should be available since it is on cvmfs. I guess I will not post more for a bit, while I try to map out what the situation is out there.
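To make the crontab idea a bit more concrete, here is a sketch of the consumer script I have in mind. Everything about the drop box (the path, the one-LFN-per-line report format) is an assumption and the cross-check is left as a stub; only the two ReplicaClient calls (list_replicas, declare_suspicious_file_replicas) are standard Rucio client API.

```python
# Sketch of the cron-style consumer: read LFNs reported by jobs as "suspected
# corrupted", look up their replicas in Rucio, and declare the ones failing a
# cross-check as suspicious.  The drop-box path and report format (one LFN per
# line) are assumptions; the cross-check is a stub.
import glob

from rucio.client.replicaclient import ReplicaClient

REPORT_GLOB = "/eos/cms/store/somewhere/suspected_corruptions/*.txt"  # hypothetical drop box

def replica_seems_corrupted(pfn):
    """Placeholder cross-check: a real implementation might re-read the file or
       compare the stored checksum.  Conservatively answers 'no' for now."""
    return False

def main():
    client = ReplicaClient()
    for report in glob.glob(REPORT_GLOB):
        with open(report) as fh:
            lfns = [line.strip() for line in fh if line.strip()]
        for lfn in lfns:
            # Each replica entry carries the per-RSE PFNs, which is what
            # declare_suspicious_file_replicas expects.
            for rep in client.list_replicas([{"scope": "cms", "name": lfn}]):
                for rse, rse_pfns in rep.get("rses", {}).items():
                    bad = [p for p in rse_pfns if replica_seems_corrupted(p)]
                    if bad:
                        client.declare_suspicious_file_replicas(
                            bad, reason="suspected corruption reported by a CRAB job")

if __name__ == "__main__":
    main()
```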
Would need something done with traces (declaring things suspicious, maybe some logic to deal with xrootd and specific exit codes?)
Then need to run the replica recoverer daemon.