[DEV] Suspicious replica recoveror: Enrich rucio tracers to include file read errors

haozturk commented 9 months ago

Needed for https://github.com/dmwm/CMSRucio/issues/403. Traces come from:

 WMArchive: /topic/cms.jobmon.wmarchive 
 CMSSWPOP: /topic/cms.swpop 
 xrootd: /topic/xrootd.cms.aaa.ng

I reckon we need to talk to the producers of these topics. I believe that's the WMCore team for WMArchive and Bockjoo for xrootd. What about CMSSWPOP? Does CRAB push any data to AMQ? @ericvaandering, any clues?

Context: https://indico.cern.ch/event/1356295/

ericvaandering commented 9 months ago

I'd start with Matti Kortelainen for CMSSW.

haozturk commented 9 months ago

Thanks Eric, I'll contact him. For reference, here are the links to the existing data:

haozturk commented 9 months ago

For WMArchive: I see that it already has the error information. See this link for production and this link for CRAB, and look at the data.steps field of a random entry. The only problem is that the field is not indexed, so it's not possible to query on it. My plan is to change the rucio-tracers repo so that we parse this info and push it to /topic/cms.rucio.tracer in the right field.
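A minimal sketch of the parsing step, written in Python purely for illustration and independent of the actual rucio-tracers code. It assumes each entry in data.steps carries a `name` and an `errors` list whose items have `exitCode`, `type` and `details` keys; those key names and the sample record are assumptions and should be checked against a real WMArchive document.

```python
from typing import Any, Dict, List


def extract_step_errors(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the error entries found under data.steps of one record.

    The key names ("name", "errors", "exitCode", "type", "details") are
    assumptions about the record layout, not a confirmed schema.
    """
    found = []
    for step in record.get("data", {}).get("steps", []):
        # "errors" may be missing or None for steps that succeeded.
        for err in step.get("errors") or []:
            found.append({
                "step": step.get("name", ""),
                "exitCode": err.get("exitCode"),
                "type": err.get("type", ""),
                "details": err.get("details", ""),
            })
    return found


# Made-up record for demonstration only.
sample = {"data": {"steps": [
    {"name": "cmsRun1",
     "errors": [{"exitCode": 8021, "type": "FileReadError",
                 "details": "made-up error text"}]},
    {"name": "stageOut1", "errors": []},
]}}
errors = extract_step_errors(sample)
```

Here `errors` holds one flattened entry for the `cmsRun1` step, which is the kind of value that could then be copied into a trace field.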

For xrootd and CMSSW: we still don't know how to do it. Bockjoo doesn't know for AAA, and I haven't had a reply from Matti yet. I'll keep investigating.

makortel commented 9 months ago

To my understanding the "CMSSW popularity" information originates from CMSSW's StatisticsSenderService, which sends UDP packets to "somewhere". The Service sends a UDP packet with a bunch of information whenever the primary / secondary (= two-file solution) / embedded (= pileup) file is closed. While extending the data sent via UDP would be straightforward (it's JSON after all), adding information on file read errors specifically does not look straightforward. If you really want to, we can take a deeper dive into what the implementation would entail, in which case please open a feature request issue in the CMSSW GitHub repository.
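For readers unfamiliar with the mechanism, here is a generic sketch of sending a JSON document in a single UDP datagram, using only the Python standard library. It is not the StatisticsSenderService implementation; the payload keys, the LFN and the loopback receiver are hypothetical and exist only to make the example self-contained.

```python
import json
import socket


def send_udp_json(payload: dict, host: str, port: int) -> int:
    """Serialize payload as JSON and send it as one UDP datagram.

    Returns the number of bytes handed to the socket.
    """
    data = json.dumps(payload).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(data, (host, port))


# Local demonstration: a receiver bound to an OS-chosen loopback port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

# Hypothetical payload; the real packet format is not described in this thread.
report = {"file_lfn": "/store/example/file.root", "read_error": "hypothetical"}
sent = send_udp_json(report, "127.0.0.1", port)
received = json.loads(receiver.recvfrom(65535)[0].decode("utf-8"))
receiver.close()
```

UDP gives no delivery guarantee, so a sender built this way cannot tell whether the collector actually received the report.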

Before committing to any development I'd like to understand why the information in WMArchive (that is filled from the CMSSW framework job reports from both production and CRAB(?)) would not be sufficient. Do you e.g. want to catch the read errors from all the users' non-CRAB jobs as well?

haozturk commented 8 months ago

Thanks @makortel, this is useful. I agree that we should start with WMArchive.

@yuyiguo I think you're one of the developers of rucio-tracers. At first glance, it seems this task can be accomplished by feeding the errors from the data.steps field in WMArchive into the stateReason field of the rucio traces. I'm looking into how to do this. If you have comments on the subject before I start the implementation, they would be very much appreciated. My only worry is that the errors field can be quite large. I don't know whether this would cause any issues.
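One possible way to address the size concern is to cap the error text before it goes into stateReason. The sketch below is an assumption about how that could look; the 512-byte limit and the `bounded_reason` helper are hypothetical, not an existing rucio-tracers setting or function.

```python
def bounded_reason(details: str, max_bytes: int = 512) -> str:
    """Return details cut down to at most max_bytes of UTF-8.

    The default limit is an arbitrary illustration, not a known
    constraint of the trace schema.
    """
    marker = " [truncated]"
    marker_len = len(marker.encode("utf-8"))
    if max_bytes < marker_len:
        raise ValueError("max_bytes is too small to hold the truncation marker")
    raw = details.encode("utf-8")
    if len(raw) <= max_bytes:
        return details
    keep = max_bytes - marker_len
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return raw[:keep].decode("utf-8", errors="ignore") + marker


# A made-up oversized error message squeezed into a trace-like dict.
trace = {"stateReason": bounded_reason("x" * 5000)}
```

Truncating keeps the start of the message, which is usually where the exception type appears; if the useful part sits at the end, keeping the tail instead would be the better choice.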

haozturk commented 7 months ago

Hi @ericvaandering @yuyiguo, how can I test my changes in rucio-tracers? Is there a test queue that I can use to test my implementation against?

Edit: Adding in @dynamic-entropy as well in case he knows

dynamic-entropy commented 7 months ago

I have never looked at this, so I cannot give an exact answer. But you can subscribe to the same queue with a different client and you will receive the same events without affecting prod.
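To illustrate what a second subscriber amounts to at the protocol level, here is a sketch that builds a STOMP 1.2 SUBSCRIBE frame by hand. It assumes the broker is reached over STOMP and uses /topic/cms.rucio.tracer as an example destination; the subscription id is made up, and a real client would normally use a STOMP library and send a CONNECT frame first.

```python
def subscribe_frame(destination: str, subscription_id: str, ack: str = "auto") -> bytes:
    """Build a STOMP 1.2 SUBSCRIBE frame.

    Header values are written as-is; escaping of ':' or newlines inside
    values is not handled in this sketch.
    """
    headers = {"id": subscription_id, "destination": destination, "ack": ack}
    lines = ["SUBSCRIBE"] + [f"{key}:{value}" for key, value in headers.items()]
    # A frame is: command, headers, blank line, (empty) body, NUL terminator.
    return ("\n".join(lines) + "\n\n").encode("utf-8") + b"\x00"


# "test-sub-0" is a hypothetical id chosen for the test client.
frame = subscribe_frame("/topic/cms.rucio.tracer", "test-sub-0")
```

Each connection picks its own `id`, so a test client subscribing this way does not interfere with the subscription held by the production consumer.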

haozturk commented 7 months ago

Thanks Rahul. Rahul, Nikodemas and I had a chat offline, and we'll request a new subscriber for this queue to be used for testing. If anybody already has a test subscriber for this queue, please let me know so that we can avoid duplicate work.

belforte commented 5 months ago

I should have read this issue earlier...

- We intend to decommission WMArchive for CRAB, and we have not been sending data there for a few weeks.
- My understanding is that it is difficult or impossible for CMSSW to send info about ROOT fatal errors, since the info is in the xrootd exception, which is not parsed by the framework.
- I would be happy to make CRAB send a UDP packet, just like CMSSW does, when we discover a hint of a corrupted file in the logs. Can someone tell me how to do it and what the format should be?
- Alternatively, CRAB can send directly to the queue mentioned above. In that case it would be better to get @mapellidario involved, since he already knows STOMP (if STOMP is needed here).
- Whichever means we use, I think we should send one (short) message per file and let Rucio decide how many reports to collect before taking action.
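The "one message per file, decide on the receiving side" idea can be sketched as a simple tally. This is only an illustration of the counting logic, not how Rucio handles suspicious replicas; the class name, the (LFN, site) key and the threshold of 3 are all hypothetical.

```python
from collections import Counter
from typing import Tuple


class ReadErrorTally:
    """Count read-error reports per (LFN, site) and flag repeat offenders."""

    def __init__(self, threshold: int = 3):
        # The threshold is an arbitrary example value.
        self.threshold = threshold
        self.counts: Counter = Counter()

    def report(self, lfn: str, site: str) -> bool:
        """Record one report; return True once the threshold is reached."""
        key: Tuple[str, str] = (lfn, site)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold


tally = ReadErrorTally(threshold=3)
# Made-up LFN and site name, reported three times in a row.
results = [tally.report("/store/example/file.root", "T2_XX_Example") for _ in range(3)]
```

Keeping the sender stateless and the threshold on the consumer side means the policy can change without touching CRAB or CMSSW.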

dmwm / CMSRucio

[DEV] Suspicious replica recoveror: Enrich rucio tracers to include file read errors #691