Closed dciangot closed 1 year ago
I am not sure I understood the problem, but from the MM discussion, it seemed that the problem was a bug in Granfana (https://mattermost.web.cern.ch/cms-o-and-c/pl/nan1h5h5mjfa3y4kzdmyyp4cfe) . So this ticket can be closed? @dciangot
unfortunately MM threads always cover many things. There are a few files where the Rucio last-accessed-time is set to the max possible integer (year 2038) and this confuse monitoring tools who look for latest access time of a dataset by picking MAX of the last access time of all files. Panos checked and found that the spurious 2038 is for a few files created circa 2018. Can you please overwrite that in the DB with something which is about year 2010 ?
@yuyiguo this is the correct thread
Ok, Thanks @belforte for explaining the problem. Can you or someone let me what the files DIDs and what the last-accessed-time? should be? 01/01/2018?
@mrceyhun or @panos512 can you answer this ? Yuyi, in my mind the idea was that if they can tell you the "bad value", you can select for that in the DB, w/o needing a list of DIDs. Maybe one or two file names as example will do.
I would prefer to only change the confirmed rows because this is production DB.
@yuyiguo I think you can find them easily with a simple SQL query. Just query the accessed_at
greater than current timestamp. For instance:
select count(*) from CMS_RUCIO_PROD.REPLICAS where ACCESSED_AT>=DATE '2023-03-25'
(3 days of safety :) )
What date you want to change to?
If the files came to Rucio in 2018, why they showed as a problem now? Do you know all the DIDs have future dates from 2018?
@yuyiguo I want to help you but unfortunately I don't know better than you. I can provide you the DIDs but I will use above query that I shared. And I don't know which date they should be set. From a monitoring point of view, I don't see a difference between setting their accessed at as 2018 or NULL :) This is just my opinion.
Since I don't know the data and how the problem was made. Someone who knows the data and the correct number let me know the details. I haven't checked the DB yet, but in most cases, NULL will not be accepted. The query you gave would cover all the dates that were not meaningful, but they could come from different years than 2018. We should not just put junk in the DB to fix one problem and cause other problems in the future. Panos already researched and seemed to know the data, Let's wait for his reply.
Looked into the DB, and we have 475 dids that have accessed_at > 01-Jan-38
. Some of them were just updated yesterday/2023-03-21 10:15:16. A lot of the were created in 2020. Why is their accessed_at set to 2038? People don't want the data to be deleted? They looked like all mc data, ex name: /store/mc/Run3Winter20DRMiniAOD/DoublePhoton_FlatPt-1000To1500/AODSIM/FlatPU0to80_110X_mcRun3_2021_realistic_v6-v1/110000/9231CCE7-0662-1C4D-965C-B5E530405CC8.root
.
475 files are not as few files as we originally understood. I checked in DBS and they were generated by the prod system/wmagent@vocms0280.cern.ch. Unless we prevent people to put the date as what they want to, we may get more of these kinds of dates.
Can you adjust your monitoring to ignore accessed_at > now+1hour instead of changing the DB?
Yes. People/tools should not put junk in the DB. I can't see how setting an access_at in the far future may be intentional. Can Rucio experts follow up with WMAgent experts and get things under control. You are right Yuyi that we do no know exactly how those dates got into the system. Which is why nobody could give you a clear directive yet.
CMS-RSEs.txt
Attached is the list of 475 dids' info. We could have more odd access_at
dids.
I add @amaltaro here, maybe he can shine some lights here.
We will have a CMS rucio meeting next Tuesday and I will bring up this issue.
@mrceyhun, is this prevent you from developing monitoring tools? Can you ignore anything larger than the current time in your code?
One compermise I could do if you could not handel this in monitoring would set the access_at to the updated_at. As I said ealier, I prefer not to change prod data without understanding the cause.
@yuyiguo I would be very interested to join as well. I'm trying to figure out how all those traces end up in Rucio and what we should do to be able to trust this last_accessed
attribute. What time is the meeting? The usual 15.15?
@Panos512 yes, it is the usual 15:15 on Tuesdays. See you then.
I looked into a few of these files in more detail and found they do not have a rule associated with them. See example below:
rucio list-rules --traverse cms:/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root
ID ACCOUNT SCOPE:NAME STATE[OK/REPL/STUCK] RSE_EXPRESSION COPIES EXPIRES (UTC) CREATED (UTC)
---- --------- ------------ ---------------------- ---------------- -------- --------------- ---------------
When a file has no rules, The Reaper will delete it based on the tombstone which is the max of accesse_at and created_at. The file has two copies and one of them is on T0_CH_CERN_Tape. Why? Could this related to some deletion campaign?
rucio list-file-replicas cms:/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE | NAME | FILESIZE | ADLER32 | RSE: REPLICA |
|---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cms | /store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root | 364.933 MB | eda019d7 | T2_US_MIT: davs://xrootd.cmsaf.mit.edu:1094/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root |
| cms | /store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root | 364.933 MB | eda019d7 | T0_CH_CERN_Tape: root://eosctacms.cern.ch:1094//eos/ctacms/archive/cms//store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@yuyiguo please note that rucio list-rules --traverse
does not work for files. See this from @dynamic-entropy
Apologies for joining this thread a bit late. I am somehow confused as of what problem is actually being reported here.
I had a look at one of the LFNs that Yuyi provided in her text file above, which I also copy here the DBS query: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?logical_file_name=/store/mc/RunIIAutumn18MiniAOD/GluGluToRadionToHHTo2G2l2nu_M-250_narrow_TuneCP5_PSWeights_13TeV-madgraph-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v1/270000/0F7299A2-C57A-344B-96F4-15E875A0E54C.root&detail=true
and the only time related attribute is last_modification_date
, which is correctly set to 2020.
Can someone please clarify what the actual problem is, including:
You are confusing things Alan. This has nothing to do with modification time in dbs
@amaltaro the issue is about last time a file was read, which is tracked in Rucio in order to decide when to delete an unlocked file.
We just talked about the issue in Rucio's meeting. The date 2038-01-19 is the max int number to unix timestamp. I will check where this may come into the system. My plan is to change them to the last_modification_date.
Thanks Stefano and Yuyi for clarifying it.
The only moment that WMAgent updates a file (replica?) information is when the replica is actually created. Looking at that code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/RucioInjectorPoller.py#L300-L302
the only replica attributes that are defined in the agent are: name, scope, bytes, state, adler32. Other attributes either come from Rucio default values or from different tools acting on these replicas.
If you need any other information from WM, please let me know, otherwise I would say WM is not the one defining these strange timestamps.
Thank you Alan for the WMAgent info.
Following up with the discussion at the Rucio meeting today:
accesse_at
in the database 2038-01-19 03:14:07
has its unix timestamp as 2,147,483,647 which is exactly the int32 max value. WMArchive: /topic/cms.jobmon.wmarchive
: Tracer gets the data from Metadata.Ts. CMSSWPOP: /topic/cms.swpop
and xrootd: /topic/xrootd.cms.aaa.ng
: tracer gets the data from the Ts field in the messages.
I could update the tracer to ignore any timestamps that are greater than the current time+12 hours or something like that, but we will lose the valid time. So I would ask the data source developers to check their side to make sure to use int64 if milliseconds are used.
@vkuznet, can you check WMArchive? I don't know whom to contact for xrootd and swpop. @nsmith- , @klannon , @ericvaandering , @dciangot , anyone can point me to the right contact persons?
Meanwhile, I will update the database for the existing 2038 timestamps using last_modification_date
.I guess I would say that anything saying it accessed a file more than 18 hours in the future is lying and should be mapped to the current time.
Yuyi, WMArchive service code does not manipulate any incoming data, but it does adds wmats
timestamp over here: https://github.com/dmwm/WMArchive/blob/master/src/go/wmarchive.go#L165 which is standard Go int64 data-type. All other fields, e.g. meta_data.ts
are provided by WMCore codebase over here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L442 where create_ts
comes from FJWR timestamp
over here https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L406
According to WMArchive records in ES, the timestamp are unix since epoch, e.g. https://monit-opensearch.cern.ch/dashboards/app/discover#/doc/60770470-8326-11ea-88fc-cfaa9841e350/monit_prod_wmarchive_raw_metric-2023-03-28?id=60bd5b9e11bc46c3a47a50dda7fa4527 and looks like:
...
"wmats": 1680024480,
...
"meta_data": {
"agent_ver": "2.1.7",
"crab_exit_code": -1,
"crab_id": "233",
"fwjr_id": "233-0",
"host": "vocms0106.cern.ch",
"jobstate": "failed",
"jobtype": "CRAB3",
"ts": 1680024439,
"wn_name": ""
},
...
"metadata": {
"hostname": "monit-amqsource-ee2e71080d.cern.ch",
"partition": "19",
"type_prefix": "raw",
"kafka_timestamp": 1680024481968,
"topic": "wmarchive_raw_metric",
"producer": "wmarchive",
"_id": "60bd5b9e11bc46c3a47a50dda7fa4527",
"type": "metric",
"version": "001",
"timestamp": 1680024481490
}
Above, the wmats
is filed by WMArchive service, the meta_data.ts
is filled by WMCore code, and metadata.timestamp is filled by MONIT (which is represented in milliseconds since epoch). I do not know details which timestamp Rucio tracer use but you have full info now about timestamps and can make your own conclusion.
Valentin, thank you for your quick response. Rucio tracer uses WMArchive meta_data.ts that is in seconds since epoch. It's in the right data type too. So the problem was not from WMArchive.
The problem is caused by a source that uses milliseconds since epoch with data type int32. Then, Rucio will get a timestamp 2038-01-19 03:14:07
.
More studies found the odd accessed_at=2038-01-19 03:14:07
are not from rucio tracer. We use one file as an example to demonstrate this conclusion.
RSE_ID SCOPE NAME BYTES MD5 ADLER32 PATH STATE LOCK_CNT ACCESSED_AT TOMBSTONE UPDATED_AT CREATED_AT
"2694B6BD279F4BAA890F5ABBE66351D2" "cms" "/store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root" 477003379 (null) "aa5a7a03" (null) "A" 0 "2038-01-19 03:14:07" "2038-01-19 03:14:07" "2023-03-21 11:49:04" "2020-09-22 09:55:32"
The file was created_at=2020-09-22 09:55:32
, but rucio tracer was created around Sep 2021. So this file was already in the rucio for a year before the tracer was created. Then the file was updated_at=2023-03-21 11:49:04
. Who/what updated the file? For sure it was not the tracer because updated_at
is not in the trace definition so the tracer will not change updated_at
. I suspect that accessed_at
was changed when the updated_at
was changed. The same file has tree replicas and only the one with the new update_at=2023-03-21 11:49:04
has accessed_at=2038-01-19 03:14:07
.
RSE_ID SCOPE NAME BYTES MD5 ADLER32 PATH STATE LOCK_CNT ACCESSED_AT TOMBSTONE UPDATED_AT CREATED_AT
F44C866A264D4DA9972969E9F3B5BB52 cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root 477003379 (null) aa5a7a03 (null) A 1 (null) (null) 2020-10-06 10:59:08 2020-10-06 10:59:08
DBDAF50EE9D24F92AD24473539522204 cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root 477003379 (null) aa5a7a03 (null) A 1 (null) (null) 2020-10-29 22:04:32 2020-10-29 22:04:32
2694B6BD279F4BAA890F5ABBE66351D2 cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root 477003379 (null) aa5a7a03 (null) A 0 2038-01-19 03:14:07 2038-01-19 03:14:07 2023-03-21 11:49:04 2020-09-22 09:55:32
I will continue to contact the swpop and xrootd AMQ developers to make sure we get the correct timestamps from them.
All 473 replicas that have accessed_at > Jan-1-2038 are updated as below:
update replicas
set accessed_at=updated_at, tombstone=updated_at
WHERE accessed_at > to_date('01-Jan-38','DD-MON-YY') ;
The table was updated. I did not get additional info and my investigation showed that the 2038 timestamps were not from rucio traces. So I am closing this ticket. Reopen it if new info appears.
Hi @yuyiguo
Is there more to be done here? Was the reopening of the issue intentional?
I see that the accessed_at
fields did not get any new 2038 timestamps after you corrected them.
It is done but forgot to close it. Close now.
Not sure at which level of the toolchain this should happen, but we should focus on getting the right information into our monitoring system. What do you think @yuyiguo ? which options do we have for this? Do we have an open issue somewhere already?
For reference, this is the MM dissussion that triggered this requests