dmwm / CMSRucio

7 stars 31 forks source link

Overwrite 2038 timestamp on access dataset timestamp in monitoring #439

Closed dciangot closed 1 year ago

dciangot commented 1 year ago

Not sure at which level of the toolchain this should happen, but we should focus on getting the right information into our monitoring system. What do you think @yuyiguo ? which options do we have for this? Do we have an open issue somewhere already?

For reference, this is the MM dissussion that triggered this requests

yuyiguo commented 1 year ago

I am not sure I understood the problem, but from the MM discussion, it seemed that the problem was a bug in Granfana (https://mattermost.web.cern.ch/cms-o-and-c/pl/nan1h5h5mjfa3y4kzdmyyp4cfe) . So this ticket can be closed? @dciangot

belforte commented 1 year ago

unfortunately MM threads always cover many things. There are a few files where the Rucio last-accessed-time is set to the max possible integer (year 2038) and this confuse monitoring tools who look for latest access time of a dataset by picking MAX of the last access time of all files. Panos checked and found that the spurious 2038 is for a few files created circa 2018. Can you please overwrite that in the DB with something which is about year 2010 ?

belforte commented 1 year ago

@yuyiguo this is the correct thread image

yuyiguo commented 1 year ago

Ok, Thanks @belforte for explaining the problem. Can you or someone let me what the files DIDs and what the last-accessed-time? should be? 01/01/2018?

belforte commented 1 year ago

@mrceyhun or @panos512 can you answer this ? Yuyi, in my mind the idea was that if they can tell you the "bad value", you can select for that in the DB, w/o needing a list of DIDs. Maybe one or two file names as example will do.

yuyiguo commented 1 year ago

I would prefer to only change the confirmed rows because this is production DB.

mrceyhun commented 1 year ago

@yuyiguo I think you can find them easily with a simple SQL query. Just query the accessed_at greater than current timestamp. For instance: select count(*) from CMS_RUCIO_PROD.REPLICAS where ACCESSED_AT>=DATE '2023-03-25' (3 days of safety :) )

yuyiguo commented 1 year ago

What date you want to change to?

yuyiguo commented 1 year ago

If the files came to Rucio in 2018, why they showed as a problem now? Do you know all the DIDs have future dates from 2018?

mrceyhun commented 1 year ago

@yuyiguo I want to help you but unfortunately I don't know better than you. I can provide you the DIDs but I will use above query that I shared. And I don't know which date they should be set. From a monitoring point of view, I don't see a difference between setting their accessed at as 2018 or NULL :) This is just my opinion.

yuyiguo commented 1 year ago

Since I don't know the data and how the problem was made. Someone who knows the data and the correct number let me know the details. I haven't checked the DB yet, but in most cases, NULL will not be accepted. The query you gave would cover all the dates that were not meaningful, but they could come from different years than 2018. We should not just put junk in the DB to fix one problem and cause other problems in the future. Panos already researched and seemed to know the data, Let's wait for his reply.

yuyiguo commented 1 year ago

Looked into the DB, and we have 475 dids that have accessed_at > 01-Jan-38. Some of them were just updated yesterday/2023-03-21 10:15:16. A lot of the were created in 2020. Why is their accessed_at set to 2038? People don't want the data to be deleted? They looked like all mc data, ex name: /store/mc/Run3Winter20DRMiniAOD/DoublePhoton_FlatPt-1000To1500/AODSIM/FlatPU0to80_110X_mcRun3_2021_realistic_v6-v1/110000/9231CCE7-0662-1C4D-965C-B5E530405CC8.root .

yuyiguo commented 1 year ago

475 files are not as few files as we originally understood. I checked in DBS and they were generated by the prod system/wmagent@vocms0280.cern.ch. Unless we prevent people to put the date as what they want to, we may get more of these kinds of dates.

Can you adjust your monitoring to ignore accessed_at > now+1hour instead of changing the DB?

belforte commented 1 year ago

Yes. People/tools should not put junk in the DB. I can't see how setting an access_at in the far future may be intentional. Can Rucio experts follow up with WMAgent experts and get things under control. You are right Yuyi that we do no know exactly how those dates got into the system. Which is why nobody could give you a clear directive yet.

yuyiguo commented 1 year ago

CMS-RSEs.txt Attached is the list of 475 dids' info. We could have more odd access_at dids. I add @amaltaro here, maybe he can shine some lights here.

yuyiguo commented 1 year ago

We will have a CMS rucio meeting next Tuesday and I will bring up this issue. @mrceyhun, is this prevent you from developing monitoring tools? Can you ignore anything larger than the current time in your code?
One compermise I could do if you could not handel this in monitoring would set the access_at to the updated_at. As I said ealier, I prefer not to change prod data without understanding the cause.

Panos512 commented 1 year ago

@yuyiguo I would be very interested to join as well. I'm trying to figure out how all those traces end up in Rucio and what we should do to be able to trust this last_accessed attribute. What time is the meeting? The usual 15.15?

yuyiguo commented 1 year ago

@Panos512 yes, it is the usual 15:15 on Tuesdays. See you then.

yuyiguo commented 1 year ago

I looked into a few of these files in more detail and found they do not have a rule associated with them. See example below:

 rucio list-rules --traverse cms:/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root
ID    ACCOUNT    SCOPE:NAME    STATE[OK/REPL/STUCK]    RSE_EXPRESSION    COPIES    EXPIRES (UTC)    CREATED (UTC)
----  ---------  ------------  ----------------------  ----------------  --------  ---------------  ---------------

When a file has no rules, The Reaper will delete it based on the tombstone which is the max of accesse_at and created_at. The file has two copies and one of them is on T0_CH_CERN_Tape. Why? Could this related to some deletion campaign?

 rucio list-file-replicas cms:/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE   | NAME                                                                                                                                                                              | FILESIZE   | ADLER32   | RSE: REPLICA                                                                                                                                                                                                                                             |
|---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cms     | /store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root | 364.933 MB | eda019d7  | T2_US_MIT: davs://xrootd.cmsaf.mit.edu:1094/store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root                             |
| cms     | /store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root | 364.933 MB | eda019d7  | T0_CH_CERN_Tape: root://eosctacms.cern.ch:1094//eos/ctacms/archive/cms//store/mc/RunIIFall17MiniAODv2/ZPhi_2mu_M-50_PS_TuneCP5_madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/270000/DCC5172C-91EC-EA11-8011-0025B3E01D9A.root |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
belforte commented 1 year ago

@yuyiguo please note that rucio list-rules --traverse does not work for files. See this from @dynamic-entropy

amaltaro commented 1 year ago

Apologies for joining this thread a bit late. I am somehow confused as of what problem is actually being reported here.

I had a look at one of the LFNs that Yuyi provided in her text file above, which I also copy here the DBS query: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?logical_file_name=/store/mc/RunIIAutumn18MiniAOD/GluGluToRadionToHHTo2G2l2nu_M-250_narrow_TuneCP5_PSWeights_13TeV-madgraph-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v1/270000/0F7299A2-C57A-344B-96F4-15E875A0E54C.root&detail=true

and the only time related attribute is last_modification_date, which is correctly set to 2020.

Can someone please clarify what the actual problem is, including:

belforte commented 1 year ago

You are confusing things Alan. This has nothing to do with modification time in dbs

belforte commented 1 year ago

@amaltaro the issue is about last time a file was read, which is tracked in Rucio in order to decide when to delete an unlocked file.

yuyiguo commented 1 year ago

We just talked about the issue in Rucio's meeting. The date 2038-01-19 is the max int number to unix timestamp. I will check where this may come into the system. My plan is to change them to the last_modification_date.

amaltaro commented 1 year ago

Thanks Stefano and Yuyi for clarifying it.

The only moment that WMAgent updates a file (replica?) information is when the replica is actually created. Looking at that code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/RucioInjectorPoller.py#L300-L302

the only replica attributes that are defined in the agent are: name, scope, bytes, state, adler32. Other attributes either come from Rucio default values or from different tools acting on these replicas.

If you need any other information from WM, please let me know, otherwise I would say WM is not the one defining these strange timestamps.

yuyiguo commented 1 year ago

Thank you Alan for the WMAgent info.

yuyiguo commented 1 year ago

Following up with the discussion at the Rucio meeting today:

ericvaandering commented 1 year ago

I guess I would say that anything saying it accessed a file more than 18 hours in the future is lying and should be mapped to the current time.

vkuznet commented 1 year ago

Yuyi, WMArchive service code does not manipulate any incoming data, but it does adds wmats timestamp over here: https://github.com/dmwm/WMArchive/blob/master/src/go/wmarchive.go#L165 which is standard Go int64 data-type. All other fields, e.g. meta_data.ts are provided by WMCore codebase over here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L442 where create_ts comes from FJWR timestamp over here https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L406

According to WMArchive records in ES, the timestamp are unix since epoch, e.g. https://monit-opensearch.cern.ch/dashboards/app/discover#/doc/60770470-8326-11ea-88fc-cfaa9841e350/monit_prod_wmarchive_raw_metric-2023-03-28?id=60bd5b9e11bc46c3a47a50dda7fa4527 and looks like:

...
      "wmats": 1680024480,
...
      "meta_data": {
        "agent_ver": "2.1.7",
        "crab_exit_code": -1,
        "crab_id": "233",
        "fwjr_id": "233-0",
        "host": "vocms0106.cern.ch",
        "jobstate": "failed",
        "jobtype": "CRAB3",
        "ts": 1680024439,
        "wn_name": ""
      },
...
    "metadata": {
      "hostname": "monit-amqsource-ee2e71080d.cern.ch",
      "partition": "19",
      "type_prefix": "raw",
      "kafka_timestamp": 1680024481968,
      "topic": "wmarchive_raw_metric",
      "producer": "wmarchive",
      "_id": "60bd5b9e11bc46c3a47a50dda7fa4527",
      "type": "metric",
      "version": "001",
      "timestamp": 1680024481490
    }

Above, the wmats is filed by WMArchive service, the meta_data.ts is filled by WMCore code, and metadata.timestamp is filled by MONIT (which is represented in milliseconds since epoch). I do not know details which timestamp Rucio tracer use but you have full info now about timestamps and can make your own conclusion.

yuyiguo commented 1 year ago

Valentin, thank you for your quick response. Rucio tracer uses WMArchive meta_data.ts that is in seconds since epoch. It's in the right data type too. So the problem was not from WMArchive.

The problem is caused by a source that uses milliseconds since epoch with data type int32. Then, Rucio will get a timestamp 2038-01-19 03:14:07.

yuyiguo commented 1 year ago

More studies found the odd accessed_at=2038-01-19 03:14:07 are not from rucio tracer. We use one file as an example to demonstrate this conclusion.

RSE_ID  SCOPE   NAME    BYTES   MD5 ADLER32 PATH    STATE   LOCK_CNT    ACCESSED_AT TOMBSTONE   UPDATED_AT  CREATED_AT
"2694B6BD279F4BAA890F5ABBE66351D2"  "cms"   "/store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root" 477003379   (null)  "aa5a7a03"  (null)  "A" 0   "2038-01-19 03:14:07"   "2038-01-19 03:14:07"   "2023-03-21 11:49:04"   "2020-09-22 09:55:32"

The file was created_at=2020-09-22 09:55:32, but rucio tracer was created around Sep 2021. So this file was already in the rucio for a year before the tracer was created. Then the file was updated_at=2023-03-21 11:49:04. Who/what updated the file? For sure it was not the tracer because updated_at is not in the trace definition so the tracer will not change updated_at. I suspect that accessed_at was changed when the updated_at was changed. The same file has tree replicas and only the one with the new update_at=2023-03-21 11:49:04 has accessed_at=2038-01-19 03:14:07.

RSE_ID  SCOPE   NAME    BYTES   MD5 ADLER32 PATH    STATE   LOCK_CNT    ACCESSED_AT TOMBSTONE   UPDATED_AT  CREATED_AT
F44C866A264D4DA9972969E9F3B5BB52    cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root   477003379   (null)  aa5a7a03    (null)  A   1   (null)  (null)  2020-10-06 10:59:08 2020-10-06 10:59:08
DBDAF50EE9D24F92AD24473539522204    cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root   477003379   (null)  aa5a7a03    (null)  A   1   (null)  (null)  2020-10-29 22:04:32 2020-10-29 22:04:32
2694B6BD279F4BAA890F5ABBE66351D2    cms /store/mc/RunIISummer16MiniAODv3/ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M210_TuneCUETP8M1_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/PUMoriond17_94X_mcRun2_asymptotic_v3-v1/280000/402192B0-9AF2-EA11-ACE3-AC1F6BD5A0DE.root   477003379   (null)  aa5a7a03    (null)  A   0   2038-01-19 03:14:07 2038-01-19 03:14:07 2023-03-21 11:49:04 2020-09-22 09:55:32

I will continue to contact the swpop and xrootd AMQ developers to make sure we get the correct timestamps from them.

yuyiguo commented 1 year ago

All 473 replicas that have accessed_at > Jan-1-2038 are updated as below:

update replicas
set accessed_at=updated_at, tombstone=updated_at
WHERE accessed_at > to_date('01-Jan-38','DD-MON-YY') ;
yuyiguo commented 1 year ago

The table was updated. I did not get additional info and my investigation showed that the 2038 timestamps were not from rucio traces. So I am closing this ticket. Reopen it if new info appears.

dynamic-entropy commented 1 year ago

Hi @yuyiguo Is there more to be done here? Was the reopening of the issue intentional? I see that the accessed_at fields did not get any new 2038 timestamps after you corrected them.

yuyiguo commented 1 year ago

It is done but forgot to close it. Close now.