dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
284 stars 135 forks source link

billing: some remove messages lack DN/FQAN information #7245

Open calestyo opened 1 year ago

calestyo commented 1 year ago

Hey.

We recently had an issue where files got "missing" but where it turned out that most likely ATLAS had just deleted them earlier.

Alongside of debugging that, I noticed that for many "door-request-info" message we don't see any DN/FQAN in the billing log entry, despite being configured in our billing log format strings:

billing.text.format.door-request-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$drMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$clientChain$]$\\t$$$$transactionTime$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.pool-hit-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$phMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$protocol$]$\\t$$$$if(cached)$cached$else$not-cached$endif$$$$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.storage-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$sMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$$$$transferTime$ms$\\t$[$rc$:"$message$"]
billing.text.format.mover-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$mMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$protocol$]$\\t$[$initiator$]$\\t$$$$if(p2p)$p2p$else$no-p2p$endif$$$$\\t$$$$if(created)$upload$else$download$endif$$$$\\t$$$$transferred$B$\\t$$$$meanReadBandwidth$B/s$\\t$$$$meanWriteBandwidth$B/s$\\t$$$$connectionTime$ms$\\t$$$$readActive$ms$\\t$$$$readIdle$ms$\\t$$$$writeActive$ms$\\t$$$$writeIdle$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.remove-file-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$rfMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$rc$:"$message$"]
billing.text.format.warning-pnfs-file-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$wpfMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]

A concrete log entry, where the info is missing looks like:

billing-2023.07.01:2023-07-01 04:24:13+02:00    drMsg:  [door:webdav.tls_lcg-lrz-dc36@webdav_lcg-lrz-dc36:remove]   [door:webdav.tls_lcg-lrz-dc36@webdav_lcg-lrz-dc36:1688178253327-490853] [0000841D6A56EDE047699BA6A6B39B2EDEC8:/pnfs/lrz-muenchen.de/data/atlas/dq2/atlasscratchdisk/rucio/panda/28/b5/panda.um.user.hiito.33917626._000421.hist-output.root]    10018B  [<unknown>] []  []  [[]:[]] [prdatl01]  [50201] [1307:1307] 0ms [unknown]   0ms [/pnfs/lrz-muenchen.de/data/atlas/dq2/atlasscratchdisk/rucio/panda/28/b5/panda.um.user.hiito.33917626._000421.hist-output.root] [0:""]

one where it's not missing like:

billing-2023.07.05:2023-07-05 21:10:52+02:00    drMsg:  [door:webdav.tls_lcg-lrz-dc35@webdav_lcg-lrz-dc35:remove]   [door:webdav.tls_lcg-lrz-dc35@webdav_lcg-lrz-dc35:1688584252881-574377] [000079F49D6B2CA2473791F504BC9E98729B:/pnfs/lrz-muenchen.de/data/ops/test_webdav_access_84_moved]   121B    [<unknown>] []  [/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr]    [[/ops]:[/ops]] [ops001]    [45001] [45000:45000]   0ms [unknown]   0ms [/pnfs/lrz-muenchen.de/data/ops/test_webdav_access_84_moved]    [0:""]

with the following numbers of cases (for e.g. this month):

/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep $'\[[]:[]]\t\[' | wc -l
897369
/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep -v $'\[[]:[]]\t\[' | wc -l
2457

so for nearly all transfers, the information is missing. And from those where it's not:

/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep -v $'\[[]:[]]\t\[' | grep '\[ops001]' | wc -l
2276

The bigger part is from ops tests.

So I'd presume there's some systematic difference in how these deletions are done.

Perhaps ATLAS uses macaroons, and then the info is missing?
But that would IMO be quite a deficiency in the system, as it seems to make tracking down who's to blame quite difficult.

So if it would be indeed macaroons,... couldn't it be made, that dcache keeps the information from which DN/FQAN the macaroon used for authentication was generated and include that in the billing?

Cheers, Chris.

lemora commented 11 months ago

Hi Chris.

Which dCache version is this?

Cheers, Lea

calestyo commented 11 months ago

Hey.

9.1.2.

Cheers, Chris

lemora commented 11 months ago

Okay. To be honest, this is very likely token-related, and in that case there is nothing we can do to retrrieve and log a DN/FQAN.

calestyo commented 11 months ago

Phew... that sounds pretty serious if one could't trace down which user is responsible for some action.