dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

When pool with restore goes down, it triggers a second restore request #7586

Closed by lemora 1 month ago

lemora commented 1 month ago

Similarly, when a dCache request is cancelled after it has already triggered a restore on the tape system side, this can result in a second restore, because dCache is no longer aware of (or tracking) the first one.

Additionally, when a client tries to access an existing disk copy after the pool holding the file comes back up, PoolManager refuses to serve it: it sees that a restore is ongoing and assumes that there is no replica on disk.

The second issue probably needs to be fixed in https://github.com/dCache/dcache/blob/052d56b970cadbba2ac3e212870eaf50cb13883b/modules/dcache/src/main/java/diskCacheV111/poolManager/RequestContainerV5.java#L833
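To make the second point concrete, here is a minimal sketch of the decision order that would avoid the refusal: look for an online disk replica first, and only then consider a restore that is already in flight. This is not the actual `RequestContainerV5` code. The class, the enum, and every method name below are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the dCache code base.
class RestoreDecisionSketch {
    enum Action { READ_FROM_DISK, WAIT_FOR_RESTORE, START_RESTORE }

    // fileId -> online pools that currently hold a disk replica (assumed bookkeeping)
    private final Map<String, Set<String>> onlineReplicas = new HashMap<>();
    // files for which a restore from tape has been requested and not yet finished
    private final Set<String> restoresInFlight = new HashSet<>();

    // A pool (re)appears and reports that it holds a replica of fileId.
    void poolUp(String pool, String fileId) {
        onlineReplicas.computeIfAbsent(fileId, k -> new HashSet<>()).add(pool);
    }

    // A pool disappears; its replicas are no longer readable.
    void poolDown(String pool) {
        for (Set<String> pools : onlineReplicas.values()) {
            pools.remove(pool);
        }
    }

    Action onReadRequest(String fileId) {
        // Check disk first, even if a restore is still being tracked.
        Set<String> pools = onlineReplicas.get(fileId);
        if (pools != null && !pools.isEmpty()) {
            return Action.READ_FROM_DISK;
        }
        // No online replica: join the existing restore or start exactly one.
        // Set.add returns false when the file is already being restored.
        return restoresInFlight.add(fileId) ? Action.START_RESTORE : Action.WAIT_FOR_RESTORE;
    }

    public static void main(String[] args) {
        RestoreDecisionSketch pm = new RestoreDecisionSketch();
        pm.poolUp("pool1", "file1");
        pm.poolDown("pool1");
        System.out.println(pm.onReadRequest("file1")); // first read while the pool is down
        System.out.println(pm.onReadRequest("file1")); // second read joins the same restore
        pm.poolUp("pool1", "file1");
        System.out.println(pm.onReadRequest("file1")); // pool is back, disk copy is served
    }
}
```

In this sketch a pending restore never blocks a read once a replica is online again. What should happen to the restore that is still running on the tape side is deliberately left out.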

DmitryLitvintsev commented 1 month ago

What does "pool with restore" mean? A pool that has a restore in its queue? If such a pool goes down, all of its restores are gone and there is no replica of the file on it. I thought the issue was simply a pool going down, leaving all the replicas on it unavailable: users try to access the files, and that triggers a stage. Meanwhile the pool comes back up, but none of the files being staged can be read, because there are staging requests in PM (issue #7587).

lemora commented 1 month ago

Well, the phrasing is perhaps a bit convoluted. We also have the issue that we are not able to cancel ongoing restores on the tape side, but that might be hard to address.

Nobody had opened the agreed-upon issue when I checked yesterday, so I created this one. The second part (with the reference to RequestContainerV5) is now redundant with https://github.com/dCache/dcache/issues/7587, of course.

I'll close this issue and we can follow up in the other one.