dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

HTTP GET on newly uploaded file returns 404 #5285

Closed: bgounon closed 4 years ago

bgounon commented 4 years ago

Context:

Using Go's "net/http" library to GET a file from the WebDAV door immediately after it is uploaded results in a 404 error. We consider the file uploaded when a storage event with the following attributes is generated and published on our dCache Kafka topic:

{
    ...
    "msgType": "transfer",
    ...
    "isWrite": "write",
    ...
}
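
For context, a minimal sketch of how such events might be consumed with sarama's consumer-group API (the broker address, group ID, topic name, and the handler variable are illustrative; imports: context, log, and the sarama package):

config := sarama.NewConfig()
config.Version = sarama.V0_10_2_0 // consumer groups need at least this Kafka version
group, err := sarama.NewConsumerGroup([]string{"broker:9092"}, "storage-event-watcher", config)
if err != nil {
    log.Fatal(err)
}
defer group.Close()
for {
    // handler implements sarama.ConsumerGroupHandler; its ConsumeClaim
    // method contains the message loop shown under "Code snippets" below.
    if err := group.Consume(context.Background(), []string{"dcache-events"}, handler); err != nil {
        log.Fatal(err)
    }
}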

Workaround:

The problem can be worked around by inserting a 1-second pause between the reception of the storage event and the execution of the GET request.

How to reproduce:

In a dCache environment where storage events are configured, have some mechanism ready to run an HTTP GET request to https://my.dcache.com:2880/path/to/myfile.tar immediately upon reception of an event with the attributes "msgType": "transfer" and "isWrite": "write". Then upload myfile.tar to the pool.

Response: status code 404, body:

...
          </div>
              <div class="alert alert-warning">
                 HTTP/1.1 404 Not Found FILE NOT FOUND
              </div>
          </div>
...

Code snippets:

Parsing messages (via sarama):

// Inside the ConsumeClaim method of a sarama consumer-group handler:
for msg := range claim.Messages() { // range over all messages published on the topic
    if isUpload(*msg) { // returns true if the required attributes are set (see sketch below)
        go getFile(*msg)
    }
}
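
For reference, a minimal sketch of what isUpload might look like, assuming the event payload is the JSON shown above carried in the Kafka message value (the storageEvent struct is illustrative; imports: encoding/json and the sarama package):

type storageEvent struct {
    MsgType string `json:"msgType"`
    IsWrite string `json:"isWrite"`
}

func isUpload(msg sarama.ConsumerMessage) bool {
    var ev storageEvent
    if err := json.Unmarshal(msg.Value, &ev); err != nil {
        return false // ignore payloads we cannot parse
    }
    return ev.MsgType == "transfer" && ev.IsWrite == "write"
}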

Getting the file (via http):

req, err := http.NewRequest("GET", fileURL, nil)
if err != nil {
    return err
}
//time.Sleep(1 * time.Second)
resp, err := client.Do(req)
if err != nil {
    return err
}
defer resp.Body.Close()
fmt.Printf("%v", resp.StatusCode) //will print 404 if we don't sleep before client.Do()

paulmillar commented 4 years ago

Hi Bastien,

Thanks for the detailed report.

Could you try an experiment by making a small change to the client to poll for a non-404 response, and give an indication of how many times (or how long) it took to receive a non-404 response?

I don't know Go, but I guess this would be something like:

    req, err := http.NewRequest("GET", fileURL, nil)
    if err != nil {
        return err
    }
    var resp *http.Response
    var count int
    for count = 0; count < 50; count++ {
        resp, err = client.Do(req)
        if err != nil {
            return err
        }
        if resp.StatusCode != 404 {
            break
        }
        resp.Body.Close() // discard the 404 response before retrying
        time.Sleep(20 * time.Millisecond)
    }
    defer resp.Body.Close()
    fmt.Printf("%v %d", resp.StatusCode, count)

Could you run this code (or something like it) and provide us with the time needed before dCache returns a non-404 response, along with the access log file entries corresponding to one file?

You may want to play with the 20 ms delay, reducing its value. This would give you a more accurate estimate of how long dCache continues to return a 404 status code after the Kafka event.

bgounon commented 4 years ago

Hi Paul,

Here is a sample output from the experimental code, with the sleep time set to only 1 ms:

2020/01/30 11:38:13 [COPY_START] Getting data from https://my.dcache.com:2880/path/to/myfile.tar (4f9e1ad7-d0de-cdfc-ea24-ab55ebbf9a21)
0/50: code=404
1/50: code=404
2/50: code=404
3/50: code=404
4/50: code=200
OUT OF LOOP: code=200
2020/01/30 11:38:13 [COPY_RUN] Copying data to /tmp/test/myfile.tar (4f9e1ad7-d0de-cdfc-ea24-ab55ebbf9a21)

As you can see, we waited roughly 5 ms before resp.StatusCode changed from 404 to 200. Further runs show that we usually loop 3 to 10 times before the 404 disappears.

Hope this helps

paulmillar commented 4 years ago

Hi Bastien,

That certainly is very interesting information.

A couple of quick questions:

Do you run multiple PnfsManager instances, perhaps with multiple PostgreSQL instances (master-slave)?

With which protocol are these files being written?

Cheers, Paul.

paulmillar commented 4 years ago

Using prometheus.desy.de, I can confirm that (in the current tip of master), making an HTTP GET request for an incomplete file results in a 500 INTERNAL SERVER ERROR status line.

paul@celebrimbor:~$ curl --limit-rate 100 -L -T ~/Videos/Daft\ Hands.webm -H "Authorization: bearer $(oidc-token ESCAPE)" https://prometheus.desy.de/Users/paul/test.webm
^Z
[1]+  Stopped                 curl --limit-rate 100 -L -T ~/Videos/Daft\ Hands.webm -H "Authorization: bearer $(oidc-token ESCAPE)" https://prometheus.desy.de/Users/paul/test.webm
paul@celebrimbor:~$ curl -s -D- -o/dev/null https://prometheus.desy.de/Users/paul/test.webm 
HTTP/1.1 500 Server Error
Date: Thu, 30 Jan 2020 15:08:53 GMT
X-Clacks-Overhead: GNU Terry Pratchett
Server: dCache/6.1.0-SNAPSHOT
Access-Control-Allow-Credentials: true
Accept-Ranges: bytes
ETag: "0000A7667ECAD11A4B3AAEFC504318258A3D_-151035869"
Last-Modified: Thu, 30 Jan 2020 15:08:49 GMT
Content-Type: text/html
Cache-Control: no-cache
Content-Disposition: attachment
Transfer-Encoding: chunked

paul@celebrimbor:~$ 

ageorget commented 4 years ago

Do you run multiple PnfsManager instances, perhaps with multiple PostgreSQL instances (master-slave) ?

Hi Paul, This dCache instance has 2 PnfsManager cells; I found nothing related to this in the logs.
The databases are hosted on a PostgreSQL cluster without any master-slave configuration. Do you think it could be a sync delay between the 2 PnfsManager cells? I can try shutting one of them down to check.

Cheers, Adrien

paulmillar commented 4 years ago

Yes, please try switching one of the PnfsManager instances off.

bgounon commented 4 years ago

Hi Paul, We tried running with a single PnfsManager cell and experienced the same behavior. The files are written to dCache via SRM. Cheers, Bastien

paulmillar commented 4 years ago

It's good news that switching off the second PnfsManager had no effect: dCache should work fine with two PnfsManagers.

I believe I now understand the problem: it comes from how SRM uploads work.

When uploading a file with SRM, a special disposable directory is created for the upload. The client uploads the file into this directory, typically using the GridFTP protocol (but any supported protocol will work). When the SRM client tells dCache that the upload is complete, the file is moved from the disposable directory and the directory is deleted.

If I remember correctly, your Go agent is reacting to the pool's transfer-complete event. This is sent when the pool has finished receiving the file's data; in an SRM upload, this is when the GridFTP transfer completes. However, at this point the file exists only in the disposable directory. It is only when the SRM client tells dCache that the upload is complete (using the srmPutDone command) that the file is moved into its correct location in the namespace.

You should be able to confirm this by looking at the billing entries (either the billing log files or the billing database). dCache will log (in billing) when srmPutDone completes successfully.

Unfortunately, we currently do not generate Kafka events on SRM door activity, so it is currently impossible to learn via Kafka when an SRM-uploaded file is ready to read. I've created issue #5314 to track this deficiency.

In the absence of the SRM events, you can either continue to poll for the file's availability or update the clients to use another protocol for uploading data into dCache: direct (i.e., not SRM-mediated) GridFTP and HTTP/WebDAV would be two obvious choices.
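
For the polling approach, a minimal Go sketch along the lines of the earlier experiment might look like this (the helper name, retry budget, and delay are illustrative; imports: fmt, net/http, and time):

func waitUntilReadable(client *http.Client, fileURL string) error {
    // Poll with HEAD until dCache reports the file as present.
    for i := 0; i < 50; i++ {
        resp, err := client.Head(fileURL)
        if err != nil {
            return err
        }
        resp.Body.Close()
        if resp.StatusCode == http.StatusOK {
            return nil // the file is now visible in the namespace
        }
        time.Sleep(20 * time.Millisecond)
    }
    return fmt.Errorf("%s: still not readable after polling", fileURL)
}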

bgounon commented 4 years ago

Thank you for your insights, Paul! We will continue to rely on our 1-second-wait workaround for the moment, as this is not a blocking issue for us, and we will likely move to the WebDAV protocol for grid transfers in the future. Cheers,

paulmillar commented 4 years ago

OK, I think we understand why dCache is responding with a 404 status code immediately after the upload completes.

Moving to a direct (i.e., non-SRM) upload protocol (e.g., GridFTP or WebDAV) should resolve this issue. Another possible solution would be to update dCache to send Kafka events when the SRM client declares the transfer successful (see #5314) and to update the Go client to consume these new events.

Therefore, I think we can close this issue.