dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org

503 Application Error in HTML response upon POST to https://api.dandiarchive.org/api/uploads/initialize/ #185

Closed yarikoptic closed 2 years ago

yarikoptic commented 3 years ago

In follow up to #184 decided to reupload and it failed with Application Error

drogon:/tmp/000029
*$> DANDI_DEVEL=1 dandi upload -i dandi-api --allow-any-path  --validation skip --jobs 2:10 largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2 
2021-03-25 14:02:06,376 [    INFO] Found 1 files to consider
PATH                                                                                  SIZE    ERRORS UPLOAD STATUS                       MESSAGE                                                                                       
largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2 3.6 TB                ERROR                        Error 503 while sending POST request to https://api.dandiarchive.org/api/uploads/initialize/  
Summary:                                                                              3.6 TB                1 ERROR                      1 Error 503 while sending POST request to https://api.dandiarchive.org/api/uploads/initialize/
2021-03-25 14:02:39,326 [    INFO] Logs saved in /home/yoh/.cache/dandi-cli/log/20210325180158Z-12041.log
(dandi-devel) 
#92 !1223 [0].....................................:Thu Mar 25 14:02:39:.
drogon:/tmp/000029
$> cat /home/yoh/.cache/dandi-cli/log/20210325180158Z-12041.log
2021-03-25T14:01:58-0400 [INFO    ] dandi 12041:139975036532096 [META] sys.argv = ['/home/yoh/miniconda3/envs/dandi-devel/bin/dandi', 'upload', '-i', 'dandi-api', '--allow-any-path', '--validation', 'skip', '--jobs', '2:10', 'largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2']
2021-03-25T14:01:58-0400 [INFO    ] dandi 12041:139975036532096 [META] os.getcwd() = /tmp/000029
2021-03-25T14:02:06-0400 [INFO    ] dandi 12041:139975036532096 Found 1 files to consider
2021-03-25T14:02:39-0400 [ERROR   ] dandi 12041:139974440777472 Error 503 while sending POST request to https://api.dandiarchive.org/api/uploads/initialize/: <!DOCTYPE html>
        <html>
          <head>
                <meta name="viewport" content="width=device-width, initial-scale=1">
                <meta charset="utf-8">
                <title>Application Error</title>
                <style media="screen">
                  html,body,iframe {
                        margin: 0;
                        padding: 0;
                  }
                  html,body {
                        height: 100%;
                        overflow: hidden;
                  }
                  iframe {
                        width: 100%;
                        height: 100%;
                        border: 0;
                  }
                </style>
          </head>
          <body>
                <iframe src="//www.herokucdn.com/error-pages/application-error.html"></iframe>
          </body>
        </html>
2021-03-25T14:02:39-0400 [INFO    ] dandi 12041:139975036532096 Logs saved in /home/yoh/.cache/dandi-cli/log/20210325180158Z-12041.log
(dandi-devel) 
#93 !1224 [0].....................................:Thu Mar 25 14:02:51:.
drogon:/tmp/000029
$> dandi --version
0.12.1+4.gd41969c
waxlamp commented 3 years ago

@dchiquito tells me that Heroku was down yesterday, which is (hopefully) where this error came from.

If you try again now, does it work?

yarikoptic commented 3 years ago

Makes sense. I think it worked after. Thanks!

yarikoptic commented 3 years ago
nope -- still there/happened again

```shell
(dandi-devel) 
#100 !1231 [0].....................................:Fri Mar 26 12:13:11:.
drogon:/tmp/000029
$> cat /home/yoh/.cache/dandi-cli/log/20210326161238Z-1414.log
2021-03-26T12:12:38-0400 [INFO    ] dandi 1414:140130935436672 [META] sys.argv = ['/home/yoh/miniconda3/envs/dandi-devel/bin/dandi', '-l', '1', 'upload', '-i', 'dandi-api', '--allow-any-path', '--validation', 'skip', '--jobs', '1:4', '--devel-debug', 'largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2']
2021-03-26T12:12:38-0400 [INFO    ] dandi 1414:140130935436672 [META] os.getcwd() = /tmp/000029
2021-03-26T12:12:38-0400 [DEBUG   ] dandi 1414:140130935436672 Running a newer version (0.12.1+4.gd41969c.dirty) of dandi/dandi-cli than available (0.12.1)
2021-03-26T12:12:39-0400 [DEBUG   ] dandi 1414:140130935436672 Found identifier DANDI:000029 in top level 'identifier'
2021-03-26T12:12:40-0400 [INFO    ] dandi 1414:140130935436672 Found 1 files to consider
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 GET https://api.dandiarchive.org/api/dandisets/000029/versions/draft/assets/
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 Response: 200
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 /mnt/backup/dandi/dandiarchive-s3-backup/.git/annex/objects/j3/XZ/MD5E-s3628888939868--c8e495cbe418147a3437cc92a66e673b/MD5E-s3628888939868--c8e495cbe418147a3437cc92a66e673b has no nwb_version
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 Calculated dandi-etag of b0a8ccd03697bbfc361a686ac6968598-10000 for largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2: Beginning upload
2021-03-26T12:12:40-0400 [DEBUG   ] dandi 1414:140130935436672 POST https://api.dandiarchive.org/api/uploads/initialize/
2021-03-26T12:13:10-0400 [DEBUG   ] dandi 1414:140130935436672 Response: 503
2021-03-26T12:13:10-0400 [ERROR   ] dandi 1414:140130935436672 Error 503 while sending POST request to https://api.dandiarchive.org/api/uploads/initialize/: Application Error
2021-03-26T12:13:10-0400 [DEBUG   ] dandi 1414:140130935436672 Caught exception Error 503 while sending POST request to https://api.dandiarchive.org/api/uploads/initialize/
2021-03-26T12:13:10-0400 [INFO    ] dandi 1414:140130935436672 Logs saved in /home/yoh/.cache/dandi-cli/log/20210326161238Z-1414.log
(dandi-devel) 
```
yarikoptic commented 3 years ago

https://status.heroku.com/ reports no known issues (and indeed reported some yesterday)

yarikoptic commented 3 years ago

Maybe it is some kind of rate limiting or another related setting? There is another parallel upload happening from the same box.

yarikoptic commented 3 years ago

I think this, which I spotted in the papertrail while trying again, is related (sorry @mgrauer, I do post it here since "Sentry is attempting to send 0 pending error messages", so I guess nobody sees those and nobody analyzes the logs ATM, and I do not see any sensitive information I might be leaking):

Mar 26 09:25:49 dandi-api app/web.1 10.52.10.57 - - [26/Mar/2021:16:25:49 +0000] "POST /api/uploads/initialize/ HTTP/1.1" 200 645 "-" "dandi/0.12.1+4.gd41969c.dirty requests/2.25.1 CPython/3.8.6"
Mar 26 09:25:49 dandi-api heroku/router at=error code=H12 desc="Request timeout" method=POST path="/api/uploads/initialize/" host=api.dandiarchive.org request_id=a11f94ee-c897-4cb1-b6f3-1551e99d2331 fwd="129.170.233.10" dyno=web.1 connect=0ms service=30006ms status=503 bytes=0 protocol=https
Mar 26 09:25:49 dandi-api app/web.1 [2021-03-26 16:25:49 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:20)
Mar 26 09:25:49 dandi-api app/web.1 [2021-03-26 16:25:49 +0000] [20] [INFO] Worker exiting (pid: 20)
Mar 26 09:25:50 dandi-api app/web.1 Sentry is attempting to send 0 pending error messages
Mar 26 09:25:50 dandi-api app/web.1 Waiting up to 2 seconds
Mar 26 09:25:50 dandi-api app/web.1 Press Ctrl-C to quit

Edit: today we did get sentry reports which probably relate to this

yarikoptic commented 3 years ago

My guess is that it might take more than 3 seconds to mint all the 5000 (?) upload URLs. Relating to #184, I wonder if there is a way to "paginate" upload_urls in some way, i.e. so dandi-api would mint them in "batches" (e.g. of 100), so that a fresh batch would then have its own new expiration time. It would be more work for both dandi-api and dandi-cli (even if possible -- I don't know S3 well enough), but it could help to resolve #184 and possibly this timeout without just blindly increasing it.
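
The batching idea could be sketched as follows (an illustration only; `part_batches` is a hypothetical helper, and no such paginated endpoint existed at this point):

```python
def part_batches(n_parts, batch_size=100):
    """Yield (start, stop) ranges of 1-based part indices (stop exclusive),
    covering all n_parts parts in batches of at most batch_size."""
    for start in range(1, n_parts + 1, batch_size):
        yield start, min(start + batch_size, n_parts + 1)


# The 3.6 TB upload above has 10000 parts (per its dandi-etag suffix), so
# batches of 100 would mean 100 requests, each minting URLs with a fresh
# expiration only when the client actually reaches that part of the file.
```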

dchiquito commented 3 years ago

I tried to initialize an upload of size 5 TB (the maximum allowed) and the request completed in 20 seconds, which is a reasonable margin on the 30 seconds that Heroku gives us. I'm guessing this will only happen during periods of high server load, or if you were trying to start multiple very large uploads simultaneously.

I think the easiest solution is for the client to handle 503s gracefully and attempt a few retries. I could also try to profile the /initialize/ view, but I suspect that most of the time is the unavoidable overhead of serializing 10,000 different parts (about 6 MB of JSON). I would like to avoid rearchitecting upload again before release.
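
A client-side backoff-and-retry loop of the kind proposed here could look roughly like this (a minimal sketch, not dandi-cli's actual implementation; the injectable `post` argument exists only so the sketch can be exercised without a network):

```python
import time


def post_with_retries(url, json_body, headers=None, retries=4, factor=1.1,
                      post=None):
    """POST json_body to url, retrying on 503 responses with multiplicative
    backoff (1.1 s, 1.21 s, 1.331 s, ... -- the schedule later reported in
    this thread).  ``post`` defaults to ``requests.post``.
    """
    if post is None:
        import requests  # third-party; what dandi-cli itself uses
        post = requests.post
    delay = factor
    for attempt in range(retries + 1):
        resp = post(url, json=json_body, headers=headers, timeout=60)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp
        if attempt < retries:
            time.sleep(delay)
            delay *= factor
    raise RuntimeError(f"POST {url} kept returning 503 after {retries} retries")
```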

yarikoptic commented 3 years ago

We can and will do that on the client side for now: https://github.com/dandi/dandi-cli/issues/520. But I think the dandi-api server also should not rely on clients to handle 503s (imagine if it were customary to refresh a web page while browsing because it kept timing out ;)). Maybe at least the timeout could be increased to allow a wider margin (e.g. 60 sec), so that if a 503 happens it really is an indication that the server is overloaded?

waxlamp commented 3 years ago

Unfortunately, Heroku has a hard limit of 30 seconds per request. But we can certainly look into why these requests sometimes incur enough "overhead" to reach that limit under load.

For now, if we can solve this via client backoff-and-retry, we can consider it solved as a migration blocker (I will not close the issue, only remove it from the migration blocker board).

yarikoptic commented 3 years ago

Retries in the client are implemented but not sufficient:

2021-03-30T11:06:51-0400 [DEBUG   ] dandi 30050:139698477405568 largeuploads/blobs/f5d/aac/79d9c6c1ba2439200d45680f1cb941087750a000e484ea2973232709d2: Beginning upload
2021-03-30T11:06:51-0400 [DEBUG   ] dandi 30050:139698477405568 POST https://api.dandiarchive.org/api/uploads/initialize/
2021-03-30T11:07:21-0400 [WARNING ] dandi 30050:139698477405568 Retrying requests.sessions.Session.post in 1.1 seconds as it returned <Response [503]>.
2021-03-30T11:07:52-0400 [WARNING ] dandi 30050:139698477405568 Retrying requests.sessions.Session.post in 1.2100000000000002 seconds as it returned <Response [503]>.
2021-03-30T11:08:24-0400 [WARNING ] dandi 30050:139698477405568 Retrying requests.sessions.Session.post in 1.3310000000000004 seconds as it returned <Response [503]>.
2021-03-30T11:08:55-0400 [WARNING ] dandi 30050:139698477405568 Retrying requests.sessions.Session.post in 1.4641000000000006 seconds as it returned <Response [503]>.
2021-03-30T11:09:27-0400 [ERROR   ] dandi 30050:139698477405568 HTTP connection failed

and I have tried multiple times "manually", consistently getting 503.

yarikoptic commented 3 years ago
A shell loop to try different sizes -- for 500 GB it also takes about 30 sec but returns an upload_id; for 1 TB it goes over and no upload_id is returned:

```shell
$> for s in 0 1 1000 1000000 1000000000 500000000000 1000000000000 ; do hs=$(python3 -c "from humanize import naturalsize; print(naturalsize($s))"); echo -n "$s ($hs): "; /usr/bin/time curl --silent -X POST "https://api.dandiarchive.org/api/uploads/initialize/" -H "Authorization: token $DANDI_API_TOKEN" -H "accept: application/json" -H "Content-Type: application/json" -H "X-CSRFToken: SENSORED" -d "{ \"contentSize\": $s, \"digest\": { \"algorithm\": \"dandi:dandi-etag\", \"value\": \"23ec1488a5cd0fc4d08ab1cc5e1a6050-1\" }}" | grep -q upload_id && echo ok || echo fail ;done
0 (0 Bytes): 0.02user 0.00system 0:02.28elapsed 1%CPU (0avgtext+0avgdata 11948maxresident)k
0inputs+0outputs (0major+844minor)pagefaults 0swaps
fail
1 (1 Byte): 0.01user 0.00system 0:00.77elapsed 2%CPU (0avgtext+0avgdata 12000maxresident)k
0inputs+0outputs (0major+846minor)pagefaults 0swaps
ok
1000 (1.0 kB): 0.00user 0.00system 0:01.15elapsed 1%CPU (0avgtext+0avgdata 12036maxresident)k
0inputs+0outputs (0major+845minor)pagefaults 0swaps
ok
1000000 (1.0 MB): 0.01user 0.00system 0:00.25elapsed 7%CPU (0avgtext+0avgdata 12096maxresident)k
0inputs+0outputs (0major+845minor)pagefaults 0swaps
ok
1000000000 (1.0 GB): 0.01user 0.00system 0:00.32elapsed 3%CPU (0avgtext+0avgdata 12088maxresident)k
0inputs+0outputs (0major+850minor)pagefaults 0swaps
ok
500000000000 (500.0 GB): 0.04user 0.06system 0:30.23elapsed 0%CPU (0avgtext+0avgdata 12032maxresident)k
0inputs+0outputs (0major+849minor)pagefaults 0swaps
ok
1000000000000 (1.0 TB): 0.01user 0.00system 0:30.26elapsed 0%CPU (0avgtext+0avgdata 12140maxresident)k
0inputs+0outputs (0major+845minor)pagefaults 0swaps
fail
```
dchiquito commented 3 years ago

@yarikoptic I just tried calling /uploads/initialize/ directly with this request body (5TB):

{
  "contentSize": 5497558138880,
  "digest": {
    "algorithm": "dandi:dandi-etag",
    "value": "abc123"
  }
}

It takes 24 seconds to presign all of the URLs, but that's less than 30 seconds, so it worked. Conveniently, I don't see any other uploads in progress at the moment, so extra load is probably what was causing the issues before.

Can you try the large upload again, while the coast is clear? If we can get the 3.6 TB file uploaded, I would like to remove this as a blocker. We should still resolve the issue, but hopefully uploads of this scale will be few and far between.

yarikoptic commented 3 years ago

FWIW -- uploading the 3.6 TB file at a speed of ~200 MB/s, so if computed right it should take just slightly over 5 hours, i.e. less than the current increased expiration of a week, so hopefully it succeeds.
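
As a back-of-the-envelope check of that estimate (using the byte count visible in the git-annex key in the earlier log, and decimal units):

```python
size_bytes = 3_628_888_939_868   # the "3.6 TB" file (from its MD5E-s... key)
rate_bytes_per_s = 200e6         # ~200 MB/s observed upload speed
hours = size_bytes / rate_bytes_per_s / 3600
# hours comes out just over 5, comfortably inside the week-long expiration
```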

yarikoptic commented 3 years ago

@satra that monster finished uploading into https://gui-beta-dandiarchive-org.netlify.app/#/dandiset/000029/draft/files?location=largeuploads%2Fblobs%2Ff5d%2Faac%2F . Its dandi-digest is b0a8ccd03697bbfc361a686ac6968598-10000

dchiquito commented 3 years ago

@yarikoptic Since we've succeeded in doing the migration blocking upload, I'm removing this as a migration blocker.

yarikoptic commented 3 years ago

> @yarikoptic Since we've succeeded in doing the migration blocking upload, I'm removing this as a migration blocker.

roger that

waxlamp commented 2 years ago

@dchiquito, is this still a problem? Is the fix to move this computation into a celery task?

dchiquito commented 2 years ago

Our options are:

  • Ignore the problem. In ideal circumstances it's possible for the request to finish, so if it ever happens again we can just tell the users to try again later.

  • Paginate. Right now all presigned URLs must be returned at once. If we can break up that one request into multiple reasonably sized requests, the problem goes away. We would need to redesign the system so that it can generate presigned URLs for parts in the "middle" of the blob on demand.

  • Compute the URLs in a celery task and save them in the DB for retrieval later.

I think computing these URLs in a task and saving them in the DB for retrieval later is needlessly complicated.

waxlamp commented 2 years ago

Our options are:

  • Ignore the problem. In ideal circumstances it's possible for the request to finish, so if it ever happens again we can just tell the users to try again later.

If the problem hasn't cropped up since then, I'm ok with this being our solution at least for the moment.

  • Paginate. Right now all presigned URLs must be returned at once. If we can break up that one request into multiple reasonably sized requests, the problem goes away. We would need to redesign the system so that it can generate presigned URLs for parts in the "middle" of the blob on demand.

This sounds like the correct solution. What would be involved with figuring out how to get a "middle" URL?

I think computing these URLs in a task and saving them in the DB for retrieval later is needlessly complicated.

Roger.

dchiquito commented 2 years ago

django-s3-file-field has this method that we use for generating upload URLs. We would need to reimplement it to take some kind of pagination keys (probably a part index and the number of part URLs to generate), and then only generate the URLs in that range. This should be a new endpoint living next to the old one, so it need not be a breaking change; we can deprecate the old one later.

danlamanna commented 2 years ago

Is this issue still relevant? I replicated this request:

{
  "contentSize": 5497558138880,
  "digest": {
    "algorithm": "dandi:dandi-etag",
    "value": "abc123"
  }
}

and it completed in about 7 seconds. Our p95 for this endpoint over the last month is something like 150ms, so this error seems rare.

waxlamp commented 2 years ago

I don't think it's relevant anymore--closing.