arXiv / arxiv-submission-ui

User interface of NG submit system.
MIT License
2 stars 6 forks source link

Slowness on file upload step #101

Open erickpeirson opened 5 years ago

erickpeirson commented 5 years ago

Noticing that even small uploads can take a few seconds, and occasionally time out. We should take a closer look at uWSGI tuning for filemanager, and see if we can pin down the bottleneck. ARXIVNG-2211

erickpeirson commented 5 years ago

@DavidLFielding I am noticing that filemanager is restarting a lot; seeing things like:

xxx.xxx.xxx.xxx - -|- [-] [kube-probe/1.10] "GET /filemanager/api/status HTTP/1.1" 200 107 97187 -
DEBUG:filemanager.controllers.upload:Handling upload request for 13
ERROR:filemanager.controllers.upload:Upload 'archive' not specified. Oversize calculation will use default values.
INFO:filemanager.controllers.upload:13: Upload files to existing workspace: file='2288326.tar.gz'
...brutally killing workers...
worker 1 buried after 1 seconds
binary reloading uWSGI...
chdir() to /opt/arxiv
closing all non-uwsgi socket fds > 2 (max_fd = 1048576)...
erickpeirson commented 5 years ago

Doesn't look like it's starved for resources:

image

erickpeirson commented 5 years ago

Via external API, seeing responses in the 500-700ms range on uploads to /.

erickpeirson commented 5 years ago

Responses from getUploadStatus endpoint are in the 150-350ms range.

erickpeirson commented 5 years ago

Ok, so one thing that I am seeing is that, by default, the filemanager service integration will follow 2xx-series redirects: https://github.com/arXiv/arxiv-base/blob/master/arxiv/integration/api/service.py#L162

So when filemanager says 201 Created, submission UI will issue a GET request to the endpoint in the Location header. This isn't needed, since filemanager will immediately return the disposition of the upload when responding to the POST request.

erickpeirson commented 5 years ago

Removed the redirect in 34ae26357800d7225bafcd15e9237742dbd4dc92, but still slow. Seeing really long response times...this was an 18 byte upload that took 8,182,050 microseconds (8.2 seconds):

DEBUG:filemanager.controllers.upload:Handling upload request for 4
ERROR:filemanager.controllers.upload:Upload 'archive' not specified. Oversize calculation will use default values.
INFO:filemanager.controllers.upload:4: Upload files to existing workspace: file='head.tmp'
INFO:filemanager.controllers.upload:4: Processed upload. Saved to DB. Preparing upload summary.
INFO:filemanager.controllers.upload:4: Generating upload summary.
DEBUG:filemanager.controllers.upload:Response data: {'upload_id': 4, 'upload_total_size': 7437610, 'upload_compressed_size': 1125161, 'created_datetime': datetime.datetime(2019, 6, 4, 15, 18, 31, tzinfo=<UTC>), 'modified_datetime': datetime.datetime(2019, 6, 4, 18, 49, 48, tzinfo=<UTC>), 'start_datetime': datetime.datetime(2019, 6, 5, 20, 15, 5, 44505, tzinfo=<UTC>), 'completion_datetime': datetime.datetime(2019, 6, 5, 20, 15, 10, 528678, tzinfo=<UTC>), 'files': [{'name': '19.eps', 'public_filepath': '19.eps', 'size': 817320, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:49.270000+00:00'}, {'name': '10a.eps', 'public_filepath': '10a.eps', 'size': 790044, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:47.980000+00:00'}, {'name': '16.ps', 'public_filepath': '16.ps', 'size': 29147, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:42.745000+00:00'}, {'name': '7.ps', 'public_filepath': '7.ps', 'size': 32013, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:51.930000+00:00'}, {'name': '3.ps', 'public_filepath': '3.ps', 'size': 35930, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:43.173000+00:00'}, {'name': '13b.eps', 'public_filepath': '13b.eps', 'size': 504832, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.765000+00:00'}, {'name': '1.ps', 'public_filepath': '1.ps', 'size': 43078, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:43.749000+00:00'}, {'name': '11b.eps', 'public_filepath': '11b.eps', 'size': 646615, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.340000+00:00'}, {'name': '12a.eps', 'public_filepath': '12a.eps', 'size': 984316, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.471000+00:00'}, {'name': '9.ps', 'public_filepath': '9.ps', 'size': 28286, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:44.711000+00:00'}, {'name': '14.ps', 'public_filepath': '14.ps', 'size': 32196, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:44.945000+00:00'}, {'name': '5.ps', 'public_filepath': '5.ps', 'size': 27561, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:45.189000+00:00'}, {'name': '17.eps', 'public_filepath': '17.eps', 'size': 717506, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:49.079000+00:00'}, {'name': '8.ps', 'public_filepath': '8.ps', 'size': 31371, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:54.178000+00:00'}, {'name': '11a.eps', 'public_filepath': '11a.eps', 'size': 726267, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.207000+00:00'}, {'name': '10b.eps', 'public_filepath': '10b.eps', 'size': 331105, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.072000+00:00'}, {'name': '4.ps', 'public_filepath': '4.ps', 'size': 36238, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:46.525000+00:00'}, {'name': '2.ps', 'public_filepath': '2.ps', 'size': 57263, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:46.806000+00:00'}, {'name': '13a.eps', 'public_filepath': '13a.eps', 'size': 681822, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.670000+00:00'}, {'name': '12b.eps', 'public_filepath': '12b.eps', 'size': 330845, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:48.558000+00:00'}, {'name': '6.ps', 'public_filepath': '6.ps', 'size': 28651, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:47.503000+00:00'}, {'name': '15.ps', 'public_filepath': '15.ps', 'size': 33300, 'type': 'Postscript', 'modified_datetime': '2019-06-04T18:49:47.765000+00:00'}, {'name': '18.eps', 'public_filepath': '18.eps', 'size': 491886, 'type': 'Postscript', 'modified_datetime': '2019-06-04T15:18:49.172000+00:00'}, {'name': 'head.tmp', 'public_filepath': 'anc/head.tmp', 'size': 18, 'type': 'Always ignore', 'modified_datetime': '2019-06-05T20:15:05.367000+00:00'}], 'errors': [], 'upload_status': 'READY', 'workspace_state': 'ACTIVE', 'lock_state': 'UNLOCKED', 'source_format': 'ps', 'checksum': '4NcQR6urVfQQKlYUL5zJQQ=='}
"[removed] [removed] - -|- [-] [python-requests/2.21.0] \"POST /filemanager/api/4 HTTP/1.1\" 201 3772 8182050 -"

(from filemanager service log)

erickpeirson commented 5 years ago

@DavidLFielding so I think we know so far:

  1. Client -> Nginx -> uwsgi -> Flask behaves like we expect; we get reasonable response times on upload requests.
  2. UI -> Requests -> uwsgi -> Flask is really really slow.
erickpeirson commented 5 years ago

After the refactor (https://github.com/arXiv/arxiv-filemanager/pull/43), I isolated the remaining slowness to packing the source package tar-ball. I was able to get about a 10x speedup switching from tarfile package to using subprocess to use the system tar -czf. Still seeing about 2 seconds on the pack step:

upload workspace exists at 8.344650268554688e-06
upload workspace retrieved at 0.0023500919342041016
workspace finished processing upload at 0.020180702209472656
workspace persisted at 0.1458911895751953
source packed at 2.3734874725341797
db updated at 2.3939309120178223
workspace serialized at 2.473506450653076
done at 2.494154453277588