EUDAT-B2STAGE / http-api

RESTful HTTP-API for the B2STAGE service inside the EUDAT project
https://eudat-b2stage.github.io/http-api/
MIT License

Upload performance #112

Closed chStaiger closed 6 years ago

chStaiger commented 6 years ago

I quickly tested the upload performance with a medium-sized file. The data transfer took place from the SURFsara HPC cluster to the B2STAGE HTTP API test instance at CINECA. Code:

TOKEN="mysecrettoken"
SERVER='https://b2stage-test.cineca.it'
# Create 10GB file
dd if=/dev/zero of=file10G bs=1G count=10
time curl -X PUT -H "Authorization: Bearer $TOKEN" -F file=@file10G \
     $SERVER/api/registered/cinecaDMPZone1/home/cstaiger/upload/file10G

Result:

real    21m6.607s
pdonorio commented 6 years ago

Hi @chStaiger, it seems you tested the normal upload.

To reach better performance we needed a way to stream the HTTP packets directly to the iRODS socket; with Python/Flask this was possible only by requiring the Content-Type HTTP header to be application/octet-stream.

Without that, the web server would first save the file to its own filesystem cache and then send it to iRODS, resulting in high overhead.
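The idea can be sketched independently of Flask and iRODS: copy the request body to the destination in fixed-size chunks, so memory use stays constant regardless of file size. This is only an illustration of the technique; the chunk size and the `relay_stream` helper are not part of the actual http-api code.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read keeps memory use constant


def relay_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst chunk by chunk, never buffering the whole payload."""
    total = 0
    while True:
        block = src.read(chunk_size)
        if not block:
            break
        dst.write(block)
        total += len(block)
    return total


# In-memory streams stand in for the HTTP request body and the iRODS socket.
body = io.BytesIO(b"x" * (2 * CHUNK_SIZE + 123))
sink = io.BytesIO()
copied = relay_stream(body, sink)
```

With a multipart upload (`curl -F`) the server cannot do this, because it must parse the form encoding first; a raw octet-stream body can be relayed as-is.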

You can see the example for trying the "streaming mode" in the docs: https://eudat-b2stage.github.io/http-api/docs/user/registered.html#put

I made a test with 10GB, using curl streaming on your code:

# test performed from a host outside of CINECA
time curl -X PUT \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/octet-stream" \
    -T file10G \
    $SERVER/api/registered/YOUR/PATH/TO/FOLDER/FILENAME

...

real    3m58.268s

Could you please try again with this and get back to me?

chStaiger commented 6 years ago

With the additional header I get:

real    4m52.104s

And when uploading larger files, the files are only partially transferred:

time curl -X PUT -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/octet-stream" \
-F file=@file100G $SERVER/api/registered/cinecaDMPZone1/home/cstaiger/upload/file100G_22-02

curl: (18) transfer closed with 233 bytes remaining to read

real    5m8.219s
user    0m6.540s
sys     0m4.280s
chStaiger commented 6 years ago

The only problem is that when you use streaming in the Python API you lose the ability to trigger event hooks in iRODS, and you lose the full audit trail in the iRODS logs, since iRODS will not notice that someone is accessing its data objects. See the ticket here: https://github.com/irods/python-irodsclient/issues/117

chStaiger commented 6 years ago

For comparison: a normal transfer of a 10GB file with the iRODS native protocol takes real 1m16.603s.

pdonorio commented 6 years ago

when uploading larger files, the files get only transferred partially

I think there may still be a timeout or size limit left somewhere; I could double-check.

when you use streaming in the python API you will loose the ability to trigger

You can't have it all with the current Python library at the moment.

A normal transfer with the iRODS native protocol for a 10GB file takes

The original C++ client probably leverages some parallel transfer on the socket, so it is much more performant. I don't think we can gain much more in that direction.
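The parallel-transfer idea can be illustrated in Python purely as a sketch of the concept (not of the iRODS C++ client internals): split the file into byte ranges and let several workers copy their ranges concurrently. All helper names here are hypothetical.

```python
import concurrent.futures
import os
import tempfile


def copy_range(src_path, dst_path, offset, length):
    """Copy one byte range; each worker handles its own slice of the file."""
    with open(src_path, "rb") as src:
        src.seek(offset)
        data = src.read(length)
    with open(dst_path, "r+b") as dst:
        dst.seek(offset)
        dst.write(data)


def parallel_copy(src_path, dst_path, workers=4):
    size = os.path.getsize(src_path)
    with open(dst_path, "wb") as dst:   # preallocate the destination file
        dst.truncate(size)
    step = -(-size // workers)          # ceil(size / workers) bytes per range
    ranges = [(off, min(step, size - off)) for off in range(0, size, step)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_range, src_path, dst_path, off, length)
                   for off, length in ranges]
        for fut in futures:
            fut.result()                # re-raise any worker error


# demo on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(1_000_003))
    parallel_copy(src, dst)
    with open(src, "rb") as a, open(dst, "rb") as b:
        assert a.read() == b.read()
```

Over HTTP this would correspond to several concurrent range requests per file, which a simple PUT of a single octet-stream body cannot express.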

In general, our HTTP API aims to make things easier and simpler with a standard interface. We achieved that with Python at the cost of some performance.

pdonorio commented 6 years ago

I think this can be closed until we find a new way to boost performance based on the prc (python-irodsclient) library.