Closed chStaiger closed 6 years ago
Hi @chStaiger, it seems you tested the normal upload.
To reach better performance we needed a way to stream the HTTP packets directly to the iRODS socket; with Python/Flask this was possible only by setting the Content-Type HTTP header to application/octet-stream.
Without it, the web server would first save the file into its filesystem cache and then send it to iRODS, resulting in high overhead.
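A minimal sketch of the difference, using only the Python standard library: instead of holding the whole upload before forwarding it, the body is read in fixed-size chunks and each chunk is written out immediately. This is not the actual HTTP API code; `stream_copy`, `CHUNK_SIZE` and the in-memory objects standing in for the HTTP request body and the iRODS data object are hypothetical names chosen for illustration.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink chunk by chunk and return the number of bytes copied.

    Only one chunk is held in memory at a time, so the cost does not grow
    with the size of the upload.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical stand-ins: in the real service the source would be the
    # incoming request stream and the sink an open iRODS data object.
    body = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 17))
    target = io.BytesIO()
    print(stream_copy(body, target), "bytes streamed")
```

Any pair of file-like objects with `read` and `write` can be passed in, which is why the same loop works for a socket-backed request stream.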
You can see the example for trying the "streaming mode" in the docs: https://eudat-b2stage.github.io/http-api/docs/user/registered.html#put
I ran a test with a 10GB file, using curl streaming based on your code:
# test performed from a host outside of CINECA
time curl -X PUT \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/octet-stream" \
-T file10G \
$SERVER/api/registered/YOUR/PATH/TO/FOLDER/FILENAME
...
real 3m58.268s
Could you please have another try with this and come back to me?
With the extended header I get:
real 4m52.104s
And when uploading larger files, the files get only transferred partially:
time curl -X PUT -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/octet-stream" \
-F file=@file100G $SERVER/api/registered/cinecaDMPZone1/home/cstaiger/upload/file100G_22-02
curl: (18) transfer closed with 233 bytes remaining to read
real 5m8.219s
user 0m6.540s
sys 0m4.280s
The only problem is that when you use streaming in the Python API, you lose the ability to trigger event hooks in iRODS and you lose the full audit trail in the iRODS logs, since iRODS will not notice that someone is accessing its data objects. See the ticket here: https://github.com/irods/python-irodsclient/issues/117
For comparison, a normal transfer with the iRODS native protocol for a 10GB file takes real 1m16.603s.
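To compare the two 10GB timings quoted in this thread, the `real` values can be converted to throughput. This is a rough sketch: it assumes "10GB" means 10 * 10^9 bytes (the exact file size is not given here), and the helper names are hypothetical. The two runs were also started from different hosts, so the numbers are only indicative.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '3m58.268s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput_mb_per_s(size_bytes, real_time):
    """Average throughput in MB/s (1 MB = 10^6 bytes)."""
    return size_bytes / parse_real_time(real_time) / 1e6


FILE_SIZE = 10 * 10**9  # assumption: 10GB taken as 10 * 10^9 bytes

if __name__ == "__main__":
    timings = [
        ("HTTP API, curl streaming", "3m58.268s"),
        ("iRODS native protocol", "1m16.603s"),
    ]
    for label, real_time in timings:
        rate = throughput_mb_per_s(FILE_SIZE, real_time)
        print(f"{label}: {rate:.1f} MB/s")
```

Under that size assumption the streaming upload works out to roughly 42 MB/s and the native transfer to roughly 131 MB/s, so the native protocol is about three times faster.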
> when uploading larger files, the files get only transferred partially
I think there might be a timeout or limit left somewhere; I can double-check.
> when you use streaming in the python API you will lose the ability to trigger
You can't have it all with the current Python library.
> A normal transfer with the iRODS native protocol for a 10GB file takes
The original C++ client probably leverages some parallel transfer on the socket, so it performs much better. I don't think we can gain much more in that direction.
In general, our HTTP API aims to make things easier and simpler with a standard interface. We achieved that with Python at the cost of some performance.
I think this can be closed until we find a new way to boost performance based on the prc library.
I quickly tested the upload performance of a medium-sized file. The data transfer took place from the SURFsara HPC cluster to the B2STAGE HTTP API test instance at CINECA. Code:
Result: