UPPMAX / irods

Project for implementing an iRODS infrastructure on UPPMAX / SciLifeLab

Decide on the use of iphybun tar bundling #20

Closed samuell closed 12 years ago

samuell commented 12 years ago

Activate admin-mode tar-bundling, according to description in: https://www.irods.org/index.php/Bundling ... or decide whether we should use this functionality or not?

For large datasets with lots of small files, I think this can really be a life-saver, since it avoids the per-file overhead, on the order of seconds, that is otherwise added for every small file that is uploaded. Hence the recommendation from Swestore centrally to use ~20GB chunks wherever possible.
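To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. The 2-second per-file overhead is only an assumed value (all we know is that it is "on the order of seconds"), and the function names are made up for illustration; actual transfer time is not included.

```python
import math

# Assumed values, not measured: per-file protocol overhead and target chunk size.
PER_FILE_OVERHEAD_S = 2.0
CHUNK_BYTES = 20 * 10**9  # ~20GB, as recommended by Swestore


def upload_overhead_seconds(n_files, per_file_overhead_s=PER_FILE_OVERHEAD_S):
    """Protocol overhead only: one fixed cost per uploaded file."""
    return n_files * per_file_overhead_s


def n_bundles(total_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of tarballs needed if the data is packed into fixed-size chunks."""
    return max(1, math.ceil(total_bytes / chunk_bytes))


# Hypothetical dataset: 100,000 files of 1MB each (100GB in total).
unbundled = upload_overhead_seconds(100_000)
bundled = upload_overhead_seconds(n_bundles(100_000 * 10**6))
```

Under these assumptions the unbundled upload spends 200,000 seconds (about 55 hours) on overhead alone, while the same data in five 20GB tarballs spends about 10 seconds.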

On the other hand, if users upload files themselves, maybe it is not very clear at what point this tar-bundling should be done. E.g. if the user uploads one file at a time, should the files be stored on the cache resource first, until 20GB of files have been uploaded, and only then be bundled and uploaded to Swestore, or should each file be tar-bundled immediately, with the tarball re-rolled every time a file is added ...?

... Or, should we just skip this totally and just encourage users to create their own tarballs in suitable sizes, in order to speed up their uploads?

(Thinking of this now, this starts to sound more and more reasonable).
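For what it's worth, user-side bundling could look roughly like the sketch below, using only Python's standard library. It is not an existing UPPMAX tool; the function names, the `bundle_0000.tar.gz` naming and the greedy grouping are all just assumptions for illustration.

```python
import os
import tarfile


def plan_bundles(files, max_bytes=20 * 10**9):
    """Greedily group (path, size) pairs into bundles of at most max_bytes.

    A single file larger than max_bytes ends up in a bundle of its own.
    """
    bundles, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


def list_files(root):
    """Yield (path, size) for every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield path, os.path.getsize(path)


def write_bundles(root, out_dir, max_bytes=20 * 10**9, prefix="bundle"):
    """Write compressed tarballs of roughly max_bytes (uncompressed) each.

    out_dir should live outside root, so old tarballs are not re-bundled.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for i, paths in enumerate(plan_bundles(list_files(root), max_bytes)):
        tar_path = os.path.join(out_dir, f"{prefix}_{i:04d}.tar.gz")
        with tarfile.open(tar_path, "w:gz") as tar:
            for path in paths:
                # Store paths relative to root, so the tarball unpacks cleanly.
                tar.add(path, arcname=os.path.relpath(path, root))
        written.append(tar_path)
    return written
```

The resulting tarballs could then be uploaded one by one with iput. Note that the size limit applies to the uncompressed input, so the tarballs themselves will usually be smaller than the limit.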

samuell commented 12 years ago

@jhagberg I assigned this to you, in case you have time to do some testing on u5, on a recommended strategy for this, and can describe for us later how it works, and what seems to work well ... ?

samuell commented 12 years ago

Well, me and @jhagberg just chatted about this, and realized there are at least two main options, each with their pros and cons, so I think we need some feedback from @brainstorm and @dahlo here:

I. Don't do any automatic tarring at all. Instead, users are encouraged to tar and compress files themselves.

Pros:

  • Quick to retrieve even small files.
  • Users get more control, since we can make the iput command "hang" until the file is uploaded to Swestore (not just the cache resource).
  • Possible to retrieve single files also via the existing web interface (https://webdav.swegrid.se), at least in the future, in case we can add individual user accounts on Swestore.

Cons:

  • If users don't tar their files, there may be a lot of traffic saturating the network, since the SRM protocol is very verbose and adds a few seconds of overhead for each file.
  • Permissions etc. of existing files might not be kept in the same way, since uid/gid does not make sense on Swestore (but can be saved by tar in the tarball).

II. Let users upload to the cache file system only, and then let a rule do periodic tar-bundling and uploading to Swestore (a bit like the current Swestore script). Users will still see their individual files when doing an ils, but when downloading, a whole tar bundle will be downloaded, instead of a single (possibly small) file.

Pros:

  • Less verbose uploads.
  • We can force compression for all data.

Cons:

  • Retrieving small files will download lots of unnecessary data, and will always take a certain time ... i.e. a 20GB chunk probably takes some 5-10 min to download, and then there's the untarring etc.
  • Less control for the users. They cannot be sure that the file has actually been put on Swestore when their iput command finishes.

(... then there are some other variants, such as the ibun command, that users can run themselves (quite different from iphybun, in that individual files will no longer be shown with ils after an ibun, but are instead replaced by a tar bundle, which is what ils shows) ...)

brainstorm commented 12 years ago

I would go for option I.

(Sent by email, in reply to Samuel Lampa's comment above of 27 Apr 2012, 15:51.)

samuell commented 12 years ago

Me too, I think :) ... I'll bring up the discussion at the "torsdagsfika" (Thursday coffee break) tomorrow though, for some additional feedback, or maybe even better, at the USBF meeting on Tuesday ...

samuell commented 12 years ago

The torsdagsfika group wasn't really up for any active discussion on this. I guess the USBF meeting would be optimal for getting some input though ... (maybe we could even think of more things that we need to decide, to bring up with them).

jhagberg commented 12 years ago

@samuell Can you update this issue with the latest info? What did the USBF meeting come up with?

samuell commented 12 years ago

@jhagberg We never really got into those gory details at that meeting. We didn't feel people were that interested in the low-level details, so I suggest we go with what we had already concluded: that we provide no tar handling at this point, but instead leave that to the users...