brokamp-group / s3

More efficient downloading and usage of files hosted on AWS S3
http://brokamp-group.github.io/s3/

Allow providing custom `dest_file`? #26

Closed rcannood closed 1 month ago

rcannood commented 2 months ago

The dest_file seems to be hardcoded to the following:

https://github.com/brokamp-group/s3/blob/6aa8253b2fed08dca40ad929b095b977d8cb7623/R/s3_get.R#L33C1-L39C7

  dest_file <-
    fs::path_join(c(
      tools::R_user_dir("s3", "data"),
      s3_uri_parsed$bucket,
      s3_uri_parsed$folder,
      s3_uri_parsed$file_name
    ))

Could we add `dest_file = NULL` as an argument to `s3_get` so that users who want to can download the file to a different location?

Happy to create a PR for this!

cole-brokamp commented 2 months ago

Hi, does changing the default R user dir not work for this? You can specify it using an environment variable, see tools::R_user_dir for more info.
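For reference, a minimal sketch of what this suggestion looks like in practice: `tools::R_user_dir("s3", "data")` checks the `R_USER_DATA_DIR` environment variable first (then `XDG_DATA_HOME`) before falling back to `~/.local/share`, so the cache root can be relocated without code changes. The path below is hypothetical.

```r
# Hypothetical path; set before the data directory is first resolved.
Sys.setenv(R_USER_DATA_DIR = "/mnt/fast-disk/r-data")

# For which = "data", R_user_dir() consults R_USER_DATA_DIR, then
# XDG_DATA_HOME, then defaults to ~/.local/share, and appends "R/<package>".
tools::R_user_dir("s3", "data")
# resolves to a directory under /mnt/fast-disk/r-data
```

Note that this moves the entire cache root; the `s3/data/bucket/folder/filename` structure underneath is unchanged, which is the limitation raised in the next comment.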

rcannood commented 2 months ago

This would still force the path to be $USER_DIR/s3/data/bucket/folder/filename.

What I'd like to do is:

s3::s3_get(
  s3_uri = "s3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad",
  dest_file = "/home/user/foo.h5ad"
)
cole-brokamp commented 2 months ago

I'm having trouble understanding why you need to remove the s3 prefixes in a file path. S3 isn't in a hierarchical format, so there is no "folder". The items are specified by a bucket name and a key.

Could you rename the file after downloading it? Adding functionality to download things to folders without a structure needlessly removes portions of the full file name. Doing this breaks the linkage between the code and the downloaded data (e.g. was foo.h5ad on my computer from 2024-07-01 or 2024-07-02 folder?) and that is the main goal of this package.

rcannood commented 1 month ago

I like that this package caches the S3 files using the same file structure as the S3 bucket, but it shouldn't be a requirement. If a user just wants to transfer a single file without retaining the same file structure, it should be possible.

Having said that, my use case is that I need to implement interoperability with a Python package that can also download and cache files from an S3 path, but it does so in a different location -- namely ~/.cache/<name of app>/<bucket>/<path>. For example, s3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad would get downloaded to ~/.cache/mypackage/cellxgene-data-public/..., but with the current implementation of s3 this is not possible.

Sure, I could download it to $USER_DIR/s3/data and then move it to .cache/mypackage, but that feels like a workaround, hence my question.

What do you think? :relaxed:

cole-brokamp commented 1 month ago

Hi @rcannood, thanks for the thoughtful reply.

I would consider seeing if the python package can use the XDG Base Directory Specification instead of storing everything in a cache folder in the home directory. This specification is what R_user_dir is based on and was specifically designed for interoperability in situations like this.

This package is really a wrapper around downloading files where the user doesn't have to think about where on their computer the data is being stored and can reload it in another R session or project. Since that is the main goal of the package, I hesitate to modify it to allow for saving files to other places.

Could you use download.file() if you want to specify where the file is being saved?

Another workaround could be to pipe the location of the file(s) returned by s3_get() into file.rename().
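A sketch of that workaround, assuming (as the thread implies) that `s3_get()` returns the path of the downloaded file; the destination path is hypothetical, and `file.rename()` can fail across filesystems, so the sketch falls back to copy-and-delete:

```r
library(s3)

uri <- "s3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad"
cached <- s3_get(uri)  # downloads into tools::R_user_dir("s3", "data")/...

dest <- "~/.cache/mypackage/foo.h5ad"  # hypothetical target location
dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)

# file.rename() may fail when source and destination are on different
# disks, which is exactly the inefficiency raised in the next comment.
if (!file.rename(cached, dest)) {
  file.copy(cached, dest)
  file.remove(cached)
}
```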

rcannood commented 1 month ago

My project requirements are what they are, unfortunately. I'll need to be able to sync into $HOME/.cache/xxx.

Sure, I could use download.file(), but then I would have to implement much of the functionality that your package already provides.

I'd prefer not to use s3_get and then move the file with file.rename, because that feels like a workaround. The R_USER_DATA_DIR might even be on a different disk entirely, making the whole operation inefficient.

If I can't change your mind about this, then feel free to close this github issue.

Thanks for your consideration.

cole-brokamp commented 1 month ago

That makes sense on why rename would be inefficient. How would you recommend implementing this? I think I'm open to it, but want to think about the best way to do it.

cole-brokamp commented 1 month ago

Probably an extra argument to s3_get that defaults to tools::R_user_dir("s3", "data")?

cole-brokamp commented 1 month ago

I think this wouldn't break existing functionality, would still allow customizing the R user dir with environment variables, and would also give others the ability to specify their own download dir. We could keep how the function creates the folder path structure, but just let the user change which folder it is saved into, right?
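An illustrative sketch of that idea (the argument name `data_dir` and the internal parser name are guesses here, not the merged implementation): the default reproduces the current behavior, while the bucket/folder/file structure is still appended under whatever root the caller picks, preserving the code-to-data linkage by default.

```r
s3_get <- function(s3_uri,
                   data_dir = tools::R_user_dir("s3", "data")) {
  s3_uri_parsed <- s3_parse_uri(s3_uri)  # existing internal URI parser

  # Same folder structure as before, but rooted at a user-chosen directory.
  dest_file <-
    fs::path_join(c(
      data_dir,
      s3_uri_parsed$bucket,
      s3_uri_parsed$folder,
      s3_uri_parsed$file_name
    ))

  # ... download to dest_file as before ...
}
```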

rcannood commented 1 month ago

That's fantastic news!

Adding an argument to the existing functions should indeed allow for the proposed feature without breaking existing functionality.

I took the liberty of creating a PR. Feel free to close it if you aren't happy with the proposed changes.

cole-brokamp commented 1 month ago

completed in #28

cole-brokamp commented 1 month ago

Thanks again for the help. Version 1.1.0 of s3 includes this new feature and was just accepted to CRAN a few minutes ago, so it should be showing up soon!

rcannood commented 1 month ago

Thank you so much for doing all of this! I'm glad we managed to come up with a solution that both of us are happy with.

In due time, you'll see a reverse dependency popping up on the CRAN page :)