EcoJulia / RasterDataSources.jl

Easily download and use raster data sets in Julia
MIT License

Where to store rasters? #23

tpoisot opened this issue 3 years ago

tpoisot commented 3 years ago

#21 ended up being about org admin, so let me restate the issue here:

  1. we want to store raster data centrally
  2. we want the user to have flexibility about where "centrally" is (can be a folder, can be a server I guess, etc)
  3. we want to make a default decision when the user does not specify anything, e.g. for small datasets

The current solution is to require ENV["RASTERDATASOURCES_PATH"] - this can work but it requires setting the variable even for small datasets, which is an additional step for users.
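For reference, a minimal example of what users currently have to do before anything can be downloaded (the path is illustrative):

```julia
# The variable has to be set before using the package,
# e.g. in ~/.julia/config/startup.jl or exported from the shell.
ENV["RASTERDATASOURCES_PATH"] = "/home/me/raster_data"

using RasterDataSources
# downloads now end up under /home/me/raster_data
```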

The solution I suggested in #21 was to use traits for the different types, but this is also possibly confusing - sometimes things will stop working unless the variable is set (and setting it from the session will not make it permanent).

Solutions like Artifacts don't work because we don't want to download ALL data when the package is built.

Here is my current thinking on this - we might want to keep the idea of an ENV["RASTERDATASOURCES_PATH"], and have a greet() function that reminds users of what it does. Specifically, if no such path is set, we could use a folder in @__DIR__ to store the data. Users who don't want to make a choice will have their data there, and users who want to specify a path still have that option. A rough sketch of the fallback is below.
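A rough sketch of that fallback (the function name and folder are illustrative):

```julia
# Use the user-set path when present, otherwise store data next to the package source.
function rasterpath()
    haskey(ENV, "RASTERDATASOURCES_PATH") && return ENV["RASTERDATASOURCES_PATH"]
    return joinpath(@__DIR__, "data")
end
```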

@rafaqz what do you think?

rafaqz commented 3 years ago

Isn't this just going back to your original proposal? Downloading 30 GB will still go into @__DIR__ by default. For me the important consideration is making sure that doesn't happen.

It's not clear to me why traits won't work? The idea is that the variable is set outside the session, and that will only affect large datasets anyway. Maybe we should just use the new preferences system to set this instead of the env var. But a PR would make this a more substantial thing to discuss.

We can't use artifacts because there are millions of files.

tpoisot commented 3 years ago

Agreed that we can't use Artifacts. I'm slowly realizing the size of the other datasets - I might end up being convinced that asking users to set a path is not unreasonable - I also don't want to write into @__DIR__.

rafaqz commented 3 years ago

Let's use this: https://github.com/JuliaPackaging/Preferences.jl?

Then you can set preferences in-session, after an error gives you an example of what to do, and the path will stick once you set it. The trait can work as planned earlier, and we only throw the error requiring the preference to be set for the large weather datasets.
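A rough sketch of how that could look with Preferences.jl (the preference key, the `islarge` trait, and the error text are all hypothetical):

```julia
using Preferences

# Hypothetical trait: mark which dataset types are too big to download silently.
islarge(::Type) = false
# islarge(::Type{<:SomeLargeWeatherSource}) = true   # large sources opt in

# Look up the stored preference for the data directory (inside the package module).
# Only large sources error when it is missing; small ones fall back to a local folder.
function rasterpath(T)
    path = @load_preference("data_path", nothing)
    path !== nothing && return path
    islarge(T) && error(
        """No data path set for $T. Run
           `using Preferences; set_preferences!(RasterDataSources, "data_path" => "/some/dir")`
        and restart Julia.""")
    return joinpath(@__DIR__, "data")
end
```

From a user session the preference could then be set once with `set_preferences!(RasterDataSources, "data_path" => "/data/rasters")`; Preferences.jl persists it in the project's LocalPreferences.toml.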

rafaqz commented 3 years ago

Yeah, literally multiple terabytes!

Preferences.jl could be the middle ground we need; it's less weird than ENV.

rafaqz commented 3 years ago

For Cesar it's important to have both scales... We'll run GrowthMaps.jl with tiny climate datasets for exploration and sharing ideas, but swap to hundreds-of-GB datasets for fitting real models - the GrowthMaps/GeoData/RasterDataSources combo abstracts that away and the output is the same format.

rafaqz commented 3 years ago

But I think you are right for Bioclim and Climate. I had to set the path in supporting scripts for a paper, and it's pretty awful, and makes the script not reproducible without editing.

tpoisot commented 3 years ago

Let's definitely go with Preferences - I'll work on this when I've made progress on the future bioclim data

asinghvi17 commented 1 year ago

Would it make sense to use a scratch space from Scratch.jl as a default, with the user being given the option to override that by some mechanism (either Preferences.jl or some environment variable)?
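For concreteness, a minimal sketch of that idea (the preference key and scratch-space name are hypothetical):

```julia
using Scratch, Preferences

# Prefer an explicit user setting (a preference or the existing env var);
# otherwise fall back to a per-package scratch space that Julia manages.
function rasterpath()
    path = @load_preference("data_path", get(ENV, "RASTERDATASOURCES_PATH", nothing))
    path === nothing || return path
    return @get_scratch!("rasters")   # lives under ~/.julia/scratchspaces/<package uuid>/
end
```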

rafaqz commented 1 year ago

The Scratch.jl docs kind of say not to use it for this use case:

> Because the scratch space location on disk is not very user-friendly, scratch spaces should, in general, not be used for storing files that the user must interact with through a file browser. In that event, packages should simply write out to disk at a location given by the user.

I personally occasionally manage these files in a file browser - say, to copy them for someone else when I've downloaded a lot.

But I know the current solution kind of sucks too.

Some of these future climate datasets and current weather datasets are many GB, downloadable with a single command, so we need to be a little careful about the location and let users access and manage it.