JuliaGeo / meta

For discussion centered around the JuliaGeo organization

Package to provide access to common geographic datasets #5

Closed Alexander-Barth closed 5 years ago

Alexander-Barth commented 5 years ago

I think it would be useful to have a package that gives access to common geographic data sets (like coastline, land-sea mask, topography...) similar to the data sets which are included (or automatically downloaded on first use) by e.g. basemap or cartopy. Here is a list of data sets that I have in mind:

https://github.com/matplotlib/basemap/tree/master/lib/mpl_toolkits/basemap/data

The primary use case would be visualization (although the plotting itself would be outside the scope of this package). An example of a function that could be part of this package is this one: https://github.com/Alexander-Barth/APTDecoder.jl/blob/master/src/data.jl#L2

The data is downloaded and cached via the package RemoteFiles.jl. Maybe this package could be called GeoDatasets.jl? I think that JuliaGeo would be the ideal place to host such a package. Being new to this organization, I am not sure what the process for adding a package is.

But if such a package already exists, I would love to hear about it.
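The download-and-cache behaviour described above can be sketched with the Julia standard library alone. This is only an illustration of the idea, not how RemoteFiles.jl is implemented; the helper name `cached_file`, its `fetch` keyword, and the example URL in the usage note are assumptions made for this sketch.

```julia
using Downloads

# Hypothetical helper (not part of RemoteFiles.jl or any existing package):
# return the local path of a dataset, downloading it only if it is not
# cached yet. `fetch` is injectable so that the logic can be exercised
# without network access.
function cached_file(url::AbstractString, cachedir::AbstractString;
                     fetch = Downloads.download)
    mkpath(cachedir)                              # create the cache directory if needed
    path = joinpath(cachedir, basename(url))      # cache under the remote file name
    if !isfile(path)
        fetch(url, path)                          # download on first use only
    end
    return path
end
```

A call such as `cached_file("https://example.org/mask.nc", mktempdir())` (placeholder URL) would download the file once; later calls with the same cache directory return the cached path without touching the network.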

visr commented 5 years ago

Yeah it sounds useful to me to have a GeoDatasets.jl package in here!

I haven't done much research, but some initial questions/comments that pop up in my head. No need to answer them all, but perhaps good to consider.

meggart commented 5 years ago

> I think that we don't want to have a dependency on GDAL for this package, so it is probably best to just require that whatever is returned, should have the GeoInterface implemented?

This is a very good point. However, GeoInterface is currently very much centered on vector data, and @Alexander-Barth was explicitly bringing up raster data. So I think just downloading the data and letting the user (or package) decide how to deal with it would be a fast solution for now. If we come up with stable interfaces in the future, it would of course be nice if the file could be opened automatically and accessed through the interface, but to me it seems we are not that far yet.

Regarding accessing GSHHG directly, I think this would be a good idea. I have some native Julia code for rasterizing polygons/shapefiles here https://github.com/esa-esdl/ESDL.jl/blob/master/src/Proc/Shapes.jl#L123-L190 which I wanted to factor into a package anyway. Then we don't have to supply stuff like land-sea-masks on all possible resolutions and projections, but could generate them on the fly, which is sufficiently fast for my applications.
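To make the idea of generating masks on the fly concrete, here is a minimal sketch in plain Julia. It is not the ESDL.jl code linked above; it uses a simple even-odd (ray casting) point-in-polygon test on cell centers, and the function names `inpolygon` and `rasterize` as well as the toy square polygon are assumptions chosen for illustration.

```julia
# Even-odd (ray casting) test: is the point (x, y) inside the polygon?
# `poly` is a vector of (x, y) vertices; the ring is closed implicitly.
function inpolygon(x, y, poly)
    inside = false
    n = length(poly)
    j = n
    for i in 1:n
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Does the edge between vertices j and i cross the horizontal ray
        # that starts at (x, y) and points towards increasing x?
        if (yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi
            inside = !inside
        end
        j = i
    end
    return inside
end

# Rasterize: mask[i, j] is true when the cell center (lon[i], lat[j])
# lies inside the polygon.
function rasterize(poly, lon, lat)
    return [inpolygon(x, y, poly) for x in lon, y in lat]
end

# Toy example (not real geography): a 4 x 3 rectangle on a unit grid
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
lon = 0.5:1.0:5.5
lat = 0.5:1.0:4.5
mask = rasterize(square, lon, lat)
```

Because `lon` and `lat` are arguments, the same polygon can be rasterized at any target resolution instead of shipping one precomputed mask per resolution.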

visr commented 5 years ago

Yeah good point on the raster data not yet having a GeoInterface.

The rasterization code would probably be good to land somewhere, though I am not sure where. GeoDatasets itself doesn't quite feel like the right place to me if this package is only about providing unaltered access to geospatial data sets.

meggart commented 5 years ago

I did not want to suggest putting this into GeoDatasets. My point was just that there should maybe be no need to support a multitude of resolutions for rasterized shape data; instead we could have the tools somewhere to do this for any target.

Alexander-Barth commented 5 years ago

Thank you all for your comments!

On Thu, Aug 29, 2019 at 3:38 PM Martijn Visser notifications@github.com wrote:

> Yeah it sounds useful to me to have a GeoDatasets.jl package in here!
>
> I haven't done much research, but some initial questions/comments that pop up in my head. No need to answer them all, but perhaps good to consider.

> - Is it better to use RemoteFiles https://github.com/helgee/RemoteFiles.jl or DataDeps https://github.com/oxinabox/DataDeps.jl? Two examples of similar packages are RDatasets https://github.com/JuliaStats/RDatasets.jl and MLDatasets https://github.com/JuliaML/MLDatasets.jl/. RDatasets is older, and simply puts the datasets in the git repository, which seems undesirable for the sometimes large files we need. MLDatasets uses DataDeps. We may be able to model it a bit on that package?

So far I only have experience with RemoteFiles. They seem to be pretty similar, but I must say that RemoteFiles.jl looks more straightforward to me (no registration step; DataDeps.jl asks for permission prior to downloading, which can however be deactivated using an environment variable). But I have only tried DataDeps.jl for a couple of minutes.

> - How do we return the datasets to the users? One option is just to give the path to the file, and ask users to read it themselves. This should definitely be one of the options, since you won't know if somebody wants to continue processing it in GDAL (then a GDAL Dataset may be most convenient), or elsewhere. I think that we don't want to have a dependency on GDAL for this package, so it is probably best to just require that whatever is returned, should have the GeoInterface implemented?

I propose to use plain Julia types (at least for now). For the land-sea mask: a vector with the longitudes, a vector with the latitudes, and an array with the raster data.
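A minimal sketch of what such a return value could look like, using only Base Julia. The function name `toy_landseamask`, the `resolution` keyword, and the synthetic rectangular "land" block are assumptions for illustration; a real function would read the values from a downloaded file.

```julia
# Hypothetical example: a land-sea mask returned as plain Julia types.
# The data below is synthetic and does not represent real geography.
function toy_landseamask(; resolution = 10.0)
    # Cell-center coordinates in degrees
    lon = collect(-180 + resolution / 2 : resolution : 180 - resolution / 2)
    lat = collect(-90 + resolution / 2 : resolution : 90 - resolution / 2)

    # mask[i, j] corresponds to (lon[i], lat[j]); 0 = sea, 1 = land
    mask = zeros(Int8, length(lon), length(lat))

    # Mark an arbitrary rectangular block as "land"
    for (j, y) in enumerate(lat), (i, x) in enumerate(lon)
        if 0 <= x <= 40 && 0 <= y <= 30
            mask[i, j] = 1
        end
    end
    return lon, lat, mask
end

lon, lat, mask = toy_landseamask()
```

With this layout, `mask[i, j]` is the value at `(lon[i], lat[j])`, so the three return values can be passed to a plotting function or indexed directly without any extra dependency.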

> - GeoDatasets.jl is quite broad, and could amount to a large collection. But so is MLDatasets, and as long as the data is only downloaded when requested I think this is fine. Some datasets may be so big that we wouldn't want to download them at all, and just fetch the part we want from a web service. Probably not something to consider directly, but good to keep in mind for the future.

So far, the ones that I have in mind are the datasets integrated in basemap (essentially what could be useful to make nice plots).

> - For which datasets to use, I think that it is preferred to stick to original, well-defined datasets that we can download automatically. The basemap data folder you linked is a nice set, but as a source, can we link to GSHHG directly instead? I was not yet familiar with GSHHG. But the example that I was thinking about first was Natural Earth http://www.naturalearthdata.com/. Is there a list of open datasets available somewhere that we can use?

In principle I agree that it is preferable to download the original data source. However, the distribution format chosen by the data provider might not always be the most convenient. NetCDF support (both NetCDF.jl and NCDatasets.jl) currently requires Conda, which is quite a large dependency, or the data provider has bundled several files together in a tar/zip file. Not all users might need to download the very high-resolution data that are also included in the tar file.

Alexander-Barth commented 5 years ago

On Thu, Aug 29, 2019 at 4:03 PM Fabian Gans notifications@github.com wrote:

> > I think that we don't want to have a dependency on GDAL for this package, so it is probably best to just require that whatever is returned, should have the GeoInterface implemented?
>
> This is a very good point. However, GeoInterface is currently very much centered on vector data, and @Alexander-Barth was explicitly bringing up raster data. So I think just downloading the data and letting the user (or package) decide how to deal with it would be a fast solution for now. If we come up with stable interfaces in the future, it would of course be nice if the file could be opened automatically and accessed through the interface, but to me it seems we are not that far yet.
>
> Regarding accessing GSHHG directly, I think this would be a good idea. I have some native Julia code for rasterizing polygons/shapefiles here https://github.com/esa-esdl/ESDL.jl/blob/master/src/Proc/Shapes.jl#L123-L190 which I wanted to factor into a package anyway. Then we don't have to supply stuff like land-sea-masks on all possible resolutions and projections, but could generate them on the fly, which is sufficiently fast for my applications.

This would be indeed a quite useful package. Your work on the Earth System Datacube is very interesting!

visr commented 5 years ago

Ok yeah, if the original data is NetCDF only, that may be a good reason to use a different source.

For the raster data you mention just using Array, but I suppose we will at least need something like a CoordinateTransformations.jl AffineMap to know where the pixels are located; see also https://github.com/JuliaGeo/GeoInterface.jl/issues/16.
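To illustrate why a bare array is not enough, here is a sketch in plain Julia of the information such a map would carry. It does not use CoordinateTransformations.jl; the struct `PixelToCoord` and the function `pixelindex` are hypothetical names, and the sketch is restricted to the axis-aligned case (an origin plus a step per axis), which is a special case of a general affine map.

```julia
# Hypothetical stand-in for an affine pixel-to-coordinate map,
# written out for the axis-aligned 2-D case.
struct PixelToCoord
    origin::Tuple{Float64,Float64}   # coordinates of the center of pixel (1, 1)
    step::Tuple{Float64,Float64}     # coordinate increment per pixel along each axis
end

# Map array indices (i, j) to the coordinates of that pixel's center
function (m::PixelToCoord)(i::Integer, j::Integer)
    x = m.origin[1] + (i - 1) * m.step[1]
    y = m.origin[2] + (j - 1) * m.step[2]
    return (x, y)
end

# Inverse: indices of the pixel whose center is nearest to (x, y)
function pixelindex(m::PixelToCoord, x::Real, y::Real)
    i = round(Int, (x - m.origin[1]) / m.step[1]) + 1
    j = round(Int, (y - m.origin[2]) / m.step[2]) + 1
    return (i, j)
end

# A 1-degree global grid whose first pixel is centered at (-179.5, -89.5)
m = PixelToCoord((-179.5, -89.5), (1.0, 1.0))
```

For a regular grid this carries the same information as explicit longitude and latitude vectors, but in two small tuples instead of two arrays.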

So if you want, feel free to create the repository here! It would be nice to have for sure.

Alexander-Barth commented 5 years ago

OK, thanks for your positive input. I will start a new repo, GeoDatasets; feel free to comment by opening an issue there.