Closed jguerber closed 1 year ago
This is very cool!
But yes, adding ArchGDAL and DataFrames.jl/CSV.jl add a lot to load time, too much even to add to Rasters.jl. But this is fixable.
We may need to reconceptualise RasterDataSources.jl to include this kind of data source - especially when what we save is a subset from a unique query rather than a specific file. It's a little less useful to keep that on disk locally.
As your files are already JSON strings, did you consider just writing them as ascii rasters instead of using GDAL? There is some code doing the reverse operation here that you can use: https://github.com/mkborregaard/VerySimpleRasters.jl/blob/master/src/importASCII.jl. Probably Rasters.jl should have a direct ascii backend anyway, it would be much faster than GDAL. There is even a half finished one very early in the Rasters.jl commit history. We could make a separate tiny JuliaGeo package for it.
You don't need DataFrames.jl or CSV.jl for what you are doing. Julia also has an internal delimited file writing capacity in the DelimitedFiles
module. Lets use that instead and save a lot of load time. There is also Tables.jl and TableOperations.jl for basic table ops with much less overhead, if you really need a table at all. Vectors of NamedTuple
get you a long way (and as far as Tables.jl is concerned theyre a valid table)
The other option is just to make MODIS.jl to handle all this complexity. But I do like expanding RasterDataSources.jl better in an attempt to keep the api consistent across other similar data sources.
Thanks for the comments !
Yeah this way of getting the data is really different from the other sources. The thing is any data analysis project that needs MODIS data has to store the required subsets somehow (R's MODISTools uses dataframes, Python's modis-tools uses .hdf). But then with this current implementation all, say, NDVI MOD13Q1 data is written in the same folder. This may not be super practical, even more if the user does not only use Julia for their project and/or has several different ongoing projects. Maybe the download path could be made dependent on the project for MODIS-style downloads, but stay constant as ENV["RASTERDATASOURCES_PATH"]
for the other downloads ? We can take some time to think about this.
This is a very nice suggestion, I didn't know about ascii rasters ! And since _open
in Rasters is already meant to support several backends, adding another one should not be too complicated. I don't think it would then need another package, ascii rasters would just become one of the already existing supported raster file formats ? RasterDataSources would then not need a too big update as long as it's able to write ascii rasters.
Yep, i'm more familiar with R than Julia so i went with DataFrames but this is clearly doable without.
I also like the idea of having RDS and Rasters support as many sources and formats as possible, this makes the whole suite easier to understand and to maintain.
Hah I was wondering if you were an R user! Yes there are simpler constructs than dataframes in julia that should be just as good. A Vector
of NameTuple
is a really useful object.
As for storing the data returned by the request, I've been thinking about the options we have:
CSV: currently used. Not sure of the benefits, and it's not a immediately useful spatial format.
JSON: format we already downloaded, and we can just write
it. But also not a usable spatial format.
ASCII: low overhead generally useful spatial data file, although needs multiple files in a folder for each request.
TIF: more overheads, but a useful file, and better compression than ASCII in some cases. But also needs multiple files.
NetCDF: as for tif but all layers from a query can be stored in one file.
HDF5: Not sure why this would be chosen over NetCDF, it's not generally useful as spatial data.
Any preferences? To me ASCII looks pretty good, except we would need to make a folder for each request. But Rasters.jl loads a folder to a RasterStack
already, so its not a big deal. Otherwise I kind of like just writing the JSON directly.
Edit: also, R style .grd/.gri files are another really basic format like ascii - the metadata is very similar but the data is in a binary blob instead of text, which you can just write from an array and MMap to read. We could extract the reader/writer part from Rasters.jl into a separate package.
ASCII seems to be a good compromise, i've started playing around with it a bit in a Rasters.jl fork and it's really practical !
Sweet. I think we can make an ~80 line JuliaGeo/ASCIIrasters.jl package that reads and writes ASCII files from a NamedTuple or keywords. Then we can depend on it here and in Rasters.jl
Something like that ? I can transfer you ownership of the repo to add it in JuliaGeo if you think it looks good. It's mainly quick copy/pastes of what i was doing on my side, so maybe a bit of caution is needed but it's working already quite fine :)
It's 82 lines amazing 😂😂
That's exactly what I imagined, I'll transfer to JuliaGeo.
If we can get the bones of a Rasters.jl source using it and something here for Modis we will know it's solid enough to register.
I got rid of DataFrames, CSV and ArchGDAL dependencies. The tests are still a bit slow because we need several requests to cover all getraster
keyword argument cases but the build time is between 2 and 3 times shorter now.
Sorry I hadn't updated the package settings to allow CI runs without approval.
Base: 74.20% // Head: 80.81% // Increases project coverage by +6.61%
:tada:
Coverage data is based on head (
0982871
) compared to base (502c523
). Patch coverage: 93.82% of modified lines in pull request are covered.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
One comment: don't add so many more versions to CI. It's slow and the downloads are pretty large. At most test on 1.6 and 1 so we we have the long term and latest versions.
I'm not sure i get what you mean here. The Julia version used in CI.yml is only 1.6 ? I'm still not super familiar with github actions so maybe i'm misunderstanding something.
Ah I was just skimming and saw the 1.6, 1.7, 1.8
list. But that's actually compat! I was proabably confused because you don't need to list all the compatible Julia versions in compat, just the lowest one. So 1.6
is all you need there, its fine as-is
Passing! Congrats!
Just checking you are happy with everything here before I merge? The only minor quibble I have is km_ab
and sin_to_ll
being a little criptic, but its also not important in the wider scheme of things.
Looks good :)
Using km_ab
(meaning km above and below) is consistent with the R package for MODIS so it might not be that cryptic for users experienced with the R package. I agree for sin_to_ll
, there's a docstring but maybe it's not enough, idk
Ah ok. I couldn't actually see what ab
meant. I think we should at least put that in the doc string so there's a way to remember it for non R people.
Also FWIY julia leans a lot more towards spelling out acronyms than R and old fortran did, too many acronyms and it gets hard for new people to know whats going on.
Merged! thanks for the PR
You're welcome :)
Hi, it's me again !
As someone with a bit of experience with MODIS imagery and that would like to experiment with Julia, I started playing around with RasterDataSources to try and add support for rasters from the MODIS/VIIRS database.
Since the database functions quite differently to other
RasterDataSource
s (see below), this is quite a big PR which introduces new dependencies. This here should therefore not be seen as a PR that actually hopes to be merged one day, but rather as an invitation to discuss about how this could be improved and integrated into the Rasters ecosystem.Here follows a short-ish description of the MODIS database's organisation, what my fork does with it, and perspectives for bringing this to Rasters.jl users.
MODIS database
The MODIS website, as far as I know, does not provide download-ready raster files, data must be requested via their API and is available only in JSON or csv. This data must then be processed in order to convert it first to arrays, then to raster data.
The API also limits requests in spatial (<=200km height/width, no workaround yet) and temporal (<=10 16-days intervals, workaround already implemented) extent.
Current implementation
getraster
arguments are handled using already existing internal functions as much as possible. I also tried to be consistent with the rest of the package by using several internal_getraster
calls.list_dates
,list_layers
.modis_request
. For convenience, this uses both JSON.jl and DataFrames.jl but this could be rewritten to avoid usingDataFrame
s.DataFrame
,process_subset
uses ArchGDAL.jl to create .tif files.Files
getraster
,layers
,layerkeys
,rastername
, ...).examples.jl: Some named tuples for keyword arguments used to run tests or examples.
New dependencies
Ordered from most required to least required (easiest to rewrite without):
DataFrames.jl
Future prospects
Integration with Rasters.jl
Rasters.jl needs a small patch before being able to call e.g.
Raster(MODIS{MOD13Q1}, :NDVI; RDS.crozon...)
. It involves extending the number of keyword arguments recognized by_filter_kw
and passing some keyword arguments toRDS.date_sequence
. Since these changes rely on changes in an existing PR, i did not PR this for the moment.Merge/integration perspectives
I think that it would be really cool for Rasters.jl users to also be able to play around with MODIS data, but introducing so many dependencies in quite a big package ecosystem is problematic...
Maybe a solution would be to write a "plugin" package that would depend on both Rasters, RasterDataSources and the new dependencies, and would make the same exports as Rasters ? Another way could be to create a new raster data type, that would suit MODIS data and would be understandable by
Raster
. In both options, a lot of code from this PR will be reusable. What do you think would be the most logical approach in terms of package design ?I'm also super interested in comments/suggestions for the code, i'm still learning Julia so i'm sure that i missed a lot of possible tricks.
Whatever happens, it was (and still is !) quite interesting to look at RasterDataSources.jl and Rasters.jl, hopefully this can help bring more tools to the Julia spatial data community.