EcoJulia / RasterDataSources.jl

Easily download and use raster data sets in Julia
MIT License

Support MODIS/VIIRS as a new RasterDataSource #51

Closed jguerber closed 1 year ago

jguerber commented 1 year ago

Hi, it's me again !

As someone with a bit of experience with MODIS imagery who would like to experiment with Julia, I started playing around with RasterDataSources to try and add support for rasters from the MODIS/VIIRS database.

Since the database functions quite differently to other RasterDataSources (see below), this is quite a big PR which introduces new dependencies. It should therefore not be seen as a PR that actually hopes to be merged one day, but rather as an invitation to discuss how this could be improved and integrated into the Rasters ecosystem.

Here follows a short-ish description of the MODIS database's organisation, what my fork does with it, and perspectives for bringing this to Rasters.jl users.

MODIS database

As far as I know, the MODIS website does not provide download-ready raster files: data must be requested via their API and is available only as JSON or CSV. This data must then be processed to convert it first to arrays, then to raster data.

The API also limits the extent of each request, both spatially (<= 200 km height/width, no workaround yet) and temporally (<= 10 16-day intervals, workaround already implemented).
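To illustrate the temporal limit, here is a minimal sketch of one way a long date range could be split into batches of at most 10 dates, one batch per request. It is not the code from the fork: `chunk_dates` and `MAX_DATES_PER_REQUEST` are hypothetical names, and the 16-day step is taken from the description above.

```julia
using Dates

# Assumption: at most 10 composite dates can be requested at once.
const MAX_DATES_PER_REQUEST = 10

# Split a vector of dates into consecutive batches of at most `n` dates.
function chunk_dates(dates::AbstractVector{Date}; n::Int = MAX_DATES_PER_REQUEST)
    return [dates[i:min(i + n - 1, end)] for i in 1:n:length(dates)]
end

# One date every 16 days over 2020, starting on 1 January.
dates = collect(Date(2020, 1, 1):Day(16):Date(2020, 12, 31))

# Each batch would then be sent as its own request and the results concatenated.
batches = chunk_dates(dates)
```

The batches keep the original order, so concatenating the per-request results gives back the full time series.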

Current implementation

Files

Ordered from most required to least required (easiest to rewrite without):

rafaqz commented 1 year ago

This is very cool!

But yes, adding ArchGDAL and DataFrames.jl/CSV.jl adds a lot to load time, too much even to add to Rasters.jl. But this is fixable.

  1. We may need to reconceptualise RasterDataSources.jl to include this kind of data source - especially when what we save is a subset from a unique query rather than a specific file. It's a little less useful to keep that on disk locally.

  2. As your files are already JSON strings, did you consider just writing them as ascii rasters instead of using GDAL? There is some code doing the reverse operation here that you can use: https://github.com/mkborregaard/VerySimpleRasters.jl/blob/master/src/importASCII.jl. Probably Rasters.jl should have a direct ascii backend anyway, it would be much faster than GDAL. There is even a half finished one very early in the Rasters.jl commit history. We could make a separate tiny JuliaGeo package for it.

  3. You don't need DataFrames.jl or CSV.jl for what you are doing. Julia also has an internal delimited file writing capacity in the DelimitedFiles module. Let's use that instead and save a lot of load time. There is also Tables.jl and TableOperations.jl for basic table ops with much less overhead, if you really need a table at all. Vectors of NamedTuple get you a long way (and as far as Tables.jl is concerned they're a valid table).
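As a rough sketch of point 3, a Vector of NamedTuple can be written as delimited text with only the DelimitedFiles standard library. The records and the `write_table` helper below are made up for illustration and do not come from the PR.

```julia
using DelimitedFiles

# Hypothetical records, standing in for values parsed from an API response.
rows = [
    (date = "2020-01-01", band = "NDVI", value = 0.61),
    (date = "2020-01-17", band = "NDVI", value = 0.58),
]

# Write a header row from the field names, then one line per record.
# `writedlm` accepts any iterable of iterable rows, and a NamedTuple
# iterates over its values.
function write_table(io::IO, rows)
    writedlm(io, [keys(first(rows))], ',')
    writedlm(io, rows, ',')
end

# Write to a string here; passing an open file handle works the same way.
text = sprint(write_table, rows)
```

Fields stay accessible by name (`rows[1].value`) without loading DataFrames.jl.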

The other option is just to make MODIS.jl to handle all this complexity. But I do like expanding RasterDataSources.jl better in an attempt to keep the api consistent across other similar data sources.

jguerber commented 1 year ago

Thanks for the comments !

  1. Yeah this way of getting the data is really different from the other sources. The thing is any data analysis project that needs MODIS data has to store the required subsets somehow (R's MODISTools uses dataframes, Python's modis-tools uses .hdf). But then with this current implementation all, say, NDVI MOD13Q1 data is written in the same folder. This may not be super practical, even more if the user does not only use Julia for their project and/or has several different ongoing projects. Maybe the download path could be made dependent on the project for MODIS-style downloads, but stay constant as ENV["RASTERDATASOURCES_PATH"] for the other downloads ? We can take some time to think about this.

  2. This is a very nice suggestion, I didn't know about ascii rasters ! And since _open in Rasters is already meant to support several backends, adding another one should not be too complicated. I don't think it would then need another package, ascii rasters would just become one of the already existing supported raster file formats ? RasterDataSources would then not need too big an update, as long as it's able to write ascii rasters.

  3. Yep, i'm more familiar with R than Julia so i went with DataFrames but this is clearly doable without.

I also like the idea of having RDS and Rasters support as many sources and formats as possible, this makes the whole suite easier to understand and to maintain.

rafaqz commented 1 year ago

Hah I was wondering if you were an R user! Yes there are simpler constructs than dataframes in julia that should be just as good. A Vector of NamedTuple is a really useful object.

As for storing the data returned by the request, I've been thinking about the options we have:

CSV: currently used. Not sure of the benefits, and it's not an immediately useful spatial format.

JSON: format we already downloaded, and we can just write it. But also not a usable spatial format.

ASCII: low overhead, generally useful spatial data file, although it needs multiple files in a folder for each request.

TIF: more overheads, but a useful file, and better compression than ASCII in some cases. But also needs multiple files.

NetCDF: as for tif but all layers from a query can be stored in one file.

HDF5: Not sure why this would be chosen over NetCDF, it's not generally useful as spatial data.

Any preferences? To me ASCII looks pretty good, except we would need to make a folder for each request. But Rasters.jl loads a folder to a RasterStack already, so it's not a big deal. Otherwise I kind of like just writing the JSON directly.

Edit: also, R style .grd/.gri files are another really basic format like ascii - the metadata is very similar but the data is in a binary blob instead of text, which you can just write from an array and MMap to read. We could extract the reader/writer part from Rasters.jl into a separate package.
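A minimal sketch of the "binary blob" idea, using only `write` and the Mmap standard library. This is not the actual .grd/.gri layout: the file name is hypothetical, and the element type and dimensions, which a real header file would carry, are hard-coded here.

```julia
using Mmap

# A small 2×3 layer standing in for raster data.
data = Float32[1 2 3; 4 5 6]

# Hypothetical file name; a real writer would also emit a text header
# recording the element type, dimensions, and extent.
path = joinpath(mktempdir(), "layer.gri")

# Write the raw bytes of the array (column-major order).
nbytes = write(path, data)

# Memory-map the file back as a matrix with the same type and shape.
restored = Mmap.mmap(path, Matrix{Float32}, size(data))
```

Reading through `Mmap.mmap` avoids loading the whole file eagerly, which is what makes this kind of format cheap to open.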

jguerber commented 1 year ago

ASCII seems to be a good compromise, i've started playing around with it a bit in a Rasters.jl fork and it's really practical !

rafaqz commented 1 year ago

Sweet. I think we can make an ~80 line JuliaGeo/ASCIIrasters.jl package that reads and writes ASCII files from a NamedTuple or keywords. Then we can depend on it here and in Rasters.jl
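For a sense of how small such a reader/writer can be, here is a hedged sketch of the ESRI ASCII grid format (six header lines followed by one text row per raster row). The function names and keywords are illustrative only and are not the API of the package discussed here.

```julia
# Write a matrix as an ASCII grid: six header lines, then the rows as text.
function write_ascii_grid(io::IO, data::AbstractMatrix;
                          xllcorner, yllcorner, cellsize, nodata = -9999)
    nrows, ncols = size(data)
    println(io, "ncols        ", ncols)
    println(io, "nrows        ", nrows)
    println(io, "xllcorner    ", xllcorner)
    println(io, "yllcorner    ", yllcorner)
    println(io, "cellsize     ", cellsize)
    println(io, "NODATA_value ", nodata)
    for row in eachrow(data)
        println(io, join(row, ' '))
    end
end

# Read the grid back as a matrix plus a Dict of header values.
function read_ascii_grid(io::IO)
    header = Dict{String,Float64}()
    for _ in 1:6
        key, value = split(readline(io))
        header[lowercase(key)] = parse(Float64, value)
    end
    nrows, ncols = Int(header["nrows"]), Int(header["ncols"])
    cells = parse.(Float64, split(read(io, String)))
    # Values are stored row by row, so fill column-major and transpose.
    data = permutedims(reshape(cells, ncols, nrows))
    return data, header
end

# Round trip through an in-memory buffer.
grid = [1.0 2.0 3.0; 4.0 5.0 6.0]
io = IOBuffer()
write_ascii_grid(io, grid; xllcorner = 10.0, yllcorner = 45.0, cellsize = 0.5)
seekstart(io)
roundtrip, header = read_ascii_grid(io)
```

This sketch assumes the header keys appear in the fixed order shown and skips any handling of missing values beyond storing the NODATA marker.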

jguerber commented 1 year ago

Something like that ? I can transfer you ownership of the repo to add it in JuliaGeo if you think it looks good. It's mainly quick copy/pastes of what i was doing on my side, so maybe a bit of caution is needed but it's working already quite fine :)

rafaqz commented 1 year ago

It's 82 lines amazing 😂😂

That's exactly what I imagined, I'll transfer to JuliaGeo.

If we can get the bones of a Rasters.jl source using it and something here for Modis we will know it's solid enough to register.

jguerber commented 1 year ago

I got rid of DataFrames, CSV and ArchGDAL dependencies. The tests are still a bit slow because we need several requests to cover all getraster keyword argument cases but the build time is between 2 and 3 times shorter now.

rafaqz commented 1 year ago

Sorry I hadn't updated the package settings to allow CI runs without approval.

codecov[bot] commented 1 year ago

Codecov Report

Base: 74.20% // Head: 80.81% // Increases project coverage by +6.61% :tada:

Coverage data is based on head (0982871) compared to base (502c523). Patch coverage: 93.82% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   74.20%   80.81%   +6.61%
==========================================
  Files          15       18       +3
  Lines         469      709     +240
==========================================
+ Hits          348      573     +225
- Misses        121      136      +15
```

| Impacted Files | Coverage Δ |
|---|---|
| src/types.jl | `71.42% <ø> (ø)` |
| src/modis/shared.jl | `92.77% <92.77%> (ø)` |
| src/modis/utilities.jl | `93.26% <93.26%> (ø)` |
| src/modis/products.jl | `96.22% <96.22%> (ø)` |
| src/shared.jl | `64.10% <100.00%> (ø)` |

rafaqz commented 1 year ago

One comment: don't add so many more versions to CI. It's slow and the downloads are pretty large. At most test on 1.6 and 1 so we have the long term and latest versions.

jguerber commented 1 year ago

I'm not sure i get what you mean here. The Julia version used in CI.yml is only 1.6 ? I'm still not super familiar with github actions so maybe i'm misunderstanding something.

rafaqz commented 1 year ago

Ah I was just skimming and saw the 1.6, 1.7, 1.8 list. But that's actually compat! I was probably confused because you don't need to list all the compatible Julia versions in compat, just the lowest one. So 1.6 is all you need there, but it's fine as-is.

rafaqz commented 1 year ago

Passing! Congrats!

Just checking you are happy with everything here before I merge? The only minor quibble I have is km_ab and sin_to_ll being a little cryptic, but it's also not important in the wider scheme of things.

jguerber commented 1 year ago

Looks good :)

Using km_ab (meaning km above and below) is consistent with the R package for MODIS so it might not be that cryptic for users experienced with the R package. I agree for sin_to_ll, there's a docstring but maybe it's not enough, idk

rafaqz commented 1 year ago

Ah ok. I couldn't actually see what ab meant. I think we should at least put that in the doc string so there's a way to remember it for non R people.

Also FWIW julia leans a lot more towards spelling out acronyms than R and old fortran did. Too many acronyms and it gets hard for new people to know what's going on.

rafaqz commented 1 year ago

Merged! thanks for the PR

jguerber commented 1 year ago

You're welcome :)