carpentries-incubator / geospatial-python

Introduction to Geospatial Raster and Vector Data with Python
https://carpentries-incubator.github.io/geospatial-python/

85 new example data #91

Closed rogerkuou closed 2 years ago

rogerkuou commented 2 years ago

Following the discussion in #85, this PR adds new example vector data to the course. Potentially, we can use them together with Sentinel-2 data.

Three example datasets are added. There is also a Notebook demonstrating S2 data cropping with the three datasets.

  1. Crop fields (Basisregistratie Gewaspercelen).

    • polygons
    • Dutch crop field boundaries.
    • Cropped to an AoI. The process of cropping can be found in the notebook.
  2. Dikes (BRO Geomorfologische Kaart van Nederland 2019 V1):

    • Dikes in the AoI, as polylines
    • Manually excluded irrelevant layers and cropped the data to the AoI with QGIS
    • Manually converted to polylines with QGIS.
  3. Groundwater monitoring wells (BRO Grondwatermonitoringnet):

    • points
    • Manually excluded irrelevant layers and cropped the data to the AoI with QGIS
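
For readers who want a feel for what "cropping to an AoI" means for the point dataset, here is a minimal sketch in plain Python. It is not the QGIS workflow described above, and it only handles a rectangular bounding box. The `AOI` values, the `crop_points_to_bbox` helper, and the sample features are all hypothetical placeholders, not data from this PR.

```python
# Hypothetical AoI as (min_x, min_y, max_x, max_y) in lon/lat.
# These numbers are placeholders, not the AoI used in this PR.
AOI = (4.80, 52.30, 5.00, 52.45)


def crop_points_to_bbox(features, bbox):
    """Keep GeoJSON-like point features whose coordinates fall inside bbox."""
    min_x, min_y, max_x, max_y = bbox
    kept = []
    for feature in features:
        # GeoJSON points store coordinates as [x, y] (optionally with z).
        x, y = feature["geometry"]["coordinates"][:2]
        if min_x <= x <= max_x and min_y <= y <= max_y:
            kept.append(feature)
    return kept


# Made-up sample features, for illustration only.
wells = [
    {
        "type": "Feature",
        "properties": {"name": "well-a"},
        "geometry": {"type": "Point", "coordinates": [4.90, 52.37]},
    },
    {
        "type": "Feature",
        "properties": {"name": "well-b"},
        "geometry": {"type": "Point", "coordinates": [5.50, 52.10]},
    },
]

inside = crop_points_to_bbox(wells, AOI)
print([f["properties"]["name"] for f in inside])
```

Polygons and polylines need a real geometry library for clipping; the bounding-box test here is only adequate for points.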
rogerkuou commented 2 years ago

Hi @rbavery, thanks for the feedback! I just removed the Notebook. Please feel free to have a last check.

rbavery commented 2 years ago

LGTM, feel free to merge it. I just talked with the NASA DEVELOP folks and they are very much on board with the idea of us reworking this set of lessons so that they start from STAC data and we only host vector files somewhere.

@rogerkuou @fnattino I'm thinking that "somewhere" should be this repo instead of Figshare, since the vector files are small and we won't need to host tif files anymore.

@rogerkuou If you want to, before merging, can you add a zip file containing all the vector data in the data folder? Then we could download the zip from this GitHub repo instead of Figshare. I wasn't sure if GitHub supported this, but it looks like it does: https://github.com/rbavery/testzip/raw/main/files.zip
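
A rough sketch of how a lesson could fetch and unpack such a zip, using only the Python standard library. The helper names `fetch_bytes` and `extract_zip` are made up for illustration, and the target folder is an assumption. The same two steps would apply to any direct download link, wherever the archive ends up being hosted.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_zip(data: bytes, dest: Path) -> list[str]:
    """Unpack an in-memory zip archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

With the test archive linked above, usage would look like `extract_zip(fetch_bytes("https://github.com/rbavery/testzip/raw/main/files.zip"), Path("data"))`, where `data` is an assumed local folder name.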

rogerkuou commented 2 years ago

Hi @rbavery, I had a discussion with @fnattino and would like to propose the following:

  1. We keep Figshare as our data repo, and transfer all vector data currently on GitHub to Figshare. The main reason is data version control. Also, on GitHub the data volume easily accumulates in the history, so even MB-level data can get out of control.

  2. I will crop the current AMS true-color image to slightly larger than the current AoI, and also host it on Figshare. This will be ~40 MB. After discussion, we think it is likely that the data episode will come at the very end of the course, and therefore we had better not involve anything STAC-related in earlier episodes. This is also part of the reason why we want to keep using Figshare.

  3. We replace the examples in other episodes with the new raster and vector data, and discard the old DSM data example, mainly to make the content more interesting. If we are already doing the two points above, it does not make sense to host another example.

We are still in the discussion phase, so just let us know what you think. Meanwhile, I will make a zip with all AMS vector + raster data.

rbavery commented 2 years ago

Also, on GitHub the data volume easily accumulates in the history, so even MB-level data can get out of control.

Makes sense, we won't use GitHub for hosting vector data then.

I will crop the current AMS true-color image to slightly larger than the current AoI, and also host it on Figshare. This will be ~40 MB. After discussion, we think it is likely that the data episode will come at the very end of the course, and therefore we had better not involve anything STAC-related in earlier episodes. This is also part of the reason why we want to keep using Figshare.

In the long term, do we still want to move away from hosting any raster data files and get all our raster data from STAC? Can we get that AMS data from STAC instead of hosting it on Figshare @fnattino @rogerkuou? I think it would be valuable to make this happen in the long term and shift the STAC Data Access episode so that it comes first. The STAC access episode is imo a more exciting introduction to Python's strengths when working with raster data. It could lead in nicely to the Working with Raster Data episode, rather than sitting at the back as Episode 17. I agree that in the near term it makes sense to host raster data on Figshare.

We replace the examples in other episodes with the new raster and vector data, and discard the old DSM data example, mainly to make the content more interesting. If we are already doing the two points above, it does not make sense to host another example.

I agree this is good to do in the near term, but I'm keen to hear what you both think about reordering and adapting the data access episode so that we query the raster files we need to work with and save them out up front in the first episode, then go through lessons like crop, reproject, raster calculations, etc.
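
To make the "query the raster files up front" idea concrete, here is a hedged sketch of a STAC API item search using only the standard library. The endpoint (`STAC_API`), the collection name, the bounding box, and the date range are all assumptions chosen for illustration, not values agreed on in this thread. The helpers `build_search_body` and `search_items` are hypothetical names. The request body follows the STAC API item-search fields `collections`, `bbox`, `datetime`, and `limit`.

```python
import json
import urllib.request

# Assumed public STAC API endpoint; swap in whichever catalog the lesson uses.
STAC_API = "https://earth-search.aws.element84.com/v1"


def build_search_body(bbox, start, end, collections, limit=10):
    """Build the JSON body for a STAC API item search (POST /search)."""
    return {
        "collections": list(collections),
        "bbox": list(bbox),  # [min_lon, min_lat, max_lon, max_lat]
        "datetime": f"{start}/{end}",  # closed RFC 3339 interval
        "limit": limit,
    }


def search_items(api_url, body):
    """Send the search request and return the matching items (GeoJSON features)."""
    request = urllib.request.Request(
        f"{api_url}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["features"]


# Placeholder search parameters, roughly around Amsterdam.
body = build_search_body(
    bbox=(4.80, 52.30, 5.00, 52.45),
    start="2020-03-20T00:00:00Z",
    end="2020-03-30T23:59:59Z",
    collections=["sentinel-2-l2a"],
)
print(json.dumps(body, indent=2))
```

Calling `search_items(STAC_API, body)` would then return items whose `assets` hold the raster URLs to download. In a lesson, a client library such as pystac-client would likely be more convenient than raw HTTP requests.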

rogerkuou commented 2 years ago

Hi @rbavery Thanks for the reply, and sorry for the long silence.

Indeed, I think starting this course with STAC could be fascinating. I agree with making it one flavor of the workshop, but maybe not a mandatory part by default. The major concern Francesco and I have is that the STAC part may be too difficult for some audiences. Besides, the STAC data repository may change, which could give us surprises. We still think that, at least for now, we should keep STAC as an optional episode. For certain workshops, if the instructors feel comfortable, they can choose to start with the STAC episode and make other episodes depend on it. @fnattino, if I missed something, please feel free to add it here.

We can take some time to decide on the data for the other episodes. For now, I think I will just try to add the new data to Figshare. Actually, @rbavery, I need your help with that. I put the raster and vector data into a .zip file and put it in my working drive. Could you please point me to a guide on how to update the Figshare data?

I will close this PR for now since we are not going to host the vector data on GH.