[NBI] Exploring Land Cover Data

jamesdamillington commented 2 years ago

What is the notebook about?

This notebook introduces raster land cover data with simple manipulation and basic exploratory analysis techniques. The notebook will be based largely on the existing notebooks I have used for teaching and will examine:

Overview of raster data characteristics
Reading raster data
Plotting categorical raster maps
Analysing aggregate change (through bar charts and similar visualisations)
Analysing zonal change (using ancillary vector data)
Analysing pixel-by-pixel change (including use of sankey diagrams)
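To make the "aggregate change" step concrete, here is a minimal sketch with numpy. The two 3x3 maps and the class codes (1 = forest, 2 = crop, 3 = urban) are made up for illustration and are not taken from any of the datasets discussed in this issue.

```python
import numpy as np

# Hypothetical class codes: 1 = forest, 2 = crop, 3 = urban
lc_2000 = np.array([[1, 1, 2],
                    [1, 2, 2],
                    [3, 3, 2]])
lc_2020 = np.array([[1, 2, 2],
                    [3, 2, 2],
                    [3, 3, 3]])

def class_counts(arr, classes):
    """Return the number of pixels in each class, in the order given by `classes`."""
    return np.array([int((arr == c).sum()) for c in classes])

classes = [1, 2, 3]
counts_2000 = class_counts(lc_2000, classes)
counts_2020 = class_counts(lc_2020, classes)

# Net change per class between the two dates (in pixels)
net_change = counts_2020 - counts_2000
print(dict(zip(classes, net_change.tolist())))
```

In this toy example class 2 shows no net change even though it gained one pixel and lost another, which is the kind of detail that only a pixel-by-pixel comparison reveals.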

There are now many data sources of classified (categorical) land cover data that are useful for Environmental Data Science. These include:

ESA CCI land cover, 300m spatial resolution global extent for years 1992-2015
Copernicus Global Land Cover, 100m spatial resolution global extent for years 2015-2019
USGS LCMAP, 30m spatial resolution for USA for years 1985-2020
UKCEH LCMs, various spatial resolutions for UK for various years 1990-2020
mapbiomas, 30m spatial resolution for Brazil for years 1985-2020

Considerations for deciding which of these sources to use in this notebook include:

licensing: some of these data sources have open licences, others are more restrictive (notably UKCEH LCMs)
data access: some data sources can easily be accessed programmatically via API, others less easily
data volume: fine-grained data over large extents result in large data (e.g. ESA CCI 300m global extent for a single year is ~300MB)
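As a rough back-of-the-envelope check on data volume, the sketch below estimates the uncompressed size of a single-band global grid. The grid dimensions are an assumption (300 m treated as roughly 1/360 of a degree on a regular latitude/longitude grid, one byte per pixel); the ~300MB figure quoted above presumably refers to a compressed file on disk.

```python
def uncompressed_size_bytes(n_cols, n_rows, bytes_per_pixel=1, n_bands=1):
    """Size of a raster held in memory as a dense array, ignoring compression."""
    return n_cols * n_rows * bytes_per_pixel * n_bands

# Assumed global grid: 360 pixels per degree in both directions
n_cols = 360 * 360  # 360 degrees of longitude
n_rows = 180 * 360  # 180 degrees of latitude

size = uncompressed_size_bytes(n_cols, n_rows)
size_gb = size / 1e9
print(f"{n_cols} x {n_rows} pixels, about {size_gb:.1f} GB uncompressed")
```

Under these assumptions a single year would need several GB once read fully into a numpy array, which is one reason to subset by bounding box or to work at a coarser resolution.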

Code and packages used in this notebook will initially be those used in the original teaching notebooks, notably:

rasterio with numpy ndarrays
pandas and geopandas
matplotlib and seaborn

In time, the code can be changed to use packages from the pangeo ecosystem

Data Science Component

[ ] Sensor visualisation
[ ] Preprocessing
[ ] Modelling
[ ] Post-processing
[x] Other: Data Manipulation and Basic Analysis (best fits in the Exploration section of the current Gallery)

Checklist:

[x] Input data, pipeline and/or model are public with license/citation
[x] The proposed notebook reuses existing codebase
[x] The proposed notebook uses open-source packages
[ ] The proposed notebook is associated to existing publication(s)

Additional information

acocac commented 2 years ago

@jamesdamillington thanks for logging the notebook idea. The outline and suggested datasets look great to me. Some comments as follows:

Outline

Overview of raster data characteristics

What do you mean by raster characteristics? Size, number of bands?

Reading raster data

Try rioxarray

Plotting categorical raster maps
Analysing aggregate change (through bar charts and similar visualisations)

Try Holoviz visualisation toolkits. Most of the existing notebooks in the EnvDS book use them for interactive plotting.

Analysing zonal change (using ancillary vector data)

Use zonal_crosstab from Xarray-Spatial. See here an example in Microsoft's Planetary computer.
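For readers unfamiliar with the idea, a zonal cross-tabulation counts how many pixels of each land cover class fall inside each zone. The sketch below is a plain-numpy illustration of that idea, not Xarray-Spatial's zonal_crosstab itself; the function name, zone ids and class codes are all hypothetical.

```python
import numpy as np

def zonal_class_counts(zones, classes_raster, zone_ids, class_ids):
    """Count pixels of each class within each zone.

    Returns an integer array of shape (len(zone_ids), len(class_ids)).
    Both rasters must have the same shape and alignment.
    """
    table = np.zeros((len(zone_ids), len(class_ids)), dtype=int)
    for i, z in enumerate(zone_ids):
        in_zone = zones == z
        for j, c in enumerate(class_ids):
            table[i, j] = np.count_nonzero(in_zone & (classes_raster == c))
    return table

# A rasterised zone layer (e.g. two administrative units, ids 10 and 20)
zones = np.array([[10, 10, 20],
                  [10, 20, 20],
                  [10, 20, 20]])
# A land cover raster on the same grid (made-up class codes 1, 2, 3)
land_cover = np.array([[1, 1, 2],
                       [1, 2, 2],
                       [3, 3, 2]])

table = zonal_class_counts(zones, land_cover, zone_ids=[10, 20], class_ids=[1, 2, 3])
print(table)
```

Running the same function on two dates and differencing the tables would give change per zone. With vector zones, the polygons would first need to be rasterised onto the land cover grid.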

Analysing pixel-by-pixel change (including use of sankey diagrams)

Not sure which library is the most optimal, but holoviews seems to support interactive sankey diagrams.
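Whichever plotting library is used, a Sankey diagram of land cover change needs a list of (source, target, value) links. A minimal sketch of building that list from two aligned class rasters is below; the maps and class names are made up, and the "(before)" / "(after)" suffixes are just one way of keeping source and target nodes distinct so that unchanged pixels do not create self-loops.

```python
from collections import Counter

import numpy as np

def transition_links(before, after, names):
    """Return sorted (source, target, n_pixels) links for pixel-by-pixel change."""
    pairs = zip(before.ravel().tolist(), after.ravel().tolist())
    counts = Counter(pairs)
    return [
        (f"{names[a]} (before)", f"{names[b]} (after)", n)
        for (a, b), n in sorted(counts.items())
    ]

names = {1: "forest", 2: "crop", 3: "urban"}  # hypothetical classes
before = np.array([[1, 1, 2],
                   [1, 2, 2],
                   [3, 3, 2]])
after = np.array([[1, 2, 2],
                  [3, 2, 2],
                  [3, 3, 3]])

links = transition_links(before, after, names)
for link in links:
    print(link)
```

As far as I know, an edge list of this shape is what holoviews' Sankey element accepts, but that is worth checking against its documentation.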

Datasets

All datasets are great and considerations are very relevant. My suggestion is to consider coarser datasets such as MODIS MCD12Q1. The MODIS tiles h17v03 and h18v03 cover the whole UK. The average size of each tile is 6MB per year.

Following the MODIS wildfire notebook which fetches the dataset from NASA’s Earth Data site, find below how I suggest downloading the MODIS land cover dataset for a single year and tile. We'll need to list file names for the remaining years and tiles. You can merge tiles using the merge function in rioxarray (see here).

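import os
from pathlib import Path

import aiohttp
import fsspec
import matplotlib.pyplot as plt
import rioxarray as rioxr

notebook_folder = './general-exploration-landcover_modis'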
if not os.path.exists(notebook_folder):
    os.makedirs(notebook_folder)

fnames = ['MCD12Q1.A2017001.h17v03.006.2019196134714.hdf', 'MCD12Q1.A2018001.h17v03.006.2019199221720.hdf']

for fname in fnames:
    if not os.path.isfile(os.path.join(notebook_folder, fname)) or os.path.getsize(os.path.join(notebook_folder, fname)) == 0:
        username = 'XXX'  # replace with your EarthData username if running locally or in Binder
        password = 'XXX'  # replace with your EarthData password if running locally or in Binder

        fsspec.config.conf['https'] = dict(client_kwargs={'auth': aiohttp.BasicAuth(username, password)})

        url = f'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MCD12Q1/{fname[9:13]}/001/{fname}'
        filename = url.split('/')[-1]
        with fsspec.open(url) as f:
            with Path(os.path.join(notebook_folder, filename)).open('wb') as handle:
                data = f.read()
                try:
                    data.decode('utf-8')
                    raise RuntimeError('Could not download MODIS data! Have you authorized LAADS Web in your Earthdata account above?')
                except UnicodeDecodeError:
                    handle.write(data)

# open a single file
modis_hdf = rioxr.open_rasterio(os.path.join(notebook_folder,fnames[0]))

modis_hdf.LC_Type1.plot()
plt.show()
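The download loop above builds each URL by slicing the year out of the file name with fname[9:13]. A slightly more explicit sketch of the same idea is below. It assumes the file names always have the six dot-separated parts seen in the two examples, and that the collection folder in the URL is the collection code without leading zeros; both are assumptions based only on the URL used above, and the function names are hypothetical.

```python
def parse_modis_fname(fname):
    """Split a MODIS file name into its dot-separated components."""
    product, acquired, tile, collection, production, _ext = fname.split('.')
    return {
        'product': product,        # e.g. MCD12Q1
        'year': acquired[1:5],     # 'A2017001' -> '2017'
        'doy': acquired[5:8],      # 'A2017001' -> '001'
        'tile': tile,              # e.g. h17v03
        'collection': collection,  # e.g. 006
        'production': production,  # production timestamp, differs per file
    }

def laads_url(fname):
    """Build the LAADS archive URL following the pattern used in the loop above."""
    p = parse_modis_fname(fname)
    base = 'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData'
    return f"{base}/{int(p['collection'])}/{p['product']}/{p['year']}/{p['doy']}/{fname}"

fname = 'MCD12Q1.A2017001.h17v03.006.2019196134714.hdf'
parts = parse_modis_fname(fname)
url = laads_url(fname)
print(parts['year'], parts['tile'])
print(url)
```

Because the production timestamp is part of each file name, the names for other years and tiles still have to be listed or looked up rather than generated.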

An alternative way to fetch datasets is via STAC (see this example). I've explored the catalog of the Planetary Computer, but it doesn't include the MODIS MCD12Q1 product.

Next steps

My suggestion is to take the route of MODIS or another product you feel comfortable with and follow the steps in the submission guidelines section.

As part of the submission process, once you have used the notebook template in your personal repository, please post the URL of the repo in the comment box. Thank you!

ps. apologies for the late reply, but I was participating in some project meetings last week. FYI, I've updated notebook templates and submission guidelines. Contributions to improve them would be welcome. I hope you enjoy the submission process and in general the community-driven aspects of the project (:

acocac commented 2 years ago

@jamesdamillington I've also found how to fetch lc/lu datasets from stac. Find below a code snippet to fetch ~Esri 10-Meter Land Cover (10-class)~ 10m Annual Land Use Land Cover (9-class) over London using a target resolution of 500 m (it takes longer to fetch the native resolution of 10 m). The example is a combination of two existing notebooks, ODC.stac and Microsoft Planetary. The multi-temporal ~Esri 10-Meter Land Cover (10-class)~ 10m Annual Land Use Land Cover (9-class) seems to suit your notebook idea. Feel free to reuse the code for your contribution. Note you should install pystac_client and odc-stac.

## Example multitemporal land use (london)
from pystac_client import Client
import geopandas as gpd
import matplotlib.pyplot as plt
from odc.stac import stac_load
import rasterio
from pystac.extensions.item_assets import ItemAssetsExtension
import numpy as np
from matplotlib.colors import ListedColormap
import pandas as pd

km2deg = 1.0 / 111
x, y = (-0.118092, 51.509865)  # Center point of a query
r = 100 * km2deg
bbox = (x - r, y - r, x + r, y + r)

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

query = catalog.search(
    collections=["io-lulc-9-class"],
    limit=100,
    bbox=bbox
)

items = list(query.get_items())
print(f"Found: {len(items):d} datasets")

# Convert STAC items into a GeoJSON FeatureCollection
stac_json = query.get_all_items_as_dict()

gdf = gpd.GeoDataFrame.from_features(stac_json, "epsg:4326")

fig = gdf.plot(
    "io:tile_id",
    edgecolor="black",
    categorical=True,
    aspect="equal",
    alpha=0.5,
    figsize=(6, 12),
    legend=True,
    legend_kwds={"loc": "upper left", "frameon": False, "ncol": 1},
)

plt.show()

# Load with bounding box
r = 40 * km2deg
small_bbox = (x - r, y - r, x + r, y + r)
crs = "epsg:3857"

yy = stac_load(
    items,
    bands=("data",),  # one-element tuple; ("data") without the comma is just a string
    crs=crs,
    resolution=500,
    chunks={},  # <-- use Dask
    groupby="start_datetime",
    bbox=small_bbox,
)

merged = yy.compute()

_ = (
    merged.isel(time=0)
    .to_array("band")
    .plot.imshow(
        col="band",
        size=4,
    )
)
plt.show()

g = merged['data'].plot(col="time")
plt.show()

collection = catalog.get_collection("io-lulc-9-class")
ia = ItemAssetsExtension.ext(collection)

x = ia.item_assets["data"]
class_names = {x["summary"]: x["values"][0] for x in x.properties["file:values"]}
values_to_classes = {v: k for k, v in class_names.items()}
class_count = len(class_names)

with rasterio.open(items[0].assets["data"].href) as src:
    colormap_def = src.colormap(1)  # get metadata colormap for band 1
    colormap = [
        np.array(colormap_def[i]) / 255 for i in range(max(class_names.values()))
    ]  # transform to matplotlib color format

cmap = ListedColormap(colormap)

vmin = 0
vmax = max(class_names.values())

p = merged.data.plot(
    col="time",
    cmap=cmap,
    vmin=vmin,
    vmax=vmax,
    figsize=(16, 6),
)
ticks = np.linspace(0.5, 10.5, 11)
labels = [values_to_classes.get(i, "") for i in range(cmap.N)]
p.cbar.set_ticks(ticks, labels=labels)
p.cbar.set_label("Class")
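One note on the bounding box in the snippet above: km2deg = 1.0 / 111 is applied to both longitude and latitude. Away from the equator a degree of longitude covers less ground than a degree of latitude, so the resulting box is narrower east-west than north-south. The sketch below shows one way to correct for this with a cosine-of-latitude factor. It is an approximation on a spherical Earth, the helper name is hypothetical, and it is not something the original snippet needs in order to run.

```python
import math

KM_PER_DEG_LAT = 111.0  # same rough figure as km2deg = 1.0 / 111 above

def bbox_around(lon, lat, half_width_km):
    """Return (min_lon, min_lat, max_lon, max_lat) roughly half_width_km in each direction."""
    dlat = half_width_km / KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    dlon = half_width_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

x, y = (-0.118092, 51.509865)  # centre point used in the snippet above
r = 40 / KM_PER_DEG_LAT
naive = (x - r, y - r, x + r, y + r)
corrected = bbox_around(x, y, 40)

width_ratio = (corrected[2] - corrected[0]) / (naive[2] - naive[0])
print(f"corrected box is {width_ratio:.2f}x wider in longitude")
```

For a quick exploratory plot the simpler box is probably fine; the correction matters more if the extent is meant to be a specific distance on the ground.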

jamesdamillington commented 2 years ago

Thanks for all these ideas @acocac - I'll explore these packages soon. Using these rather than ones I've used previously may slow completion of the chapter, but it's always good to learn new packages and to be consistent with other content in the book. I'll create a repo with the template soon and share the URL once that's done.

jamesdamillington commented 2 years ago

Chapter repo is here: https://github.com/jamesdamillington/landcover-exploration-nlcd

I am currently planning to use the C-CAP NLCD data: https://planetarycomputer.microsoft.com/dataset/noaa-c-cap The ESRI data for London is not appropriate as it is for a single year and some of my analyses require a time series.

acocac commented 2 years ago

@jamesdamillington no rush 😎 thanks for sharing the repo too. Feel free to complete the notebook sections using the dependencies of the original notebook of landcover exploration. The initial ideas are only potential workarounds to fetch input data. We can have a look at the other suggested packages e.g. Xarray-Spatial during the revision process.

Please add the dependencies to the environment.yml file; the GitHub Action will then help assess how reproducible the notebook is (at least on Linux).

acocac commented 2 years ago

I am currently planning to use the C-CAP NLCD data: https://planetarycomputer.microsoft.com/dataset/noaa-c-cap The ESRI data for London is not appropriate as it is for a single year and some of my analyses require a time series.

Sorry for the confusion, I didn't mean the ESRI data, instead the code snippet uses the multitemporal 10m Annual Land Use Land Cover (9-class) generated by Impact Observatory.

jamesdamillington commented 2 years ago

I am currently planning to use the C-CAP NLCD data: https://planetarycomputer.microsoft.com/dataset/noaa-c-cap The ESRI data for London is not appropriate as it is for a single year and some of my analyses require a time series.

Sorry for the confusion, I didn't mean the ESRI data, instead the code snippet uses the multitemporal 10m Annual Land Use Land Cover (9-class) generated by Impact Observatory.

Okay, great. I'll check it out. But also, I'm now thinking that I'd like to work with MapBiomas data which should be accessible via Google Earth Engine API (hopefully can reduce resolution on server side)

jamesdamillington commented 2 years ago

Right, so I think the authentication needed to use the Google Earth Engine API is going to be overly restrictive. So I have used the code you suggested above and added it to the initial notebook, plus updated environment.yml to include pystac_client and odc.stac (had some trouble installing these on Linux Mint OS due to a [cross-channel conflict from C libraries](https://stackoverflow.com/q/66914685) - need to force strict use of conda-forge).

Initial code seems to work!

Will the environment be automatically checked?

Once environment is checked I'll continue working with these data to integrate my existing code (can update to other suggested packages e.g. Xarray-Spatial later).

Finally, should the repo/notebook name change if I use the ESA Sentinel data instead of the NOAA C-CAP data?

acocac commented 2 years ago

@jamesdamillington thanks for the update. Find below some comments.

Right, so I think the authentication needed to use the Google Earth Engine API is going to be overly restrictive.

Agreed, authentication to certain platforms such as GEE might be restrictive. Have you tried to access GEE STAC items using the odc.stac library?

So I have used the code you suggested above and added it to the initial notebook, plus updated environment.yml to include pystac_client and odc.stac (had some trouble installing these on Linux Mint OS due to a [cross-channel conflict from C libraries](https://stackoverflow.com/q/66914685) - need to force strict use of conda-forge).

The current template only checks whether the notebook works on Ubuntu Linux. If you find it useful, we can report the conflict and the workaround for Linux Mint in the README of the notebook repo.

Will the environment be automatically checked?

The environment will be checked automatically at every push. You can monitor it in the Actions tab.

Once environment is checked I'll continue working with these data to integrate my existing code (can update to other suggested packages e.g. Xarray-Spatial later).

Great! Feel free to add the suggested packages later.

Finally, should the repo/notebook name change if I use the ESA Sentinel data instead of the NOAA C-CAP data?

We suggest following the file name conventions indicated in the submission guidelines:

pattern (XXX-YYY-ZZZ, where XXX refers to the environmental system, YYY to the theme and ZZZ to a preferred identifier of the model, dataset or pre/post-processing pipeline).
- For the LC notebook I'd rename it as “general-exploration-landcover_.ipynb”.
- Bear in mind you should also change the name in the config.json file.
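As an illustration of the XXX-YYY-ZZZ pattern, here is a small sketch that splits a notebook name into its three parts. The function name is hypothetical and the check is only what the pattern description above implies (three non-empty parts separated by hyphens, with anything after the second hyphen treated as the identifier). It is not an official validator from the submission guidelines.

```python
def split_notebook_name(filename):
    """Split 'system-theme-identifier[.ipynb]' into its three components."""
    stem = filename[:-len('.ipynb')] if filename.endswith('.ipynb') else filename
    parts = stem.split('-', 2)  # identifier may itself contain hyphens or underscores
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{filename!r} does not follow the XXX-YYY-ZZZ pattern")
    system, theme, identifier = parts
    return {'system': system, 'theme': theme, 'identifier': identifier}

# Folder name used earlier in this thread, reused here as an example
info = split_notebook_name('general-exploration-landcover_modis')
print(info)
```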

jamesdamillington commented 2 years ago

Repo/Notebook updated: https://github.com/jamesdamillington/general-exploration-landcover

Linux Mint is built on Ubuntu so the two should align. The environment issue is easily fixed if all packages are sourced from conda-forge (so need to set this default in the environment.yml, which I think I have now done).

So now, onwards with writing the notebook!

jamesdamillington commented 2 years ago

Hi @acocac - first draft is now complete for you to have a look at and feedback on!

acocac commented 2 years ago

@jamesdamillington the notebook looks great and well-structured! The GitHub Actions run confirms the proposed executable content is reproducible too, at least on Linux (:

For the reviewing process, may I ask you to transfer the repo to the Environmental-DS-Book organization? This will make it easier for reviewers to preview the rendered version, in particular cell outputs with interactive plotting.

fyi, the main stages after the transfer are:

pre-print: the editorial team prepares the notebook repository for the reviewing process. The stage includes ensuring the notebook repo is transferred to the Environmental-DS-Book organisation. The notebook should generate the same outputs as the initial repository hosted in the personal GH account of the contributing author.
review round(s): author(s) and reviewer(s) work together to improve the proposed plain and executable content of the notebook.
post-print: reviewer(s) and editor(s) recommend the notebook for publication. The editorial team will share proofs (the draft of the final formatting).
publication: refers to the dissemination of the notebook in the official communication channels of the project, e.g. Twitter.

I hope the above stages are clear. I look forward to finding additional reviewers of your great notebook 🎸 and starting the collaborative reviewing process 🤓

jamesdamillington commented 2 years ago

@acocac Sure, how do we transfer the repo to the EDS Book organization?

acocac commented 2 years ago

@jamesdamillington find some instructions here. In the organization name field you should type Environmental-DS-Book. Let me know if you have any questions!

jamesdamillington commented 2 years ago

Hi @acocac - I tried following instructions at that link to transfer the repo. I got the message:

You don’t have the permission to create public repositories on Environmental-DS-Book

acocac commented 2 years ago

@jamesdamillington apologies I forgot the key step of adding you to the organization 🙃 Can you accept the invite, and then try the transfer? Cheers

jamesdamillington commented 2 years ago

Right! Transfer in progress...

acocac commented 2 years ago

Great! I confirm the transfer is completed. I'll prepare the notebook for the reviewing process. I have some potential reviewers in mind and will let you know the final ones when we start the review round(s). Thanks for your contribution 🚀

acocac commented 2 years ago

@jamesdamillington I'm delighted to mention we have started the reviewing process of your NBI. @annefou and @aedebus will kindly support the revision of the proposed technical and conceptual content of the contribution.

Hope you all have a great collaborative reviewing experience towards a common goal, Open Environmental Science for All 📗 🚀

jamesdamillington commented 2 years ago

Thanks @acocac Do I need to do anything more right now, or just wait for reviews? For example, the submission guidelines state

A maintainer of the EnvDS book will assist you to add the notebook to a new branch in the main repo. After, a pull request will be created. In the PR, you will have to fill a form with a series of questions related to the contribution. Please complete them.

I can see this PR - do I need to merge it now or is that for you/reviewers to do?

acocac commented 2 years ago

@jamesdamillington thanks for the quick reply. Apologies for any confusion with the current submission documentation.

Thanks @acocac Do I need to do anything more right now, or just wait for reviews? For example, the submission guidelines state

You should wait for reviews. I'll update that particular section in the submission guidelines as it is no longer required to fill in such a form (you already provided some context in the NBI). The form is instead filled in by one of the EDS maintainers, see PR #110. Once we finish the post-print stage, I (as current maintainer) will merge the PR into the main branch of the EDS book repo.

I can see this PR - do I need to merge it now or is that for you/reviewers to do?

You don't need to merge. It is one of the roles of the editor (again myself 🙃, hopefully with more volunteers joining in the future) to merge the PR to the main branch of the notebook repository. We will do it once you and the reviewers agree that the review round has been satisfactorily completed.

jamesdamillington commented 2 years ago

Great, that's how I essentially understood things, but the text in the submission guidelines was a little confusing. :)

acocac commented 2 years ago

Nw (: the EDS is a work-in-progress, community-driven project and we haven't reached version 1.0.0 yet, so we really appreciate any feedback from contributors. You're welcome to suggest changes to the guidelines. We'll credit them in the contribution types when I add you to the list of contributors.

jamesdamillington commented 2 years ago

Hi @acocac Yes, happy to help develop the submission guidelines, mainly by asking questions to help my understanding of the review process! (and then I can suggest text later).

For example, the Reviewing guidelines state that

the interaction of the authors [and reviewers?] is facilitated through ReviewNB

This would benefit (me at least!) from providing a little more guidance. ReviewNB looks great but I'm not entirely clear on how to use it within the process of deciding on changes (and editing the notebook). For example, I see that I can write responses to comments from reviewers in the text box - should I make a reply about a suggested edit in ReviewNB before making a notebook edit? Or just go ahead and make the notebook edit (see below) as I see fit and then reply? Who clicks the Resolve Conversation button? I assume that's for the reviewer to do once they are happy the comment has been appropriately addressed? How does that link to the editor's responsibility of approving PRs?

Then, the second major issue for me currently: how should I actually make edits to the notebook in response to the reviewer comments? I saw you have made some commits (editing file paths), so I pulled the repo - that has brought in the change that you made to the notebook file (now named general-exploration-landcover_io.ipynb) but not your suggested edits in the notebook itself. I then realised that's because I pulled from the main branch, but your edits within the notebook are on the _reviewround1 branch with an outstanding PR that needs to be merged to main.

As you noted above, you are editor so have responsibility for approving PRs. Do I just checkout the _reviewround1 branch, make edits, and submit PRs that you (as editor) then deal with merging into main? (What if a reviewer doesn't like my edit - which links back to my first set of questions).

Thanks!

acocac commented 2 years ago

Hi @acocac Yes, happy to help develop the submission guidelines, mainly by asking questions to help my understanding of the review process! (and then I can suggest text later).

Thanks for your great feedback and for raising these concerns about the current documentation. Let me address your questions below. I just opened issue #115. You've flagged very important issues in the current documentation. I'll address them in a new PR, and then ask for your review or thoughts according to your availability 🙏

the interaction of the authors [and reviewers?] is facilitated through ReviewNB

Good catch. I have to change it to authors and reviewers.

This would benefit (me at least!) from providing a little more guidance. ReviewNB looks great but I'm not entirely clear on how to use it within the process of deciding on changes (and editing the notebook). For example, I see that I can write responses to comments from reviewers in the text box - should I make a reply about a suggested edit in ReviewNB before making a notebook edit? Or just go ahead and make the notebook edit (see below) as I see fit and then reply? Who clicks the Resolve Conversation button? I assume that's for the reviewer to do once they are happy the comment has been appropriately addressed? How does that link to the editor's responsibility of approving PRs?

It's true the current instructions are vague and confusing 🙈 There is plenty of room for improvement in the proposed guidelines! A dedicated webpage in the EDS book with some visuals or a demo might help both authors and reviewers who aren't familiar with tools such as ReviewNB. For authors, how to proceed will depend on the type of comment. For instance, if it's just a general question, you can reply directly. If it's a change which could improve the code or text, my suggestion is to make the notebook edit and then reply. We haven't set a policy for the Resolve Conversation button, but I think it's more the editor's responsibility. Again, we can add it to the improved version of the guidelines ✏️

Then, the second major issue for me currently: how should I actually make edits to the notebook in response to the reviewer comments? I saw you have made some commits (editing file paths), so I pulled the repo - that has brought in the change that you made to the notebook file (now named general-exploration-landcover_io.ipynb) but not your suggested edits in the notebook itself. I then realised that's because I pulled from the main branch, but your edits within the notebook are on the _reviewround1 branch with an outstanding PR that needs to be merged to main.

You should make edits in the _reviewround1 branch. This means you have to pull the PR branch and not main. When we have substantial changes and approval from authors/reviewers, I as editor will close the PR and merge into main.

As you noted above, you are editor so have responsibility for approving PRs. Do I just checkout the _reviewround1 branch, make edits, and submit PRs that you (as editor) then deal with merging into main? (What if a reviewer doesn't like my edit - which links back to my first set of questions).

You must only make edits in the _reviewround1 branch. If the reviewer doesn't like the edit, authors can revert the commit. If the authors aren't sure how to do it, the editor can assist 🔧

acocac commented 2 years ago

@jamesdamillington please find the proof of the notebook and general actions in #110.

Depending on your response, we expect to release it next Monday or the week after.

alan-turing-institute / environmental-ds-book