ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/
MIT License

Demography components #227

Open davidorme opened 5 months ago

davidorme commented 5 months ago

Description

This is a "not even a WIP" pull request to put up and describe a branch containing some sketchy material for the upcoming work. It includes:

from importlib import resources
import pandas as pd
from pyrealm.community_sketch import Flora, Community

# Load the data from the build data
dpath = resources.files("pyrealm_build_data.community") 
pft_data = pd.read_csv(dpath / "pfts.csv")
inventory = pd.read_csv(dpath / "community.csv")

# Create the objects
flora = Flora(pft_data=pft_data)
community = Community(flora=flora, inventory=inventory)

# A thing has been calculated
community.inventory['height']

https://pyrealm.readthedocs.io/en/demography_components/users/tmodel/canopy.html


codecov-commenter commented 5 months ago

Codecov Report

Attention: Patch coverage is 34.28571%, with 23 lines in your changes missing coverage. Please review.

Project coverage is 94.00%. Comparing base (89f8ff1) to head (1914668).

Files Patch % Lines
pyrealm/community_sketch.py 34.28% 23 Missing :warning:
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop     #227      +/-   ##
===========================================
- Coverage    95.23%   94.00%    -1.23%
===========================================
  Files           28       29        +1
  Lines         1701     1736       +35
===========================================
+ Hits          1620     1632       +12
- Misses          81      104       +23
```

:umbrella: View full report in Codecov by Sentry.

davidorme commented 5 months ago
  • What shape are the dataframes that are used to initialise the Flora and Community classes? I'm concerned that this might make for an API that's quite hard to use as the user will have to know exactly what kind of dataframe to input to use the class.

Yup - I think that's just something we'll have to document really carefully. It's an unavoidably complex part of the setup. We could use alternative formats (TOML, maybe) but I don't think that would solve much. The __init__ method should do a ton of validation.
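As a sketch of what that up-front validation might look like: the column names here are an illustrative subset only, not the real PFT schema, and this Flora is a hypothetical stand-in for the class in the sketch module.

```python
import pandas as pd

# Illustrative subset only: the real PFT table has many more columns.
REQUIRED_PFT_COLUMNS = {"name", "a_hd", "h_max"}


class Flora:
    """Hypothetical sketch of heavy validation in __init__."""

    def __init__(self, pft_data: pd.DataFrame) -> None:
        # Fail early and loudly if the dataframe shape is wrong
        missing = REQUIRED_PFT_COLUMNS - set(pft_data.columns)
        if missing:
            raise ValueError(f"pft_data missing columns: {sorted(missing)}")
        if pft_data["name"].duplicated().any():
            raise ValueError("Duplicate PFT names in pft_data")
        self.pft_data = pft_data
```

The same checks could equally live in a `from_csv` classmethod; the point is that a user with a malformed dataframe gets a clear message rather than a downstream KeyError.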

I've looked at pydantic before, though I can't remember whether that was on pyrealm or a different project. At the time, I could only cope with one of mypy or pydantic.

  • How does time fit into this model, is it calculating the geometry, GPP etc at a given instant?

Really good question. The model basically fits the geometry and canopy for a given instant, given the diameter at breast height (DBH) of each individual in the cohorts/community. GPP is estimated over a period of time and can then be used to calculate the increase in DBH needed to grow all the cohorts. At the moment, that is very much a stepped process (using, for example, GPP over a month), but ultimately it will be more accurate to allow some kind of integration scheme that provides continuous growth. We have a similar implementation for that in the Plant-FATE model, but that is down the line.
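The stepped process described above might be sketched with toy numbers like this. Only the T model height equation is real; the allocation step, parameter values and GPP figure are placeholders, not pyrealm values.

```python
import numpy as np

# Placeholder height-diameter parameters for a single PFT
A_HD, H_MAX = 116.0, 25.33


def tmodel_height(dbh: np.ndarray) -> np.ndarray:
    # T model height: H = H_max * (1 - exp(-a_hd * D / H_max))
    return H_MAX * (1 - np.exp(-A_HD * dbh / H_MAX))


def grow_one_step(dbh: np.ndarray, gpp: float) -> np.ndarray:
    # Toy allocation: convert the period's GPP into a DBH increment.
    # The real model would allocate via the T model mass derivatives.
    return dbh + 1e-4 * gpp


dbh = np.array([0.10, 0.25, 0.40])  # stem diameters in metres
for _ in range(12):  # e.g. twelve monthly GPP estimates
    dbh = grow_one_step(dbh, gpp=5.0)
heights = tmodel_height(dbh)
```

A continuous scheme would replace the loop with an ODE solver integrating the growth rate, which is where the Plant-FATE comparison comes in.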

  • Have I understood correctly from the docs that the relationship between community and location is one-to-one? If so, allowing multiple locations within one community class may be confusing.

Yes. That is correct. I also agree with you. The downside is that some of the calculations can be run across communities without difficulty (just larger np.arrays) which will be more efficient, but then other parts are location specific (total canopy area, environmental conditions). I think by far the easiest place to start is to have multiple locations as a list of Community objects, each of which contains the cohorts for a single location.

If that proves to be a slow part of the process, we know some ways that we might accelerate it, but at some cost to computational simplicity and readability.
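A minimal sketch of the list-of-Community layout described above. The fields and the basal area summary are hypothetical stand-ins, just to show the shape of the API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Community:
    """Hypothetical minimal community: the cohorts for one location."""

    cell_id: str
    dbh: np.ndarray  # one entry per cohort

    def total_basal_area(self) -> float:
        # A location-specific summary over this community's cohorts
        return float(np.sum(np.pi * (self.dbh / 2) ** 2))


# One Community per location; per-location calculations loop over the list
communities = [
    Community("cell_0", np.array([0.1, 0.3])),
    Community("cell_1", np.array([0.2])),
]
totals = [c.total_basal_area() for c in communities]
```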

AmyOctoCat commented 5 months ago

I think if we were to use something like the existing types from virtual_ecosystem, this could potentially reduce the difficulty around documenting inputs (we could provide a json schema in the documentation, and the class structure pretty much does the documentation for you) and make validation a bit easier. If we were to use pydantic with those I think we would get quite a lot of validation and deserialisation for free. I've not used it before, but it seems to get very positive reviews and apparently works well with mypy.

As an initial suggestion, I experimented with the calculate_geometry method, switching to classes based on those from virtual_ecosystem and TModelTraits (copied over with an extra field or two) to see how it might play out. Personally, I think it improves the readability (disclaimer: I haven't tried running this yet). I'm not sure what the difference would be performance-wise, but I think most of the operations here are fairly cheap.


from dataclasses import dataclass

import numpy as np

from pyrealm.community_sketch import Flora

@dataclass
class PlantFunctionalType:
    name: str
    a_hd: float
    ca_ratio: float
    h_max: float
    lai: float
    par_ext: float
    resp_f: float
    resp_r: float
    resp_s: float
    rho_s: float
    sla: float
    tau_f: float
    tau_r: float
    yld: float
    zeta: float

@dataclass
class Cohort:
    dbh: float
    pft: str
    n: int
    cell_id: str

@dataclass
class CalculateGeometryResult:
    cohort: Cohort
    height: float
    crown_area: float
    crown_fraction: float
    mass_stem: float
    mass_fol: float
    mass_swd: float

class Community:
    def __init__(self, flora: Flora, cohorts: list[Cohort]) -> None:
        self.flora = flora
        self.cohorts = cohorts

    def calculate_geometry(self) -> list[CalculateGeometryResult]:
        results = []

        for cohort in self.cohorts:
            pft = self.flora.get(cohort.pft)

            # Height of tree from the T model: H = H_max * (1 - exp(-a_hd * D / H_max))
            height = pft.h_max * (1 - np.exp(-pft.a_hd * cohort.dbh / pft.h_max))

            # Crown area of tree, Equation (8) of Li ea.
            crown_area = ((np.pi * pft.ca_ratio / (4 * pft.a_hd)) * cohort.dbh * height)

            # Crown fraction, Equation (11) of Li ea.
            crown_fraction = height / (pft.a_hd * cohort.dbh)

            # Masses
            mass_stem = (np.pi / 8) * (cohort.dbh ** 2) * height * pft.rho_s
            mass_fol = crown_area * pft.lai * (1 / pft.sla)
            mass_swd = crown_area * pft.rho_s * height * (1 - crown_fraction / 2) / pft.ca_ratio

            results.append(
                CalculateGeometryResult(cohort, height, crown_area, crown_fraction, mass_stem, mass_fol, mass_swd))

        return results

I agree in terms of having multiple community objects, I think the improved readability should take priority in that case.

davidorme commented 5 months ago

I think something like that could work really well. The canopy calculation is a property of a whole community, rather than an individual cohort, so I'm wondering if it could just be methods within Community. On balance, though, it is complex enough that having a canopy module that defines a Canopy class, and then just having the attribute Community.canopy: Canopy, is probably cleaner and better for unit testing.
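A sketch of that composition, with entirely hypothetical field names; the point is only that the canopy state hangs off the community as a separate, separately testable object.

```python
from dataclasses import dataclass, field


@dataclass
class Canopy:
    """Hypothetical canopy state computed for a whole community."""

    # e.g. heights at which successive canopy layers close
    layer_closure_heights: list = field(default_factory=list)


@dataclass
class Community:
    cell_id: str
    # Each Community owns its own Canopy instance
    canopy: Canopy = field(default_factory=Canopy)


community = Community(cell_id="cell_0")
community.canopy.layer_closure_heights.append(18.4)
```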

davidorme commented 5 months ago

I'm leaning away from having a Cohort class at all. It's a neater abstraction of the individual rows in the community data but I can't immediately see that having that abstraction gains anything. Almost everything within a community either must be or can be applied using array calculations across all cohorts.
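To illustrate the array-based alternative: a sketch, with made-up parameter values, of the same height and crown fraction calculations run across all cohorts at once rather than via per-Cohort objects.

```python
import numpy as np

# Per-cohort values held as parallel arrays, not per-row objects;
# the PFT parameters are broadcast out to one value per cohort
dbh = np.array([0.10, 0.25, 0.40])
a_hd = np.array([116.0, 116.0, 84.0])
h_max = np.array([25.33, 25.33, 15.33])

# Single array expressions replace the per-cohort loop
height = h_max * (1 - np.exp(-a_hd * dbh / h_max))
crown_fraction = height / (a_hd * dbh)
```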

AmyOctoCat commented 5 months ago

I'm leaning away from having a Cohort class at all. It's a neater abstraction of the individual rows in the community data but I can't immediately see that having that abstraction gains anything. Almost everything within a community either must be or can be applied using array calculations across all cohorts.

I think there's value in the abstraction of a cohort from the point of view of understanding what the input data needs to be like, particularly for someone who's new to the library. Whether or not the cohort class is used, a user will still need to understand conceptually what each row in the community array represents in order to use the library with their own data. I think representing that as a class is useful because it forces that concept to be solidified.