ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/
MIT License

Use a statistical model fitted to the original dataset to synthesize data #179

Open tztsai opened 7 months ago

tztsai commented 7 months ago

Description

A statistical model (with a linear component and several seasonal components with different periods) was fitted to the large LFS dataset by the script pyrealm_build_data/synth_data.py, and its coefficients are stored in pyrealm_build_data/data_model_params.nc. The model can later be used to reconstruct the original dataset with decent accuracy, replacing the 1.8 GB dataset with a 0.3 MB file of coefficients.
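For context, a minimal sketch of this kind of fit is given below. It assumes a design matrix of an intercept, a linear trend and sine/cosine pairs for a few periods; the file names, variable name, epoch and periods are illustrative, not the actual values used by synth_data.py.

```python
# Illustrative sketch only: fit an intercept + linear trend + seasonal
# harmonics to one variable of the large forcing dataset, keeping just the
# per-cell coefficients. Names, paths, epoch and periods are assumptions.
import numpy as np
import xarray as xr

ds = xr.open_dataset("large_forcing.nc")  # hypothetical 1.8 GB LFS file
y = ds["temp"].values.reshape(len(ds["time"]), -1)  # (time, cell), time first

# Days since a fixed epoch, so reconstruction can reuse the same reference
EPOCH = np.datetime64("2000-01-01")
t = (ds["time"].values - EPOCH) / np.timedelta64(1, "D")

# Design matrix: intercept, trend, and sin/cos pairs for each period (days)
periods = [365.25, 182.625, 1.0]
cols = [np.ones_like(t), t]
for p in periods:
    cols.append(np.sin(2 * np.pi * t / p))
    cols.append(np.cos(2 * np.pi * t / p))
X = np.column_stack(cols)  # (time, feature)

# Per-cell least-squares coefficients: shape (feature, cell)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# A few numbers per cell are all that needs to be stored
xr.DataArray(coef, dims=("feature", "cell")).to_netcdf("data_model_params.nc")
```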

A test file was also added in tests/regression/data/ to check the quality of the reconstructed dataset by asserting that its R2 score against the original is at least 0.85 (1.0 would be a perfect reconstruction).
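The shape of such a check, as a rough sketch (the actual test lives in tests/regression/data/; the paths, variable name and reconstruct() helper here are placeholders):

```python
# Rough sketch of a regression test for the reconstruction quality.
# Paths, the variable name and the reconstruct() helper are placeholders;
# the test skips when the original LFS dataset has not been checked out.
from pathlib import Path

import pytest
import xarray as xr
from sklearn.metrics import r2_score

from synth_data import reconstruct  # hypothetical import of the rebuild helper

LFS_FILE = Path("pyrealm_build_data/large_forcing.nc")  # hypothetical path


@pytest.mark.skipif(not LFS_FILE.exists(), reason="LFS dataset not available")
def test_reconstruction_quality():
    original = xr.open_dataset(LFS_FILE)
    rebuilt = reconstruct(original.indexes["time"])

    score = r2_score(original["temp"].values.ravel(), rebuilt.ravel())
    assert score >= 0.85  # 1.0 would be a perfect reconstruction
```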

codecov-commenter commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (caab98b) 93.57% compared to head (c7fa14d) 93.57%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop     #179     +/-   ##
==========================================
  Coverage    93.57%    93.57%
==========================================
  Files           28        28
  Lines         1464      1464
==========================================
  Hits          1370      1370
  Misses          94        94
```

:umbrella: View full report in Codecov by Sentry.

tztsai commented 6 months ago

> Can you fix the QA issues, please.
>
> I'm not sure about this PR. I think the implementation and the time series model are very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.
>
> The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data, by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.

I have fixed the QA now. I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset. After that, the profiling can just call the reconstruct function without any use of LFS. In addition, the previous coefficients file stored features for each time step of the original dataset. If the reconstruct function instead calculates these features dynamically from a given datetime index, the file only needs to occupy ~300KB.
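A sketch of what computing the features dynamically could look like, assuming the same design matrix as the fit above (the file name, periods and epoch are placeholders and would need to match whatever was used when fitting):

```python
# Sketch: rebuild the design matrix on the fly for any datetime index and
# apply the stored coefficients, so no per-timestep features are saved.
# File name, periods and the epoch are placeholders and must match the fit.
import numpy as np
import pandas as pd
import xarray as xr


def reconstruct(times: pd.DatetimeIndex, params_file="data_model_params.nc"):
    coef = xr.open_dataarray(params_file).values  # (feature, cell)

    # Days since the same fixed epoch used when fitting
    epoch = pd.Timestamp("2000-01-01")
    t = np.asarray((times - epoch) / pd.Timedelta(days=1), dtype=float)

    periods = [365.25, 182.625, 1.0]
    cols = [np.ones_like(t), t]
    for p in periods:
        cols.append(np.sin(2 * np.pi * t / p))
        cols.append(np.cos(2 * np.pi * t / p))
    X = np.column_stack(cols)  # (time, feature)

    return X @ coef  # reconstructed values, shape (time, cell)
```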

davidorme commented 6 months ago

> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

tztsai commented 6 months ago

> > I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.
>
> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, this test file will be skipped because we have removed the step of fetching LFS files. This seems sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can later be used for profiling in the CI workflow without needing to run the quality check again.

davidorme commented 6 months ago

> > > I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.
>
> > Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.
>
> Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, this test file will be skipped because we have removed the step of fetching LFS files. This seems sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can later be used for profiling in the CI workflow without needing to run the quality check again.

Yes, but this requires that the whole repo retains the use of git lfs so that this flow can be used, and we had deliberately reduced the dataset size to avoid using git lfs.

I think the point here is that we need a block of data to use in profiling:

tztsai commented 6 months ago

> > > > I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.
>
> > > Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.
>
> > Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, this test file will be skipped because we have removed the step of fetching LFS files. This seems sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can later be used for profiling in the CI workflow without needing to run the quality check again.
>
> Yes, but this requires that the whole repo retains the use of git lfs so that this flow can be used, and we had deliberately reduced the dataset size to avoid using git lfs.

A simple solution for now would be to validate the generated dataset against the reduced (1-year) dataset, which does not require LFS.
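As a sketch, that could be as simple as comparing the reconstruction over the one-year window against the reduced file already in the repo (the path, variable name and reconstruct() helper are placeholders):

```python
# Sketch: validate the reconstruction against the reduced one-year dataset
# shipped in the repo, so no LFS checkout is needed. The path, variable
# name and reconstruct() are placeholders.
import xarray as xr
from sklearn.metrics import r2_score

from synth_data import reconstruct  # hypothetical rebuild helper

reduced = xr.open_dataset("pyrealm_build_data/reduced_one_year.nc")  # hypothetical
rebuilt = reconstruct(reduced.indexes["time"])

score = r2_score(reduced["temp"].values.ravel(), rebuilt.ravel())
assert score >= 0.85
```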

MarionBWeinzierl commented 6 months ago

When #189 is implemented, the dataset can also be uploaded to Zenodo.

MarionBWeinzierl commented 6 months ago

Another option would be to use the reduced dataset for validity checking.