WISE-Developers / Project_issues

This handles incoming tickets like bugs and feature requests
GNU Affero General Public License v3.0

[WISE FR]: More stats in multi-band TIF files #131

Open RobBryce opened 2 years ago

RobBryce commented 2 years ago

Contact Details

rbryce@heartlandsoftware.ca

What is needed?

Currently, multi-band TIF files contain:

The request here is to add bands for standard deviation and standard error, to improve post-processing analysis. There will be no cost to the WISE team for this work; it will be a contribution.

How will this improve the project or tool?

More complete statistics available to post-analysis tools.

TODO

BadgerOnABike commented 1 year ago

I'll continue to rail that we should be including the median. For many of these variables, the mean is relatively meaningless because the data are not normally distributed.

RobBryce commented 1 year ago

@BadgerOnABike Summary of a private discussion: the concern with providing the median from within the WISE executable is the memory required to calculate it. Specifically, if there are 100 sub-scenarios to combine, then we need to store 100 layers of sub-scenario data, for each output we want a median for, to be able to compute it: a lot of memory, tens of GBs of RAM.

In addition, we have tools to combine these multi-layer grids later on, so that scenarios (not just sub-scenarios) can be combined in meaningful ways. At this point, these multi-band TIFs do not store any individual sub-scenario outputs, just the results of the combinations. To combine many multi-layer grids and then compute a true median (not a median of medians), all sub-scenario outputs need to be stored for later re-analysis. That would produce very large multi-band TIF files to recombine. However, it would provide all the data needed to look at each sub-scenario individually (which somewhat defeats the purpose of a sub-scenario).
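A back-of-the-envelope sizing sketch shows why holding every sub-scenario layer for an exact median gets expensive (the grid dimensions here are hypothetical, chosen only for illustration):

```python
def median_memory_bytes(n_subscenarios, rows, cols, n_outputs, bytes_per_value=4):
    """Bytes needed to hold every sub-scenario layer in memory so an
    exact per-cell median could be computed later (hypothetical helper)."""
    return n_subscenarios * rows * cols * n_outputs * bytes_per_value

# 100 sub-scenarios on a 10,000 x 10,000 grid, one output statistic,
# 32-bit floats: 40 GB of RAM just to make an exact median possible.
needed = median_memory_bytes(100, 10_000, 10_000, 1)
print(needed / 1e9, "GB")  # → 40.0 GB
```

Each additional output that needs a median multiplies this figure again, which is where the "tens of GBs" concern comes from.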

BadgerOnABike commented 1 year ago

Perhaps we need to pore over the code of Burn-P3, because it allows us to acquire any percentile we desire and doesn't consume all available RAM until many hundreds of thousands of runs are being completed. Additionally, if we aren't providing a full suite of statistics with the sub-scenarios, their utility arguably goes down: we cannot determine anything about the distribution from the mean, since the normality assumption of that statistic isn't met. That is, of course, assuming that BP3 is calculating it correctly and isn't deriving it some other way?

RobBryce commented 1 year ago

From memory (I audited that code in the past, and it may have been updated since):

BurnP3 uses a variable-sized (auto-sized) array per grid cell, so it loses the information of which fire provided which value (which isn't important for median calculations). This potentially reduces overall memory usage, but at the expense of more overhead per cell, slower insertions of data, and a whole lot of memory fragmentation. It is a viable approach for in-memory work, though, particularly if fires are relatively small with respect to the overall dimensions of your plot. And I recall reporting an issue where the median for RAZ was not being calculated correctly (it was treated as linear data rather than circular, though I don't recall whether you care about the median RAZ).
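The per-cell variable-sized array approach described above can be sketched roughly like this (illustrative Python, not the actual BurnP3 code; the cell keys and values are made up):

```python
import statistics
from collections import defaultdict

# Each burned cell accumulates one value per fire. Which fire produced
# which value is discarded -- fine for a median, but at scale the many
# small growing lists cost per-cell overhead and memory fragmentation.
cell_values = defaultdict(list)

def record(cell, value):
    cell_values[cell].append(value)

def cell_median(cell):
    return statistics.median(cell_values[cell])

record((10, 12), 4000.0)
record((10, 12), 150.0)
record((10, 12), 900.0)
print(cell_median((10, 12)))  # → 900.0
```

Memory grows only for cells that actually burn, which is why this works well when fires are small relative to the plot.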

WISE doesn't limit its stats to the closed set that BurnP3 does. And the BurnP3 approach would need a non-standard file format to export this data for calculating medians of combined datasets.

You don't need to retain the complete dataset to calculate any of mean, standard deviation, or standard error.

RobBryce commented 1 year ago

This work (standard deviation and standard error) is ready to merge back in and ready for evaluation. It is an Alberta Parks contribution.

spydmobile commented 1 year ago

@RobBryce what is the status of this work? Is it complete?

RobBryce commented 1 year ago

Standard deviation and standard error stats were added and received a lot of testing. Outside validation wouldn't hurt once others are generating sub-scenarios. No work on median values has been performed, since that wasn't part of the original ticket text and the budget for this work was limited to standard deviation and standard error.

BadgerOnABike commented 1 year ago

Am I correct in thinking this applies when fires are burned iteratively and we are calculating the mean by pixel across a range of weather parameters, or is the mean/SD/SE coming from another place? I'm unclear how I would perform testing of these metrics, though I am interested in doing so.

RobBryce commented 1 year ago

The output of a scenario TIF file (with sub-scenarios) is a multi-band TIF; we only added a few more bands for SD/SE. Existing bands were listed above. An export from a regular scenario is single-band; an export from a scenario with even one sub-scenario is multi-band. Sub-scenarios may differ in weather or in other parameters, but for our work it is typically iterating through weather.

BadgerOnABike commented 1 year ago

I get that part; I'm curious what is being averaged here. Multiple scenarios/sub-scenarios is how I'm understanding it. Is that correct?

RobBryce commented 1 year ago

Yes, for whatever stat is requested.

BadgerOnABike commented 1 year ago

Alright, then I'd be able to replicate fairly easily. I presume that for the means you're simply summing and then dividing by the number of scenarios at the end.

For standard deviation you would require all the layers, to subtract each from the mean. Wouldn't we then have the same data required for the median?

RobBryce commented 1 year ago

We are using Welford's method, described at https://stackoverflow.com/questions/895929/how-do-i-determine-the-standard-deviation-stddev-of-a-set-of-values, which also links to https://www.johndcook.com/blog/standard_deviation/. We don't need to store the complete dataset for these stats. This way, the change in memory consumption is known, whereas if we were storing all data from all simulations, we could not necessarily predict memory consumption.

BadgerOnABike commented 1 year ago

Interesting. I do see some methods to calculate a rolling median as well; I'll continue my search. Until then, I think what we have should work. I guess I'll find out when I go to test them again!
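For reference, one common "rolling median" formulation is the two-heap running median. Note that it still retains every value ever inserted, so it speeds up repeated median queries but does not solve the memory problem discussed above; constant-memory approaches (such as the P² algorithm) only approximate the percentile. A sketch:

```python
import heapq

class RunningMedian:
    """Exact running median via two heaps. Caveat: every value is kept,
    so unlike Welford's method this does not bound memory."""

    def __init__(self):
        self.lo = []  # max-heap (stored negated) holding the smaller half
        self.hi = []  # min-heap holding the larger half

    def add(self, x):
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

rm = RunningMedian()
for x in [5, 1, 9, 3]:
    rm.add(x)
print(rm.median())  # → 4.0
```

Each insert is O(log n) and the median is available in O(1), which is the appeal when the median is queried repeatedly as data streams in.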

spydmobile commented 1 year ago

@RobBryce is the original work (not the median) completed? If so, is this ready for testing? If so, assign it to @BadgerOnABike and label it "Needs Testing". Otherwise this is outstanding, as this was a contribution. Also, this will need some kind of attribution, which we need to resolve before we can close this.

RobBryce commented 1 year ago

The original work has been in use for a while now. Once @BadgerOnABike can run projects, he can validate, or we can provide outputs. Either way, we had to validate it some months ago. @lizzydchappy can provide specifics, but I believe attribution should go to Alberta Parks.

BadgerOnABike commented 1 year ago

Answer to the question of "Does the median matter in HFI"

TLDR: Yes

I summarised multiple decades of Alberta fire weather history in three ways: everything, everything in June, and everything in June at station C3. They all show the same general trend: massively zero-inflated data yielding a negative exponential distribution. This will matter most when running models with stochastic weather information, or when running the system in a mode that produces any kind of probabilistic output.

            Min.  1st Qu.   Median     Mean  3rd Qu.       Max.
All data    0.00    86.28  1920.72  4832.14  7003.20  206482.70
June        0.00    42.06   949.31  4273.83  6263.56  132813.10
June at C3  0.00     8.59   320.28  2273.88  2532.01   47268.11

(A histogram image was attached for each subset.)
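The mean/median gap in zero-inflated, right-skewed data like this can be reproduced with a synthetic sample (toy data, not the Alberta record; the zero fraction and scale are invented for illustration):

```python
import random
import statistics

random.seed(1)
# ~40% zeros, the rest exponentially distributed: a rough stand-in for
# a zero-inflated, negative-exponential HFI distribution.
values = [0.0 if random.random() < 0.4 else random.expovariate(1 / 4000)
          for _ in range(10_000)]

print(statistics.median(values))  # sits far below...
print(statistics.mean(values))    # ...the mean, as in the summaries above
```

With data shaped like this, the mean says little about what a typical cell experiences, which is the argument for reporting the median alongside it.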

RobBryce commented 1 year ago

That's great work. Now we know. :)