CLIMADA-project / climada_python


TropCyclone: frequency_from_tracks() with minor inconsistency #215

Open · simonameiler opened this issue 3 years ago

simonameiler commented 3 years ago

The frequency_from_tracks() function sets the hazard frequencies in the TropCyclone module. I discovered a minor inconsistency for track sets of a given time period that contain single tracks crossing the turn of the year. For example, there are tracks in the WP basin that start in 2018 and continue into 2019. This yields a minor error in hazard.frequency.

One possible workaround is to correct the frequency of that hazard set manually. Generally, frequencies for IBTrACS are set as self.frequency = np.ones(self.event_id.size) / (year_delta * ens_size). When considering historical records only (no TCTracks.calc_perturbed_trajectories), ens_size = 1. Hence, knowing your year_delta, you can adjust the frequency yourself: hazard.frequency = np.ones(hazard.event_id.size) / year_delta
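A minimal sketch of this workaround for a purely historical WP hazard set over 1980-2018, using the method names mentioned in this thread (read_ibtracs_netcdf, set_from_tracks); newer CLIMADA versions may expose these under different names:

```python
import numpy as np

from climada.hazard import TCTracks, TropCyclone

# Historical IBTrACS record only, no perturbed trajectories (ens_size = 1).
tracks = TCTracks()
tracks.read_ibtracs_netcdf(year_range=(1980, 2018), basin='WP')

tc_haz = TropCyclone()
tc_haz.set_from_tracks(tracks)  # frequencies are derived from the track years here

# If a single track runs into 2019, the derived year_delta is 40 instead of the
# intended 39. Overwrite the frequency with the length of the requested record:
year_delta = 2018 - 1980 + 1  # 39 years
tc_haz.frequency = np.ones(tc_haz.event_id.size) / year_delta
```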

chahank commented 3 years ago

What exactly is the inconsistency? I do not understand why ens_size would change something.

simonameiler commented 3 years ago

> What exactly is the inconsistency? I do not understand why ens_size would change something.

It's not about ens_size but about the year_delta. For example, if you read in IBTrACS from 1980-2018, thus covering 39 years worth of tracks, the hazard frequency should be set for each track as 1/39 years. But CLIMADA gets the year_delta wrong if a track crosses the turn of the year into 2019. Then it calculates the frequency as 1/40.

chahank commented 3 years ago

Really? As far as I know, the event date is set from the date of the first data point in the track. Hence, it should not happen that reading IBTrACS from 1980-2018 results in a Hazard with an event in 2019. Could you please post an example of where this happens?

tovogt commented 3 years ago

I don't know whether that's the reason for this issue, but I have to admit that read_ibtracs_netcdf filters for the year range before discarding incomplete time steps within the track. That means it can happen that a track originates in 2018 (i.e. some agency started reporting on it in 2018), but the user-selected agency only started to report on that track in 2019.
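A self-contained toy illustration of that ordering (not the actual read_ibtracs_netcdf code): a track passes a 2018 year filter even though only its 2019 time steps survive the completeness checks.

```python
import numpy as np
import pandas as pd

# Toy track: the raw IBTrACS entries start in 2018, but the user-selected
# agency only reports valid pressure data from 2019 onwards.
track = pd.DataFrame({
    "time": pd.date_range("2018-12-31 12:00", periods=6, freq="6h"),
    "pressure": [np.nan, np.nan, 990.0, 985.0, 980.0, 975.0],
})

# Step 1 - year filter: the raw track starts in 2018, so it passes a
# year_range=(1980, 2018) selection.
passes_year_filter = track["time"].dt.year.min() <= 2018

# Step 2 - completeness filter: only the 2019 time steps survive.
valid = track.dropna(subset=["pressure"])

print(passes_year_filter)              # True
print(valid["time"].dt.year.unique())  # [2019]
```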

However, I don't think we should change this. It somehow feels right: the storm did originate in 2018, we just don't have sufficient data to describe what it did in 2018. It might even be that we could describe what it did in 2018, but we still discard the 2018 part because it doesn't satisfy CLIMADA's minimum requirements for data completeness (must have pressure, wind, and location).

bguillod commented 3 years ago

The method frequency_from_tracks does not use the event dates of the TropCyclone (Hazard) object, but the time coordinates of the TCTracks object's data list. It therefore uses the minimum and maximum year over all time steps of all tracks:

```python
year_max = np.amax([t.time.dt.year.values.max() for t in tracks])
year_min = np.amin([t.time.dt.year.values.min() for t in tracks])
```
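Spelled out with illustrative numbers (a paraphrase of the behaviour described above, not the verbatim CLIMADA source):

```python
import numpy as np

# Illustrative values: a 1980-2018 record where one track spills into 2019,
# 500 events, historical data only (ens_size = 1).
year_min, year_max = 1980, 2019
num_events, ens_size = 500, 1

# The year range is taken from the earliest and latest time step of any track,
# so the single track running into 2019 turns 39 years into 40.
year_delta = year_max - year_min + 1                       # 40 instead of 39
frequency = np.ones(num_events) / (year_delta * ens_size)  # 1/40 per event
```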

I fully agree that this is an issue and it would be better to have a stricter way of defining the number of years.

Also, I think another disturbing fact is that if you read data for the Southern Hemisphere, you probably don't want to cut off a season in the middle, so ideally you'd want a date range such as 1980-07-01 to 2020-06-30.

I'm not sure what the best solution is, but one option would be that the TCTracks object contains an attribute indicating the number of years that have been loaded, and set_from_tracks uses that to set the frequency. However, the method would then only work via a TCTracks object, which will limit its applicability in the future... Although that's the same with the current method frequency_from_tracks.

chahank commented 3 years ago

Thanks for the clarification @simonameiler @tovogt @bguillod !

The issue is interesting. Allocating the frequency based on a variable number of years is in many ways a flawed approach. This is probably worth a larger discussion. But to keep this issue on track (:D), can we really scientifically justify that this one-year fluctuation is relevant?

bguillod commented 3 years ago

Let's imagine a case that currently (admittedly) doesn't exist in CLIMADA, but that could exist with the future API: you query all TropCyclone events affecting a given country. If you load data for 1981-2020 (40 years), there could be no event in that country in 1981-1983, 2019, and 2020, hence the min/max years of your event set would be 1984-2018, i.e. 35 years instead of 40. You wouldn't want to estimate frequencies from that, but rather from the full 40 years.

This case doesn't exist as such now in CLIMADA, but it will likely exist at some point, right? In that case you couldn't set frequencies as is currently done.
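A toy illustration of that bias, with made-up event years for a single country:

```python
import numpy as np

# Hypothetical years of the events left after filtering for one country,
# drawn from a 1981-2020 (40-year) query.
event_years = np.array([1984, 1987, 1991, 1998, 2005, 2012, 2018])

years_loaded = 2020 - 1981 + 1                                 # 40 years queried
years_from_tracks = event_years.max() - event_years.min() + 1  # 35 years inferred

freq_intended = 1 / years_loaded        # 0.025 per year
freq_inferred = 1 / years_from_tracks   # ~0.0286 per year, ~14% too high
print(freq_intended, freq_inferred)
```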

tovogt commented 3 years ago

@bguillod That's a very good example!

Regarding the seasons in the southern hemisphere: We could easily add an option for read_ibtracs_netcdf to respect the hemisphere-specific seasonality. I think it's mostly a question about what to use as the default and how to specify this on the API level (something like specifying "left-closed", "right-open" etc.). Feel free to create an issue or, even better, a PR with your ideas.

bguillod commented 3 years ago

> Regarding the seasons in the southern hemisphere: We could easily add an option for read_ibtracs_netcdf to respect the hemisphere-specific seasonality. I think it's mostly a question about what to use as the default and how to specify this on the API level (something like specifying "left-closed", "right-open" etc.). Feel free to create an issue or, even better, a PR with your ideas.

My suggestion would be to have an argument date_range instead of year_range. I've implemented that on my end with our database: date_range is a tuple whose elements are either integers for years (as year_range currently), date strings in 'YYYY-MM-DD' format, or datetime.datetime instances. The drawback is that it forces the user to think more. A simpler option would be something like n_seasons, which would pick the past n_seasons seasons (depending on the hemisphere of the basin); albeit that wouldn't easily work when you don't specify a basin, or it would then have to filter differently for different basins.
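A minimal sketch of how such a flexible date_range argument could be normalised; this is a hypothetical helper, not part of the CLIMADA API:

```python
import datetime as dt

def normalize_date_range(date_range):
    """Turn a (start, end) tuple of ints, 'YYYY-MM-DD' strings or
    datetime.datetime objects into a pair of datetime.date bounds."""
    def to_date(value, is_start):
        if isinstance(value, int):
            # Bare years are interpreted as whole calendar years.
            return dt.date(value, 1, 1) if is_start else dt.date(value, 12, 31)
        if isinstance(value, str):
            return dt.datetime.strptime(value, "%Y-%m-%d").date()
        if isinstance(value, dt.datetime):
            return value.date()
        raise TypeError(f"Unsupported date_range element: {value!r}")

    start, end = date_range
    return to_date(start, True), to_date(end, False)

# Whole calendar years, or a Southern Hemisphere season window:
print(normalize_date_range((1980, 2018)))
print(normalize_date_range(("1980-07-01", "2020-06-30")))
```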

chahank commented 3 years ago

My question: is this effort really relevant and worth it? Approximating the frequency by the inverse of the number of years in the chosen record is at best a very crude approximation. It can also lead to very weird behaviour.

Say I first take the years 1981-2000 (20 years, frequency 1/20), which contain 20 storms per year. Then I take 1981-2001, where 2001 contributes only one additional storm. So the total number of storms changes only marginally, but the frequency of every storm changes from 1/20 to 1/21. I do not see why this would be consistent. Hence, having +-1 year due to seasons or tracks going over the year-end might not be relevant at all.
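The effect in numbers (a small sketch of the example above, with made-up storm counts):

```python
# 1981-2000: 20 years with 20 storms per year
n_storms_a, n_years_a = 20 * 20, 20
freq_a = 1 / n_years_a                   # 0.05 per storm
expected_annual_a = n_storms_a * freq_a  # 20 storms/year

# 1981-2001: one extra storm, but one extra year in the denominator for everybody
n_storms_b, n_years_b = n_storms_a + 1, 21
freq_b = 1 / n_years_b                   # ~0.0476 per storm
expected_annual_b = n_storms_b * freq_b  # ~19.1 storms/year

print(expected_annual_a, expected_annual_b)  # 20.0 vs ~19.1
```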

tovogt commented 3 years ago

@bguillod As I said, if you are interested in a fix for the seasonality thing you have to open an issue or PR. Otherwise, that's not going to happen.

@chahank Suppose that you want to estimate the increase in damages from TCs from 1981-2000 to 2081-2100. Suppose the total damage is $1000 for 1981-2000 and $1030 for 2081-2100, and you have a confidence range of $100 width around the projection. Then the two values are not statistically different from each other, because $1000 is contained in the confidence range [$980, $1080]. We would conclude that the 3% increase is not statistically significant.

Now, suppose you use CLIMADA to express that damage in terms of "average annual damage" and it gets the frequency wrong for 1981-2000 (21 years), but right for 2081-2100 (20 years). Then the average annual damage for 1981-2000 is (supposedly) $47.62 and for 2081-2100 it's $51.50, an 8% increase in annual damages. The confidence range translates to [$49, $54]. So this time, the two values are statistically different from each other. We would conclude that there is a statistically significant increase in damages of 8%.
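The same arithmetic as a short sketch (values taken from the example above):

```python
damage_hist, damage_future = 1000.0, 1030.0
conf_low, conf_high = 980.0, 1080.0  # +-50 around the projected total damage

# Absolute damages: 1000 lies inside [980, 1080] -> no significant change.
print(conf_low <= damage_hist <= conf_high)  # True

# Average annual damages with the wrong (21) vs correct (20) year count:
aad_hist = damage_hist / 21                        # ~47.62
aad_future = damage_future / 20                    # 51.50
aad_low, aad_high = conf_low / 20, conf_high / 20  # [49.0, 54.0]

# 47.62 falls outside [49, 54] -> the ~8% increase now looks significant.
print(aad_low <= aad_hist <= aad_high)  # False
```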

chahank commented 3 years ago

@tovogt Thanks for the example. It is very clear, and while I do not disagree, I have to point out the following, which bothers me: my issue is with your assessment of 'correct' and 'wrong' frequencies. The confidence range of the projections would have to include the error in frequency, which is ignored in the above example, and that error cannot be estimated from the year range alone. Say that 1981 was an absolutely exceptional year, with an anomalously high number of events. Then comparing 1981-2000 with 2081-2100 and comparing 1982-2000 with 2082-2100 would yield completely different results, as the frequencies are skewed by the exceptional year 1981.

I would argue that for no hazard do we have enough historical data to actually derive the long-term average that would justify a flat frequency. Thus, how relevant is the difference in the frequency compared with the variance introduced by the choice of year range?

A possible solution would be to try to define a statistically robust fixed reference year range, together with the calibrated frequencies.

tovogt commented 3 years ago

So, you basically say that we can't make any statements about historical trends in TCs at all because we don't have enough data. That's probably true. There are studies of US hurricanes in 1910-2010 saying that whether there is a significant trend in damages or not depends only on whether the year 2005 is included in the analysis.

Still, these kinds of analyses are published all the time. You can say that they are bullshit anyway and we shouldn't waste our time with them. Or, you could say we try not to make them even worse by introducing these kinds of inaccuracies. I mean, as a mathematician, I tend to agree with your view. But from what I know about CLIMADA, and from what @davidnbresch said in the past about the target group of users for CLIMADA, I think that this decision might affect quite some research output.

I actually tend to prefer not to have any default frequency values to force the users to decide for their particular use case.

bguillod commented 3 years ago

I agree with @tovogt here. And I think we need to separate two things:

  1. Assigning a probability to a historical event with pure historical data is tricky, hence not precise. It's only a rough estimate, but it's the best first guess (1/n_years). Everything in life (to exaggerate a bit) is based on that, despite it not being exact, because nobody can tell you the true probability of an event.
  2. Using the approach from (1) but making a mistake in estimating n_years is what it is: a mistake. It should be corrected if it can happen, no matter whether other uncertainties come into play that might in some cases lead to larger errors than this one.

Would you agree?

chahank commented 3 years ago

I agree with:

  1. Not defining default frequencies would be good. This is similar to my proposal above to define them as well as we can only for one specific, calibrated range of IBTrACS years.
  2. In principle it is 'a mistake'. The question, however, is how far you want to go for the sake of fulfilling an empty statement, which we first have to define, which is quite some work. For example, which is the "correct" statement: a) "The frequencies are the inverse of the number of years between the first track in the data set and the last", b) "The frequencies are the inverse of the number of years specified by the user when retrieving the data", or c) "The frequencies are the inverse of the number of North/South seasons in the year range specified by the user"? Also, would a track that runs from 23:59 on 31.12.2000 until 22:00 on 03.01.2001 count for 2000 or for 2001? You see, there is a very long tail of questions that can be asked that would lead to 'mistakes', but that are in my opinion all utterly irrelevant. I would also argue that we can simply define the current implementation as the correct way, and that it is as good as any other.

Also, @bguillod, I do not at all agree with the statement "Everything in life (to exaggerate a bit) is based on that", but that is another discussion.

chahank commented 3 years ago

@tovogt I think I understand what you mean by "But from what I know about CLIMADA, and from what @davidnbresch said in the past about the target group of users for CLIMADA, I think that this decision might affect quite some research output", but I am not sure. Could you expand a bit, please?

tovogt commented 3 years ago

So, when we talked about the target group of users of CLIMADA, it seemed to me like we have people who don't want to look into the source code, not even into the footnotes of the docs, but rather have it working right away, throw in some data (or not even that), and just get some projections for climate risks. They might be using Python for the first time and have never read a paper about the particular kind of hazard they are analyzing with CLIMADA. Just some people from the economic/insurance/policy/... sector who happen to publish a paper or write a report for some NGO and don't want to spend more than a few days on the technicalities of CLIMADA.

So that's all I want to say. I don't want to say that this kind of approach is bad or anything. I believe that there are a lot of people out there with this kind of workflow. And there are big decisions and reports being produced on those grounds. It just means that these people would never stop to double-check whether the significant increase in damages might come from some 1981-2000 or 1981-2001 thing (we would probably call this a sensitivity analysis).

chahank commented 3 years ago

Thanks for the details! I agree, we need some default that can be used as is. But this aspect will probably be addressed for the most part by the API. The users you are talking about will then not even have to create their own Hazards.