[Question]: Why is openair getting the plot wrong?

davidcarslaw / openair

Tools for air quality data analysis

https://davidcarslaw.github.io/openair/

GNU General Public License v2.0

305 stars 113 forks source link

[Question]: Why is openair getting the plot wrong? #358

Open GoodLug opened 1 year ago

GoodLug commented 1 year ago

I am using openair to plot wind roses from MIDAS/CEDA (UK Met Office Archive) data. The data is supplied from a csv file with datetime ws wd columns with hourly data for an entire month for a station, in the examples below Isle of Portland July 2021.

I have used three methods to do the plots, as checks for correct plotting: spreadsheet where the data is kept, windrose/matplotlib in python, and openair in R. The openair plot appears to be wrong, despite the underlying frequency counts being correct. In the following image, openair on the left, spreadsheet in the middle and windrose/matplotlib on the right. Not the wind speed bins (and styling) are not the same, but that doesn't matter because the openair error is in the % from each direction:

Note the W and SW winds: in the openair plot, they are both a little over 20%, in the spreadsheet and windrose/matplotlib versions they are different (24% and 19%). The latter two are correct, the former is incorrect.

If I look at the underlying openair data, it does give the correct frequency counts in the freq column, but the Interval4 column, which appears to be the % frequency used in the plot, is wrong. The table below shows the openair table on the left, the histogram as determined by the spreadsheet on the right:

As can be seen, the frequency counts agree, but the frequency percentages don't.

The code I am using is (temp <- gets the data which is then printed via tmp$data):

tmp <- windRose(portland2, angle=45, width=0.2, key.footer="knots", angle.scale = 022, main="Isle of Portland Jul 2021") print(tmp$data)

Any ideas as to why this is happening?

jmclaren-17 commented 1 year ago

Is it something to do with the way each package deals with calm winds?

GoodLug commented 1 year ago

Thanks for your quick reply. That is certainly a possibility, I will have a look at the data, and also how calms are defined and dealt with. Strictly speaking, they shouldn't have a direction, but if they are defined as say less that 2 knots, a 1 knot wind will have a direction, albeit a rather meaningless one.

davidcarslaw commented 1 year ago

Thanks for the message. I just tested it on the data set you mention. Code is below. I think the plot is consistent with the data.

Note that on calms, openair assumes the direction bin is zero degrees (360 is for non-calm winds from the north). The associated wind speed associated with calm is zero and like you say, there is no meaningful wind direction.

If there is a problem, we'll make sure it is fixed...It might help to email your actual data.

Thanks David

met <- worldmet::importNOAA("038570-99999", 2021)
library(openair)
windRose(selectByDate(met, month = 7))
tmp <- windRose(selectByDate(met, month = 7), angle=45, width=0.2, key.footer="knots", angle.scale = 022, main="Isle of Portland Jul 2021")
print(tmp$data)
# A tibble: 9 × 10
  default          Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd freqs
  <fct>                <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int>
1 01 July 2021 to…     0          0         0         0     -999   0.3 5.0445      -95.6     2
2 01 July 2021 to…     0.672      2.15      3.76      4.57    45   0.3 5.0445      -95.6    34
3 01 July 2021 to…     1.21       4.30      8.60     14.8     90   0.3 5.0445      -95.6   110
4 01 July 2021 to…     0.806      3.09      4.70      6.05   135   0.3 5.0445      -95.6    45
5 01 July 2021 to…     1.08       4.70      7.26      8.47   180   0.3 5.0445      -95.6    63
6 01 July 2021 to…     0.806      3.76      8.20     17.5    225   0.3 5.0445      -95.6   130
7 01 July 2021 to…     1.34       7.53     11.3      24.3    270   0.3 5.0445      -95.6   181
8 01 July 2021 to…     2.96      13.3      15.9      16.3    315   0.3 5.0445      -95.6   121
9 01 July 2021 to…     0.806      4.70      7.39      7.80   360   0.3 5.0445      -95.6    58

GoodLug commented 1 year ago

David - that's interesting, and looks more consistent with the other plots. I downloaded the data from MIDAS/CEDA (UK) but I have also looked at the NOAA data (US) for UK stations and in the tests I've done it has been the same - but maybe not, In the MIDAS data I have been plotting, there are no ws = 0 values, no wd = 0 values, and four wd = 360 values.

In the interval4 column in your analysis of the NOAA data, the % values are close but not the same as my MIDAS values, but the noticeable thing is that, unlike in my MIDAS data, there are 0.3% calms for all wind directions (! - meaning a total of 0.3 * 8 = 3.4% calms - surely it is 0.3% of all 'directions') so either the data isn't exactly the same or the packages handle the calms differently, either in how they are defined and/or allocated. Openair I believe uses ws or wd = 0 as the definition of calms (and it can't easily be changed), windrose.py I thonk defaults to ws = 0 but it can be changed.

I will have a look at the NOAA data and compare it to the MIDAS data I have.

GoodLug commented 1 year ago

The NOAA site seems to be down at the moment but I do have some previously downloaded data covering Isle of Portland Jul 2021 and the first obvious difference is the wind speeds are floats whereas the MIDAS ones are ints but that should only affect wind speed bin allocation (I think). There are however some ws=0 entries, 2 that I can see wd=360/ws=0 entries in the NOAA data which have ws=1 in the MIDAS data. But that is only 2 values out of 744...but 2/744 is 0.2688% or 0.3% if rounded, so that's where the 0.3% calms comes from. But the differences in the plots are an order of magnitude greater.

Still trying to get my head round this. My MIDAS based plots all start with the same data file, so they should be the same - unless the float variant of ws makes a difference, - but it should not affect direction? Wind direction is to the nearest 10 degrees in both NOAA and MIDAS data. Later (have some things I need to do this pm) I will do a line by line comparison of the two data sets to confirm the wd entry is always the same.

GoodLug commented 1 year ago

I haven't been able to get worldmet to work, probably an NOAA problem as it can be very flaky:

met <- worldmet::importNOAA("038570-99999", 2021) Missing data for site 038570-99999 and year 2021 [1] "site(s) do not exist."

but I was able to manually download a csv file (https://www.ncei.noaa.gov/data/global-hourly/access/2021/03857099999.csv) after doing a map search on https://www.ncei.noaa.gov/access/search/data-search/global-hourly and given the numbers are the same (038570) I dare to presume it is the same data ie I have in a csv file exactly what you had in the 'met' data frame using the above code. To compare that to the MIDAS/CEDA data that I used

midas-open_uk-hourly-weather-obs_dv-202207_dorset_01319_isle-of-portland_qcv-1_2021.csv

which came from here (you may need to register, free to do, to see the page/files):

https://data.ceda.ac.uk/badc/ukmo-midas-open/data/uk-hourly-weather-obs/dataset-version-202207/dorset/01319_isle-of-portland/qc-version-1

I did the following

(a) checked the lat/long and elevation match: they do, or are at least near enough: NOAA 50.5166666/-2.45/52 vs MIDAS 50.522/-2.456/52 (the station IDs also sort of match, both contain [0]3857[0])

(b) split the NOAA WND column (for some inexplicable reason the wind data is lumped into one column) and extracted the speed and direction data into new columns, then converted speed from m/s x 10 (I think the x 10 is a legacy thing, floats stored ans ints etc to save space) to knots rounded to integers (to match the MIDAS data)

(c) cleaned up both the NOAA and MIDAS data so both had only date/time, wind direction and wind speed columns, and only July rows

(d) visually compared them side by side: they are the same. I confirmed this by subtracting the NOAA wd from the MIDAS wd and ditto for ws, and in all cases the result was zero except 2 wd values in the middle of the month where MIDAS has 0 and NOAA has -360 (an interesting concept I am still trying to head my head round. Perhaps the wind was blowing backwards from the north?).

Thus for all intents and purposes the MIDAS and NOAA data for July 2021 for the station 'Isle of Portland' are the same. Any differences in the plots must therefore be assumed to be down to differences in the plotting code. I will now have a closer look at what is going on there, more specifically why my local openair plot of my local data is the 'odd man out'.

Edit: I got ahead of myself in the comment about the -360 wind direction: that is the value in the subtraction cell, the MIDAS and NOAA values are 0 and 360 respectively...

GoodLug commented 1 year ago

Well, I know have a bit more information, but not necessarily clarity.

I have redone all the plots, paying great attention to data sanitation and organisation, and the openair anomaly in the plots (W and SW bars being visually all but identical, see image in original post) has now disappeared, the SW bar is now visibly less than the W bar, in keeping with the other plots:

But the frequencies percentages are still different. To compare the plots, I have pulled out the underlying data for the openair and python plots, and also got the same data from WRPlot View (a useful freeware windrose plotting software) and compared them side by side. The important thing is that these are all totally independent of each other, yet in an ideal world all will agree 100% if they are doing the sums right.

What we can see here is that while the percentages are different, the underlying counts are not. Note that python's windrose module as run by me doesn't have any calms, they end up in the N bin, and I had to reverse engineer the python counts from the percentages, as the module doesn't report the actual counts. But, apart from that, the counts are identical, as are the percentage frequencies for the python and WRPlot View outputs.

Because the counts are all the same, I think all three bits of code are doing the correct direction bin allocation, but openair is doing something different to convert the counts to percentages.

It is not a huge difference, openair seems to be around up to +/- 10% out (sometimes higher, sometimes lower), but it niggles a bit - I like to know why things are different. Of course, openair may be correct and the other two wrong, I just don't know. But it would be good if they all, as completely independent bits of code, agreed.

davidcarslaw commented 1 year ago

Thanks! I agree that this needs to be nailed. Strangely you get different openair percentages than I do (repeated below) - in the Interval4 column e.g. NE freq = 34 and percentage is 4.57 - same as python and WRPLOT in your table. What version of R/openair are you using?

round(tmp$data$Interval4, 3)
[1]  0.000  4.570 14.785  6.048  8.468 17.473 24.328 16.263  7.796

default          Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd freqs
  <fct>                <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int>
1 01 July 2021 to…     0          0         0         0     -999   0.3 5.0445      -95.6     2
2 01 July 2021 to…     0.672      2.15      3.76      4.57    45   0.3 5.0445      -95.6    34
3 01 July 2021 to…     1.21       4.30      8.60     14.8     90   0.3 5.0445      -95.6   110
4 01 July 2021 to…     0.806      3.09      4.70      6.05   135   0.3 5.0445      -95.6    45
5 01 July 2021 to…     1.08       4.70      7.26      8.47   180   0.3 5.0445      -95.6    63
6 01 July 2021 to…     0.806      3.76      8.20     17.5    225   0.3 5.0445      -95.6   130
7 01 July 2021 to…     1.34       7.53     11.3      24.3    270   0.3 5.0445      -95.6   181
8 01 July 2021 to…     2.96      13.3      15.9      16.3    315   0.3 5.0445      -95.6   121
9 01 July 2021 to…     0.806      4.70      7.39      7.80   360   0.3 5.0445      -95.6    58

GoodLug commented 1 year ago

Interesting I've noticed another difference that may or may bot be relevant - see the first column in my table:

> print(tmp$data)
# A tibble: 9 x 10
  default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd     freqs
  <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int[1d]>
1 all data     0          0         0         0     -999   0.3 9.8051      -95.6         2
2 all data     0.454      1.36      1.51      5.14    45   0.3 9.8051      -95.6        34
3 all data     0.484      1.69      2.90     13.3     90   0.3 9.8051      -95.6       110
4 all data     0.605      1.21      2.87      6.80   135   0.3 9.8051      -95.6        45
5 all data     0.242      1.81      3.02      7.62   180   0.3 9.8051      -95.6        63
6 all data     0          1.97      2.87     19.7    225   0.3 9.8051      -95.6       130
7 all data     0.605      2.42      5.20     21.9    270   0.3 9.8051      -95.6       181
8 all data     0.756      6.50     13.0      18.3    315   0.3 9.8051      -95.6       121
9 all data     0.121      1.33      3.27      7.02   360   0.3 9.8051      -95.6        58

The other difference between our plots and tables is the way we access the data and pull it into R, but I think that is probably a red herring, as the counts are the same, the difference happens at the point where the counts become percentages.

I do also have to own up to being a bit lazy about updating my core R installation, it is according to 'version' 4.2.1. I dread updates because of breaking changes but in the interests of nailing this I will do the update and report back.

BTW, any thoughts on why I am unable to access the data using worldmet, see earlier comment for error message, selectByDate also fails? Maybe it is also because I am using an old version of R...

PS openair is 2.17-0, which I believe is the current version

davidcarslaw commented 1 year ago

It's hard to know at this stage where the problem is at the moment. I will try on a different machine...

the importing of met data works OK for me. Sometimes the NOAA website is down.

> met <- worldmet::importNOAA("038570-99999", 2021)
> head(met)
# A tibble: 6 × 25
  code    station date                latitude longitude  elev    ws    wd air_temp atmos_pres
  <fct>   <fct>   <dttm>                 <dbl>     <dbl> <dbl> <dbl> <dbl>    <dbl>      <dbl>
1 038570… ISLE O… 2021-01-01 00:00:00     50.5     -2.45    52   5.1    60      1.7      1009 
2 038570… ISLE O… 2021-01-01 01:00:00     50.5     -2.45    52   4.1    60      1.5      1009.
3 038570… ISLE O… 2021-01-01 02:00:00     50.5     -2.45    52   4.1    50      1.5      1010.
4 038570… ISLE O… 2021-01-01 03:00:00     50.5     -2.45    52   4.1    60      1.5      1010.
5 038570… ISLE O… 2021-01-01 04:00:00     50.5     -2.45    52   4.6    50      0.7      1011.
6 038570… ISLE O… 2021-01-01 05:00:00     50.5     -2.45    52   2.1    40      0.6      1011.
# ℹ 15 more variables: visibility <dbl>, dew_point <dbl>, RH <dbl>, ceil_hgt <dbl>,
#   cl_1 <dbl>, cl_2 <dbl>, cl_3 <dbl>, cl <dbl>, cl_1_height <dbl>, cl_2_height <dbl>,
#   cl_3_height <dbl>, precip_12 <dbl>, precip_6 <dbl>, pwc <chr>, precip <dbl>

GoodLug commented 1 year ago

I have now updated R to the latest version, just about seems to be OK, also updated R Studio, but that is broken on Win 7 (64bit Pro) so have had to revert to my previous 2022 version, with the plot viewing panel currently broken. This is why I hate updates, more often than not they break things. I'm old enough to remember the old adage, if it ain't broken, don't fix it, and I positively detest the modern concept of built in obsolescence. Rant over.

I have now run openair on the latest R version, 4.3.1 (see bottom of screen grab below), and as you can see I am still getting the same tibble as before, with the Interval4 numbers still slightly out of whack:

  default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd     freqs
  <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int[1d]>
1 all data     0          0         0         0     -999   0.3 9.8051      -95.6         2
2 all data     0.454      1.36      1.51      5.14    45   0.3 9.8051      -95.6        34
3 all data     0.484      1.69      2.90     13.3     90   0.3 9.8051      -95.6       110
4 all data     0.605      1.21      2.87      6.80   135   0.3 9.8051      -95.6        45
5 all data     0.242      1.81      3.02      7.62   180   0.3 9.8051      -95.6        63
6 all data     0          1.97      2.87     19.7    225   0.3 9.8051      -95.6       130
7 all data     0.605      2.42      5.20     21.9    270   0.3 9.8051      -95.6       181
8 all data     0.756      6.50     13.0      18.3    315   0.3 9.8051      -95.6       121
9 all data     0.121      1.33      3.27      7.02   360   0.3 9.8051      -95.6        58
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)

Worldmet is still not working, but agree that may just be NOAA flakiness:

> met <- worldmet::importNOAA("038570-99999", 2021)
Missing data for site 038570-99999 and year 2021
[1] "site(s) do not exist."

davidcarslaw commented 1 year ago

This is strange indeed. Sorry you had to update to no avail! I tried this on two Mac systems and a Windows machine and everything is consistent with what I show at the top of this discussion. I'm going to have to investigate this further. Could you email your R object to me directly i.e. your portland2 data? use saveRDS(portland2, file = "portland2.rds"); send to david.carslaw@york.ac.uk?

Anyone else out there able to verify differing behaviours?

I would agree with "if it ain't broken, don't fix it", but equally if it is broken, then it needs fixing!

Thanks for helping to bottom this out...

GoodLug commented 1 year ago

I agree, we need to get to the bottom of this!

Thinking this through, on paper, we are both doing the same analysis using presumably the same code on the same data, and so we should get the same results - but we are not. However, I think we can narrow down where the discrepancy lies. All three of my methods, openair, python windrose and WRPlot, as well as your openair method, all get the same frequency count results. That suggest. the starting point, the raw data is the same, and the methods for doing the counts are all effectively the same, regardless of the exact detail of how they get there. So they should be, it's not rocket science! The point at which the discrepancy first appears is in the frequency percentages, in effect when they are normalised to a 0 to 100 range.

This is where the complexity of code in R is not our friend. It is a common problem, not just with R. Last winter I started dabbling with Home Assistant, quite possibly the largest python tu*d on the planet, in an attempt to monitor a heat pump, and gave up because it is just to freaking complicated. It also suffers from far too frequent 'updates' which invariably break things. I now have just a few lines of my own simple python code adding the data as text to a single csv file. None of that database nonsense is required, it just makes for pointless complexity. Modern computing has forgotten the virtues of simplicity. In the present case, openair has I believe at least 58 dependencies, and who knows, perhaps they each have 58 dependencies, and before we know it, it is turtles all the way down, for ever and ever, world without end.

But in this case, I believe we can narrow it down, to the code that converts the frequency counts to frequency percentages. In other words, we know which turtle we need to dissect. I have had a look at the windrose code in openair but can't readily see where the conversion gets done (or perhaps called from somewhere else). I think that is what we need to look at, at least as a starting point.

At the end of the day converting frequency counts to frequency percentages is about as simple as it gets: for each row, it is (row count divided by total count) multiplied by 100. Here they are, done via simple spreadsheet cell formulas as shown in the 'Correct %' column:

All four methods, python windrose, openair windrose, WRPlot and the spreadsheet should all produce the same result. But they don't, openair on my system produces different a result. As I said, I think we need to compare the code we each have on our respective machines to see if there are differences. If you can point me towards where I can find that code, I will post it here and we can make a comparison.

GoodLug commented 1 year ago

Staying in this thread as this is part of the problem (how to access exactly the same data): I have now had a look at the code inside importNOAA and can see the url that gets called. With the Portland 2021 data is is:

https://www.ncei.noaa.gov/data/global-hourly/access/2021/03857099999.csv

which is put together using:

file.name <- paste0( "https://www.ncei.noaa.gov/data/global-hourly/access/", year, "/", gsub(pattern = "-", "", code), ".csv" )

and the url is a valid url, if I put it into my browser I get a csv file (03857099999.csv).

But when it is called by importNOAA

met_data <- try(suppressWarnings(read_csv( file.name, col_types = cols_only( STATION = col_character(), DATE = col_datetime(format = ""), SOURCE = col_double(), etc

I always get:

importNOAA("038570-99999", 2021) Missing data for site 038570-99999 and year 2021 [1] "site(s) do not exist."

which comes from this inside importNOAA:

if (class(met_data)[1] == "try-error") { message(paste0("Missing data for site ", code, " and year ", year)) met_data <- NULL return() }

ie it does attempt to get the data, but fails, even though the data exists and can be downloaded via a browser...

It also seems the import() function in openair itself (the one that is supposed to be able to import and tidy up local csv files) has been disappeared...

mooibroekd commented 1 year ago

Anyone else out there able to verify differing behaviours?

No, I am able to copy David's results:

library(openair)

met <- worldmet::importNOAA("038570-99999", 2021)

windRose(selectByDate(met, month = 7))


tmp <- windRose(selectByDate(met, month = 7), angle=45, width=0.2, key.footer="knots", angle.scale = 022, main="Isle of Portland Jul 2021")


print(tmp$data)
#> # A tibble: 9 × 10
#>   default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd
#>   <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl>
#> 1 01 July…     0          0         0         0     -999   0.3 5.0445      -95.6
#> 2 01 July…     0.672      2.15      3.76      4.57    45   0.3 5.0445      -95.6
#> 3 01 July…     1.21       4.30      8.60     14.8     90   0.3 5.0445      -95.6
#> 4 01 July…     0.806      3.09      4.70      6.05   135   0.3 5.0445      -95.6
#> 5 01 July…     1.08       4.70      7.26      8.47   180   0.3 5.0445      -95.6
#> 6 01 July…     0.806      3.76      8.20     17.5    225   0.3 5.0445      -95.6
#> 7 01 July…     1.34       7.53     11.3      24.3    270   0.3 5.0445      -95.6
#> 8 01 July…     2.96      13.3      15.9      16.3    315   0.3 5.0445      -95.6
#> 9 01 July…     0.806      4.70      7.39      7.80   360   0.3 5.0445      -95.6
#> # ℹ 1 more variable: freqs <int[1d]>

round(tmp$data$Interval4, 3)
#> [1]  0.000  4.570 14.785  6.048  8.468 17.473 24.328 16.263  7.796

^{Created on 2023-08-14 with reprex v2.0.2}

I am not sure why @GoodLug is having problems with reading the met data. However, if the browser shows a CSV file, this does not automatically mean that R is able to download the file from the internet. Then again, updating seems to be working. Did you update all your packages as well? You also mentioned you had issues with RStudio, can you try running the code from R instead of using RStudio?

GoodLug commented 1 year ago

I'm still not having any luck with worldmet:

library(worldmet) met <- worldmet::importNOAA("038570-99999", 2021) Missing data for site 038570-99999 and year 2021 [1] "site(s) do not exist."

As far as I know, all the packages in use are up to date (I say in use, because I haven't done a global update, but have checked the ones in use). Specifically, I am using openair v 2.17-0 and worldmet 0.9.8. I don't think RStudio will be to blame, The version I am using, 2023.03.0 may be a few months old, but it works on Win 7 and at the end of the day it is a front end, the calculations are still done in R itself, and my R version is the latest version (4.3.1).

To refocus on the original (and still present) problem: openair is getting the percentages (and as a result, the plot) wrong when I run it. The underlying counts are correct. The problem happens when the counts are converted to percentages. This is bizarre, because all the complex stuff eg allocating the hourly values into speed/direction bins has already been done, and done correctly, the counts are correct.

There is also another odd thing, despite the plots being on the face of it derived using the same code on the same data: the mean wind speed is very different, 5.0445 vs 9.8051 in my openair plot. The correct value (calculated in an ordinary spreadsheet to ludicrous precision) for the Midas/NOAA data is 9.805107527. Perhaps openair calculates some sort of weighted mean, but whatever it does it gets the wrong answer. I should add that something nearer 10 rather than 5 is the more credible answer, we are talking about ordinary summer winds in the Channel, and a mean of 5 knots is not really credible. I am sure this year (2023) it will have been more than 10 knots, as it was a very windy (and wet) July. The hourly obs for this year from the Met Office WOW website looks like this:

The same chart for July 2021 looks like this

with a mean (calculated in a spreadsheet) of 9.78 knots.

I have to confess that I have spent more time recently trying to get the plots done in python, using a matplotlib bar chart plotted on a radial (polar) axis. This solves the fantail (Florence Nightibgale's coxcomb/bat wing) problem so often evident on wind rose plots, where the area as opposed to radial distance makes the stronger winds far too visually dominant. At least openair (and the original Admiralty wind roses, which I have always liked) get this right, the stronger winds are only a bit more visually dominant, and while this is inaccurate for wind speed (knots), in a way it is more representative of the wind force, which is in proportion to the square of the wind speed. My current plots look like this (the bins are based on a 'sailing scale', too little wind, a good sailing wind, and too much wind):

This is based on MIDAS rather than NOAA data, but I have pretty much established (see post above) that they are for all intents and purposes the same. The 2021 SW and W winds are where they should be, 17.5 and 24.4% respectively. One of the interesting things to come out of this is the prevailing winds are in fact westerlies, rather than the more commonly supposed south westerlies.

mooibroekd commented 1 year ago

Can you share your R code that you used for reading the NOAA csv file? I want to make sure we both have the same input. The interesting point here is that I am able to get the same results as Python and WRPLOT using {openair}. So, in order to be different I can think of two possibilities: the data is different or some underlying packages are in need of updating.

Please note that while {worldmet} and {openair} are up to date, these packages rely on other packages as well.

GoodLug commented 1 year ago

I normally plot UK MIDAS (CEDA) data rather than US NOAA data but I have compared the two data sets and they are effectively the same (see previous post here). However, as I have both sets of data, I can plot both, and the resulting plots and tables are the same. The R code I used is (I'm 99.9% sure this is the original code I used, it was still in the R file, just commented out as I revised the code for later plots. I tend to keep old code because all too often new code breaks things...):

library(openair)

portland2 <- read.csv('E:\\Weather Data\\Portland Misc\\3col-portland-jul-2021-MIDAS.csv')

maintitle <- 'Isle of Portland (MIDAS) Jul 2021'

tmp <- windRose(portland2, angle=45, width=0.2, key.footer='knots', angle.scale = 022, main = maintitle)

print(tmp$data)

All I need to do to plot the NOAA data is change MIDAS in the code to NOAA. The csv files are cleaned 3 column tables with the three columns being datetime, wd and ws. Although the data is the same, there is one very small difference: whereas the MIDAS csv file has a space between the date and the time, the NOAA csv file has a T between the date and time. Apart from that, they are identical, and as can be seen below, they produce the same plots and tables.

Curiously, when I run the code now, I can't exactly replicate my original plot (perhaps because I updated my R core inbetween?), but I still don't get the right plot, though I do get the right mean. Here are the two plots, with the underlying data tables below:

MIDAS table above, NOAA table below:

> print(tmp$data)
# A tibble: 9 x 10
  default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd     freqs
  <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int[1d]>
1 all data     0          0         0         0     -999   0.3 9.8051      -95.6         2
2 all data     0.454      1.36      1.51      5.14    45   0.3 9.8051      -95.6        34
3 all data     0.484      1.69      2.90     13.3     90   0.3 9.8051      -95.6       110
4 all data     0.605      1.21      2.87      6.80   135   0.3 9.8051      -95.6        45
5 all data     0.242      1.81      3.02      7.62   180   0.3 9.8051      -95.6        63
6 all data     0          1.97      2.87     19.7    225   0.3 9.8051      -95.6       130
7 all data     0.605      2.42      5.20     21.9    270   0.3 9.8051      -95.6       181
8 all data     0.756      6.50     13.0      18.3    315   0.3 9.8051      -95.6       121
9 all data     0.121      1.33      3.27      7.02   360   0.3 9.8051      -95.6        58

> print(tmp$data)
# A tibble: 9 x 10
  default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd     freqs
  <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int[1d]>
1 all data     0          0         0         0     -999   0.3 9.8051      -95.6         2
2 all data     0.454      1.36      1.51      5.14    45   0.3 9.8051      -95.6        34
3 all data     0.484      1.69      2.90     13.3     90   0.3 9.8051      -95.6       110
4 all data     0.605      1.21      2.87      6.80   135   0.3 9.8051      -95.6        45
5 all data     0.242      1.81      3.02      7.62   180   0.3 9.8051      -95.6        63
6 all data     0          1.97      2.87     19.7    225   0.3 9.8051      -95.6       130
7 all data     0.605      2.42      5.20     21.9    270   0.3 9.8051      -95.6       181
8 all data     0.756      6.50     13.0      18.3    315   0.3 9.8051      -95.6       121
9 all data     0.121      1.33      3.27      7.02   360   0.3 9.8051      -95.6        58

Once again, the last column, frequency counts, is correct, but Interval4, the frequency percentages, is out.

I did wonder whether there is something going on with the way calms are handled, resulting somehow in some sort of rotation, with values thereby ending up in the wrong bin, but discounted that because the counts are correct, the error only appears when the counts are converted to percentages. Furthermore, the calms, all 2 of them, represent only 0.3 percent of all the obs, but the percentage errors are greater eg 21.9 for W (above, wrong) vs 24.4 (correct).

GoodLug commented 1 year ago

I have now bitten the bullet, and updated as best as I can all the packages I have installed. Most updated OK, but there were warnings for Matrix, mgcv, nlme and purrr - "cannot remove prior installation of package" and "Permission denied" for each one, looks like previous version was reinstated. This is the sort of nonsense up with which we, or at least I, will not put. Life is too short...

mooibroekd commented 1 year ago

I suggest running the update command in the R gui (not RStudio) with administrator privileges. Depending on your setup, RStudio might have loaded some packages by default, preventing them from being updated.

I ran your statement for getting the NOAA data: met <- worldmet::importNOAA("038570-99999", 2021) on three different machines in three different networks and all the results are the same. The fact that you are not able to download the data might be caused by the same reasons why your calculations are off.

Curiously, when I run the code now, I can't exactly replicate my original plot

This is what worries me and perhaps points even more towards your R setup being (part) of the cause of this anomaly.

GoodLug commented 1 year ago

"This is what worries me and perhaps points even more towards your R setup being (part) of the cause of this anomaly."

Indeed, I am sure it is at least partly something to do with my setup! I have now updated the four 'permission denied' packages in R gui and they updated, thanks for the suggestion, I am also sure you are right that RStudio was somehow getting in the way of the updates.

Running the code with all packages up to date, I still get an anomalous result (same as I got earlier today, not the original error).

We still have two anomalies:

(1) my percentages (but not counts) are out, by various amounts, depending on when I ran the code

(2) the means displayed differ between my plots and yours and David's, by a considerable amount. In this case, I am pretty sure, for the reasons given above, I am getting the correct mean, and you are getting an erroneous mean.

mooibroekd commented 1 year ago

Are the difference in the means not related to differences in the units of the ws? The documentation for importNOAA clearly states that ws is imported as m/s. I see you are talking about knots.

As for the difference in the percentages, what if openair drops NA (missing values) for ws and wd and therefor uses a different total for the percentages?

GoodLug commented 1 year ago

The difference in the means - I think you are right, there is about a factor of 2 difference between m/s and knots (1m/s = 1.94384 knots). I had forgotten, I converted the NOAA data to knots right at the start so I was working in the same units in all the data.

There are very few missing data in recent csv files. If you look at the python wind roses I posted earlier today, you will see I have (xxx obs) in the title of the subplots, which is the number of valid obs used. A full 24h x 31d month is 744 obs, meaning there was one missing value in the 2020 data set, and two in the 2021 set. In the MIDAS data, they show up as wd 0, ws 0, in the NOAA data they show up as wd 360, ws 0. Thus in the MIDAS data, the wd ranges from 0 to 360, while in the NOAA data it ranges from 10 to 360, which could make it appear one set was rotated a bit compared to the other. But the key thing is my R code runs get the right counts, it is only the percentages that are wrong.

All that said - I have now confused myself as to how the two data sets represent calms as opposed to true missing values... In reality, though, it is the ws that defines a calm, meaning, as we have said before, wd neither matters nor has any meaning in a calm.

The other thing is I use pandas and numpy in python to process the data, and the code I use handles missing values as 'NaN' (not a number) meaning meaning ws=0 is a calm, ws=NaN is a missing value. I'm not sure how R deals with this side of things.

But the key thing is my R code runs get the right counts, it is only the percentages that are wrong.

GoodLug commented 1 year ago

I think I have narrowed down the windrose code to the bit that converts the frequency counts to percentages. It calls it 'normalise', ie if normalise is true, then do this


  if (normalise) {
    vars <- grep("Interval[1-9]", names(results))
    results$freq <- results[[max(vars)]]
    results$freq <- ave(results$freq, results[type], FUN = function(x) x/sum(x))
    results$norm <- results$freq/max(results$freq)
    results[, vars] <- results[, vars]/results[[max(vars)]]
    stat.lab <- "Normalised by wind sector"
    stat.unit <- ""
  }

Lines four and five in the above code both have a 'results divided by max (sum, all) results' which is the standard way of getting a proportion normalised to between 0 and 1, which can be multiplies by 100 to get percentages. But it is all so wrapped up in R gobbledegook that I have no idea what it really does. But that bit of code has to be the prime suspect for where the correct frequency counts get converted to the wrong percentages.

I still think the different ways that NOAA and MIDAS may or may not record calms and missing data, and consequently how R handles the data, is a red herring, because my code produces identical results on both data sets, ie in practice the data is treated identically (see output tables six posts ago).

There are other differences in the data output tables as well that may provide a clue. Here are David's output table (above) and mine (below):

default          Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd freqs
  <fct>                <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int>
1 01 July 2021 to…     0          0         0         0     -999   0.3 5.0445      -95.6     2
2 01 July 2021 to…     0.672      2.15      3.76      4.57    45   0.3 5.0445      -95.6    34
3 01 July 2021 to…     1.21       4.30      8.60     14.8     90   0.3 5.0445      -95.6   110
4 01 July 2021 to…     0.806      3.09      4.70      6.05   135   0.3 5.0445      -95.6    45
5 01 July 2021 to…     1.08       4.70      7.26      8.47   180   0.3 5.0445      -95.6    63
6 01 July 2021 to…     0.806      3.76      8.20     17.5    225   0.3 5.0445      -95.6   130
7 01 July 2021 to…     1.34       7.53     11.3      24.3    270   0.3 5.0445      -95.6   181
8 01 July 2021 to…     2.96      13.3      15.9      16.3    315   0.3 5.0445      -95.6   121
9 01 July 2021 to…     0.806      4.70      7.39      7.80   360   0.3 5.0445      -95.6    58

  default  Interval1 Interval2 Interval3 Interval4    wd  calm panel.fun mean.wd     freqs
  <fct>        <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <chr>       <dbl> <int[1d]>
1 all data     0          0         0         0     -999   0.3 9.8051      -95.6         2
2 all data     0.454      1.36      1.51      5.14    45   0.3 9.8051      -95.6        34
3 all data     0.484      1.69      2.90     13.3     90   0.3 9.8051      -95.6       110
4 all data     0.605      1.21      2.87      6.80   135   0.3 9.8051      -95.6        45
5 all data     0.242      1.81      3.02      7.62   180   0.3 9.8051      -95.6        63
6 all data     0          1.97      2.87     19.7    225   0.3 9.8051      -95.6       130
7 all data     0.605      2.42      5.20     21.9    270   0.3 9.8051      -95.6       181
8 all data     0.756      6.50     13.0      18.3    315   0.3 9.8051      -95.6       121
9 all data     0.121      1.33      3.27      7.02   360   0.3 9.8051      -95.6        58

The last column, the frequency counts, are identical, but all the Interval columns are different, despite running the same code on the same data... For example, in both tables the count of winds from the NE (315 degrees) is 121, but in my table the corresponding Interval4 value (the percentage) is 18.3 (incorrect), whereas in David's table it is 16.3% (correct, as compared with the same calculations done in other software). The panel.fun column is also different, but that looks like it is the mean wind speed for all the data, and we have established the result differs because we have used different units for wind speed - but that shouldn't, and doesn't, have any effect on counts of wind direction (last columns are the same).

Lastly, it may (or may not) be relevant that the percentage errors appear random rather than systematic.