davidcarslaw / deweather

Remove the influence of meteorology from atmospheric composition data
http://davidcarslaw.github.io/deweather/
GNU General Public License v2.0

[Bug]: Deweather truncating training data #20

Open meesh-loves-code opened 6 months ago

meesh-loves-code commented 6 months ago

Hi David,

I attended your very interesting and helpful advanced R training program with CASANZ last year. I have been using the deweather package v0.7.2.9101 and have been having some problems getting a sensible answer from it. The package appears to be truncating the training data to 10,000 data points instead of using 80% of the data. Is there any way I can fix this?

Thank you for your help!

testMod(
  clean_480,
  vars = c("trend", "ws", "wd", "rh", "temp", "weekday"),
  pollutant = "pm10",
  train.frac = 0.8,
  n.trees = NA
)
#> Error in testMod(clean_480, vars = c("trend", "ws", "wd", "rh", "temp", : could not find function "testMod"

I'm not sure why the error message has come up in the reprex!

ℹ Optimum number of trees is 4509
ℹ RMSE from cross-validation is 75.36
ℹ Percent increase in RMSE using test data is 73.7%

Created on 2024-05-06 with reprex v2.1.0


jack-davison commented 5 months ago

Hello!

Two things:

testMod() limited to 10,000 rows?

On your specific point: testMod() seems to limit the data to 10,000 rows when n.trees is NA. You can see the line that does this here:

https://github.com/davidcarslaw/deweather/blob/6c753ed2ff7a88eb1d4e4b16bfe8aa5494772c5c/R/testMod.R#L80C1-L86C1

@davidcarslaw are you able to comment on this? I imagine it's just a performance thing as testMod() works out the optimum number of trees. Note that it's not the first 10,000 or anything like that, it's a sample of 10,000 random observations.
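To illustrate the difference between "the first 10,000 rows" and "a random sample of 10,000 rows", here is a minimal base R sketch. It is not the package's actual code (see the link above for that); `dat`, `max_rows` and `dat_sub` are made-up names, and the data are simulated.

```r
set.seed(123)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for a large input data set (25,000 rows)
dat <- data.frame(
  id   = seq_len(25000),
  pm10 = rexp(25000, rate = 1 / 20)
)

max_rows <- 10000  # the cap discussed above

# Draw a random subset of rows (without replacement) only if the data exceed the cap
if (nrow(dat) > max_rows) {
  keep <- sample(nrow(dat), size = max_rows)
  dat_sub <- dat[keep, ]
} else {
  dat_sub <- dat
}

nrow(dat_sub)      # 10000
range(dat_sub$id)  # ids come from across the original data, not just rows 1 to 10000
```

Because the rows are drawn at random, the subset should still cover the full time span of the data, just more thinly.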

What's a reprex?

A reprex includes everything to recreate your issue, so not just the literal line of code creating the error message. It has likely failed because you haven't loaded {deweather}, nor have you defined clean_480 - if I ran what you've provided on my machine I'd get an error too, which is why your reprex has failed/is incomplete!

In other words:

Bad

thedata %>% group_by(hair_color) %>% summarise(height = mean(height))
#> Error in thedata %>% group_by(hair_color) %>% summarise(height = mean(height)): could not find function "%>%"

Created on 2024-05-19 with reprex v2.1.0

Good

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

thedata <- starwars

thedata %>% group_by(hair_color) %>% summarise(height = mean(height))
#> # A tibble: 12 × 2
#>    hair_color    height
#>    <chr>          <dbl>
#>  1 auburn          150 
#>  2 auburn, grey    180 
#>  3 auburn, white   182 
#>  4 black            NA 
#>  5 blond           177.
#>  6 blonde          168 
#>  7 brown            NA 
#>  8 brown, grey     178 
#>  9 grey            170 
#> 10 none             NA 
#> 11 white           156 
#> 12 <NA>            142.

Created on 2024-05-19 with reprex v2.1.0
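Applied to the original report, a complete reprex might look something like the untested sketch below. The simulated data frame is only an assumed stand-in for clean_480 (its real structure isn't shown in this issue), with column names taken from the vars in the testMod() call above plus a date column. In a real reprex the guard would simply be library(deweather); it is written with requireNamespace() here only so the sketch still runs on a machine without the package.

```r
# Simulated hourly data standing in for the private `clean_480` data set.
# All values are made up; only the column names mirror the original call.
set.seed(1)
n <- 2000
clean_480 <- data.frame(
  date = seq(as.POSIXct("2023-01-01 00:00", tz = "UTC"), by = "hour", length.out = n),
  pm10 = rgamma(n, shape = 2, scale = 10),
  ws   = runif(n, min = 0, max = 10),
  wd   = runif(n, min = 0, max = 360),
  rh   = runif(n, min = 20, max = 100),
  temp = rnorm(n, mean = 15, sd = 5)
)

# Only call testMod() if {deweather} is installed; in a real reprex,
# replace this guard with `library(deweather)` at the top of the script.
if (requireNamespace("deweather", quietly = TRUE)) {
  deweather::testMod(
    clean_480,
    vars = c("trend", "ws", "wd", "rh", "temp", "weekday"),
    pollutant = "pm10",
    train.frac = 0.8,
    n.trees = NA
  )
}
```

Copying a script like this and running reprex::reprex() should then render the code together with its output, since everything it needs is defined inside the script.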

Cheers, Jack