@michaelstrupler and Oscar: Can you please have a look at branch collarchunk and at the vignette switchingChambers? You can install the branch with devtools::install_github("bgctw/RespChamberProc", ref = "collarchunk").
I introduced the notion of a collar, which encapsulates chamber dimensions and lag-time. Different chunks can be associated with different collars, and the collar is by default printed in the facet heading of diagnostic plots together with the chunkId.
Does this help your setup? You can associate your valve with the chamber variable (e.g. by specifying it as the value of argument colIndex in subsetContiguous).
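For illustration, a minimal sketch of that association. The argument colIndex is described above; the valve column name and the colTime argument with its value are assumptions about your data:

# treat the valve column as the chamber/collar index when splitting into chunks
# ("valve" is a hypothetical column name of your logger output)
dsChunk <- subsetContiguous(ds, colTime = "TIMESTAMP", colIndex = "valve")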
Moreover, I introduced parallel fitting of different chunks using the furrr package. You can specify the number of processors with plan(multisession, workers = 4).
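For example (a short sketch; plan() comes from the future package and is re-exported by furrr):

library(furrr)                   # parallel purrr-style mapping
plan(multisession, workers = 4)  # fit chunks in 4 background R sessions
# ... run the chunk fits here ...
plan(sequential)                 # revert to sequential processing afterwards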
Thanks, it seems to work and I have adapted the script. However, since the implementation of these features, the script [https://github.com/oscarperezpriego/ChamberProc/blob/michael/RespChamberProc/main/Chamber_PICARROv6.r] has become extremely slow…
I have no idea what the reason could be. Does the furrr package replace the doSNOW package we had used for parallel computing before?
It seems that the slowdown is somehow related to the non-equidistant timesteps. If I create equidistant timesteps again, a resFit chunk is calculated in a few (i.e. < ca. 3) minutes, whereas for non-equidistant timesteps it takes around 10 times longer...
Do you get a time series with fewer records by creating equidistant time steps? If so, better thin your records before processing (discarding records until the next second has elapsed) rather than interpolating your measurements to seconds, as sketched below. Fitting concentration changes to very many records slows down computation, but note that thinning the time series might increase the uncertainty estimates.
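A minimal sketch of such thinning, assuming a POSIXct column TIMESTAMP and keeping only the first record within each second:

library(dplyr)
ds_thinned <- ds %>%
  mutate(.sec = floor(as.numeric(TIMESTAMP))) %>% # whole seconds since epoch
  group_by(.sec) %>%
  slice(1) %>%                                    # first record of each second
  ungroup() %>%
  select(-.sec)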
I implemented a thinning to nRecCentral = 20 records of the central part of the concentration time series when estimating the uncertainty due to leverage of records at the edges (sigmaBootLeverage). Please reinstall the package from the branch (devtools::install_github("bgctw/RespChamberProc", ref = "collarchunk")) and try whether this modification helps your case.
Another thought: I assume that "fitting a chunk" does not include the time to produce the plot. The modification above, of course, does not decrease plotting time. If plotting is the bottleneck, you or RespChamberProc need to implement a thinning before the plotting.
Another observation: I initially thought, wrongly, that you apply calcClosedChamberFluxForChunkSpecs with a missing lag-time (tlag = NA). This is not recommended because it is not robust and, additionally, can slow down computation, especially on long time series. Better determine the lag-time by inspecting a few chunks and subsequently fix it (as you did, as I see now).
In your script you apply a lag-time of 16 seconds for all gases. If you want to vary this by trace gas (as I remember from the videocon), you need to use a different collar_spec per trace gas.
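A hypothetical sketch of what that could look like, assuming collar_spec is a data.frame with one row per collar and columns for the chamber dimensions and the lag-time (the column names below are assumptions; check the switchingChambers vignette for the actual specification):

# one specification per trace gas, differing only in the lag-time
collar_spec_co2 <- data.frame(collar = 1, volume = 0.4 * 0.4 * 0.4,
                              area = 0.4 * 0.4, tlag = 16)
collar_spec_h2o <- transform(collar_spec_co2, tlag = 20) # assumed H2O lag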
Actually, the time series get even longer, as the missing timesteps are interpolated; see the first 10 lines of the non-interpolated timesteps (ds_subset) compared with the interpolated ones:

head(ds_subset)
DATE TIME FRAC_DAYS_SINCE_JAN1 FRAC_HRS_SINCE_JAN1 JULIAN_DAYS EPOCH_TIME
It does not seem that plotting is the bottleneck...
Thanks for the hint!
I am not able to debug the problem with the provided code (changing paths, loading other scripts and data from the file system, ...). For a minimal working example (MWE), please,

- provide a small dataset (~10 chunks, i.e. concentration measurement cycles) at a download location (e.g. inside a GitHub repository), then download and load it in your MWE code, e.g.

filename <- "example.zip"
download.file(
  "https://github.com/bgctw/RespChamberProc/blob/master/inst/genData/SMANIE_Chamber1_26032015.zip?raw=TRUE",
  filename, mode = "wb")
ds <- readDat(unz(filename, filename = unzip(filename, list = TRUE)[1, "Name"]), tz = "UTC")
head(ds)
...

- try to condense your code to the bare steps necessary to replicate the problem; having to load additional packages often shows that it is not condensed (e.g. provide the temperature with the example dataset created by you, rather than querying it from a service),
- only rely on relative paths, and only load results from there that have been produced by the MWE.
Regarding the equidistant time steps: producing a longer time series is problematic. The uncertainty of the estimated flux decreases with the amount of data provided, so when more records are supplied through interpolation, the subsequent uncertainty estimates are biased low.
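A toy illustration of this bias, unrelated to chamber data: linearly interpolating a noisy series to twice as many records shrinks the reported standard error of a fitted slope without adding any information.

set.seed(1)
t <- 1:100
y <- 0.1 * t + rnorm(100)       # noisy linear series
t2 <- seq(1, 100, by = 0.5)
y2 <- approx(t, y, xout = t2)$y # interpolated to twice the records
summary(lm(y ~ t))$coefficients["t", "Std. Error"]    # SE on original records
summary(lm(y2 ~ t2))$coefficients["t2", "Std. Error"] # smaller SE: biased low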
Thanks for your comments, please find my answers below:
On 20 Jun 2024, at 13:35, Thomas Wutzler wrote:

> provide a small dataset (~10 chunks, i.e. concentration measurement cycles) at a download location (e.g. inside a GitHub repository), then download and load it in your MWE code

I have created a first version of the package "ChamberProc" that uses "RespChamberProc" as a dependency:

devtools::install_github("michaelstrupler/ChamberProc")

The sample data is now included in the package; you can address it as "sample_PICARRO_data".

> try to condense your code to the bare steps necessary to replicate the problem; having to load additional packages often shows that it is not condensed (e.g. provide the temperature with the example dataset created by you, rather than querying it from a service)

The dataset I got from a colleague (and all the other datasets they have) does not contain these data; therefore we sample these additional values from a service.

> only rely on relative paths, and only load results from there that have been produced by the MWE

Ok.

I will send you the MWE as a .R file by mail. Best, Michael
I installed your package and tried running the code you sent by email (which is not quite minimal). I get an error on the weather line:
> Additional_Weather_Data <- getAdditionalWeatherVariables(latDeciDeg, lonDeciDeg,format(min(ds$TIMESTAMP),"%Y-%m-%d"),format(max(ds$TIMESTAMP),"%Y-%m-%d"))
Error in st_as_sf(sampling_location, coords = c("lon", "lat"), crs = 4326) :
could not find function "st_as_sf"
Maybe you need to adjust the package imports. I figured out that the function works after I explicitly load the sf package.
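A sketch of the likely fix, assuming the package uses roxygen2 to generate its NAMESPACE (otherwise edit NAMESPACE directly):

# In DESCRIPTION:
#   Imports: sf
# In the roxygen header of the function calling st_as_sf:
#' @importFrom sf st_as_sf
# or, alternatively, call the function fully qualified so no import is needed:
sampling_location_sf <- sf::st_as_sf(sampling_location,
                                     coords = c("lon", "lat"), crs = 4326)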
First, I create a dataset from inside your script (but without adjusting timesteps) so that we can discuss.
ds_mwe1 <- select(dsChunk, iChunk, collar, TIMESTAMP, CO2_dry, AirTemp, Pa)
put it on GitHub, and use it from a self-contained MWE:
library(RespChamberProc)
library(dplyr)
ds_mwe1 <- readRDS(url("https://github.com/bgctw/RespChamberProc/raw/collarchunk/develop/issue_cases/issue6_mwe1.rds?raw=TRUE"),"rb")
df <- filter(ds_mwe1, iChunk==4)
plot(df$CO2_dry ~ df$TIMESTAMP) # strongly autocorrelated
resFit <- calcClosedChamberFlux(df
, fRegress = c(lin = regressFluxLinear, tanh = regressFluxTanh, exp = regressFluxExp, poly= regressFluxSquare)
, debugInfo = list(omitEstimateLeverage = TRUE) # faster
, colConc = "CO2_dry", colTime = "TIMESTAMP" # column names: conc ~ time in seconds
, colTemp = "AirTemp", colPressure = "Pa" #Temp in degC, Pressure in Pa
, volume = 0.4*0.4*0.4, area = 0.4*0.4
, minTLag = 60, maxLag = 120
, concSensitivity = 0.01
)
# thin data
df2 <- slice(df, seq(1, nrow(df), 10))
plot(df2$CO2_dry ~ df2$TIMESTAMP) # still some autocorrelation
resFit2 <- calcClosedChamberFlux(df2
, fRegress = c(lin = regressFluxLinear, tanh = regressFluxTanh, exp = regressFluxExp, poly= regressFluxSquare)
, debugInfo = list(omitEstimateLeverage = TRUE) # faster
, colConc = "CO2_dry", colTime = "TIMESTAMP" # column names: conc ~ time in seconds
, colTemp = "AirTemp", colPressure = "Pa" #Temp in degC, Pressure in Pa
, volume = 0.4*0.4*0.4, area = 0.4*0.4
, minTLag = 60, maxLag = 120
, concSensitivity = 0.01
)
rbind(resFit, resFit2)
When I plot the CO2 time series of your first chunk, I see a large autocorrelation. Fitting a nonlinear model to such a time series, especially combined with break-point detection, is hard and takes longer.
After thinning the time series to take only each 10th point, there is still some of the autocorrelation. This means we did not throw away much information.
The fitting gets a lot faster. And the results are very similar.
flux fluxMedian sdFlux tLag lagIndex autoCorr AIC sdFluxRegression sdFluxLeverage iFRegress sdResid iqrResid r2 times model
<dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
1 0.204 NA 0.0354 61 50 0.981 -857. 0.0354 NA 2 0.380 0.466 0.951 <dbl [380]> <gnls>
2 0.233 NA 0.0287 62 6 0.909 40.0 0.0287 NA 2 0.381 0.526 0.951 <dbl [38]> <gnls>
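To inspect the autocorrelation mentioned above, one can look at the ACF of the detrended series; a sketch using the df and df2 objects from the code above:

res_full <- residuals(lm(CO2_dry ~ poly(as.numeric(TIMESTAMP), 2), data = df))
acf(res_full)  # slowly decaying ACF: strong autocorrelation
res_thin <- residuals(lm(CO2_dry ~ poly(as.numeric(TIMESTAMP), 2), data = df2))
acf(res_thin)  # decays faster, but some autocorrelation remains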
Note that I could not find any connection between the performance and equidistant time-steps, the measure you reported to speed things up.
I do not want to implement an automatic thinning of time series in RespChamberProc, but rather let the user decide. For your case I recommend this thinning (every 8th record, compared to every 10th before, to be conservative):
ds <- ds_mwe1 %>% group_by(iChunk) %>% slice(seq(1, n(), 8)) %>% ungroup()
I ran the above code again without debugInfo = list(omitEstimateLeverage = TRUE) and got:
# A tibble: 2 × 15
flux fluxMedian sdFlux tLag lagIndex autoCorr AIC sdFluxRegression sdFluxLeverage iFRegress sdResid iqrResid r2 times model
<dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
1 0.204 0.229 0.0354 61 50 0.981 -857. 0.0354 0.0188 2 0.380 0.466 0.951 <dbl [380]> <gnls>
2 0.233 0.234 0.0460 62 6 0.909 40.0 0.0287 0.0460 2 0.381 0.526 0.951 <dbl [38]> <gnls>
When using the full autocorrelated series, the estimate of the leverage fails, and the flux estimate is strongly influenced by the very negative concentration wiggle at the start, just after the time lag. Omitting just a few records (as implemented) does not omit this first wiggle of high leverage, which induces a bias in the estimated flux. This can be seen in the large difference between the original flux and the median of the leverage-bootstrapped flux (fluxMedian), despite a very low uncertainty (sdFlux). Hence, this is not only a performance issue but a larger one: critically examine the results of fitting time series with high autocorrelation.
On 24 Jun 2024, at 10:59, Thomas Wutzler wrote:

> Maybe you need to adjust the package imports. I figured out that the function works after I explicitly load the sf package.

Thanks for the hint, I have added the package "sf" to "Imports" in the DESCRIPTION file, so this should work now :)

> After thinning the time series to take only each 10th point, there is still some of the autocorrelation. This means we did not throw away much information. The fitting gets a lot faster. And the results are very similar.

Ok, I see, the thinning seems to make a lot of sense in terms of efficiency. However, when I apply the thinning to the whole dataset that was provided to me, i.e. via dsChunk %>% group_by(iChunk) %>% slice(seq(1, n(), 8)) %>% ungroup() on sample_PICARRO_data (included in the package), one has to select the tLag window carefully for the code to run without errors, especially for the H2O data, which seems to be of very bad quality.

> I do not want to implement an automatic thinning of time series in RespChamberProc, but rather let the user decide.

I agree that it is best to let the user decide about the thinning (and the selected parameters). I will create a vignette (my first ever ;) for the package, adapted for the PICARRO system, where I will recommend thinning the data if the sample points are that dense.
Using a common sensor (and logger) that connects to different chambers is a common setup.
Currently, RespChamberProc assumes all chunks are related to a common collar.
Provide a mapping of chunks to different collars/repIds and add the number somewhere to the plots.
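A minimal sketch of that mapping, reusing the iChunk and collar column names that appear earlier in this thread:

# map each measurement chunk to the collar/repId it was taken on
chunk_collar_map <- data.frame(iChunk = 1:4, collar = c(1, 2, 1, 2))
dsChunk <- dplyr::left_join(dsChunk, chunk_collar_map, by = "iChunk")
# the collar id can then be printed in the facet heading of diagnostic plots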