wqbc incorrectly assumes that MDL_UNIT is the same as UNIT in rems data

joethorley commented 3 years ago

wqbc currently assumes that MDL_UNIT is UNIT in rems data but this is not always the case. Consider selenium concentrations on the elk river.

There are two possible fixes.

The first is to alter the rems database so that the two are always the same.

The second is to update wqbc to require both and adjust independently. This is quite a lot of work to ensure the unit types are consistent and convertible and to perform the conversion and to provide informative error messages when this is not the case.

ateucher commented 3 years ago

So we have confirmation that this can't be altered in the database at this time, so the fix will have to happen here, or in rems. Here's a table of all the non-matching units in the two columns - we'll have to manually populate a third column with a conversion factor:

library(rems)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)

unit_tbl_hist <- hist_tbl %>% 
  select(UNIT, MDL_UNIT) %>% 
  filter(UNIT != MDL_UNIT) %>% 
  distinct() %>% 
  collect()

unit_tbl_2yr <- get_ems_data("2yr") %>% 
  select(UNIT, MDL_UNIT) %>% 
  filter(UNIT != MDL_UNIT) %>% 
  distinct()
#> Fetching data from cache...

unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>% 
  distinct()

knitr::kable(unit_tbl, "simple")

UNIT	MDL_UNIT
ug/m3	mg/m3
m	mm
g/m2	ug/cm2
mg/L	ug/L
ug/g wet	ug/g
mg/dm2/d	ug/cm2/d
ppm A	mg/m3
uS/cm	mho/cm
ug/g	mg/L
t/d	kg/d
mg/L	ppm W
m3/d	E6IG/d
m3/d	USG/d
m3/d	E3m3/d
ug/g	kg/t
m3/d	L/d
m3/s W	m3/d
adt/d	t/d
m3/d	IG/d
uS/cm	umho/cm
%(W/W)	%
mg/dm2/d	ug/dm2/m
ug/g	mg/kg
mg/dm2/d	ug/dm2/d
mg/L	None
mg/L	Vehicles
m3/s W	USG/min
m3/d	E6L/d
CFU/100mL	L/min
m3/s W	m3/min
mg/L	%
ueq/L	mg/L
ug/g	%
ug/g wet	mg/kg
m3/d	m3/min W
m3/d	m3/s W
ug/g	ug/L
ug/g wet	mg/kg wet
mg/L	ng/L
mg/L	mg/dm2/d
m3/s W	L/min
mg/dm2/d	mg/dm2/m
m3/d	m3/w
C	F
m3/min A	m3/s A
m3/s W	IG/d
mg/L	ug/cm2
ug/g	ug/g wet
AU/cm	mg/L
mg/L	mg
mg/L	g/L
ug/m3	ug
ug/g	mg/kg wet
% (Mortality)	%
kPa	mm Hg
MPN/g	MPN/100gm
mg/L	mg/kg
m3/s W	L/s
None	N/A
mg/L	% Saturation
NTU	JTU
m3/min A	m3/s W
MPN/100gm	F
%	% (Mortality)
CFU/100mL	mg/L
m3/d	m3/m
ppt	ug/kg
C	h
% (Mortality)	%(V/V)
mg/L	ug
mg/L	mg/m2
m3/d	mV
%	%(W/W)
mg/L	m3/d
mg/L	ug/g
g/m2	mg/m2
pH units	Col. Unit
m3/d	m3/y
ug/g wet	mg/L
m3/d	uS/cm
kg/adt	mg/L
mg/dm2/d	mg/L
m3/s W	m3/h
ug/m3	t/d
MPN/100mL	F
%(V/V)	%
None	uS/cm
m	ft
t/d	mg/L
mg/L	ml/L
kg/adt	% (Recovery)
%(V/V)	% (Mortality)
CFU/100mL	MPN/100mL
MPN/100mL	mg/L
mg/L	N/A
m3/d	mg/L
%(W/W)	mg/L
MPN/100mL	ug/g
pH units	mg/g
C	N/A
g/m2	ug/L
ug/g	kg/adut
t/d	adt/d
uS/cm	mg/L
Col. Unit	pH units
m	m3/s W
ug/g	None
ug/m3	mg/L
pH units	None
Col. Unit	SWU
kg/adt	kg/d
mg/L	uS/cm
ug/m3	mg/dm2/d
mg/L	MPN/100mL
%(W/W)	% (Mortality)
mg/L	C
C	mg/L
adt/d	Col. Unit
C	pH units
pH units	mg/L
t/d	Ton/d
mg/L	kg/d
MPN/100mL	CFU/100mL
ml/L	ug/L
m	N/A
pH units	uS/cm
%	ppm (S)

^{Created on 2021-02-09 by the reprex package (v1.0.0)}

joethorley commented 3 years ago

wqbc has a convert_values() function which recognizes quite a few units and may deal with most of cases - it would be useful to add a count of the frequency of those various unit combinations (for which both values are provided) to the table above for scoping the magnitude of the problem - a lot are not-transferable and must be errors ie C into pH Units.

https://rdrr.io/github/bcgov/wqbc/man/convert_values.html

ateucher commented 3 years ago

Good point. There are definitely some that are nonsensical (mg/L <=> Vehicles??). Here they are with counts:

library(rems)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)

unit_tbl_hist <- hist_tbl %>% 
  filter(UNIT != MDL_UNIT) %>% 
  count(UNIT, MDL_UNIT) %>% 
  collect()

unit_tbl_2yr <- get_ems_data("2yr", dont_update = TRUE) %>% 
  filter(UNIT != MDL_UNIT) %>% 
  count(UNIT, MDL_UNIT)
#> Fetching data from cache...

unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>% 
  group_by(UNIT, MDL_UNIT) %>% 
  summarise(n = sum(n)) %>% 
  arrange(-n)
#> `summarise()` has grouped output by 'UNIT'. You can override using the `.groups` argument.

knitr::kable(unit_tbl, "simple")

UNIT	MDL_UNIT	n
mg/L	ug/L	1751998
ug/m3	mg/m3	118938
t/d	kg/d	78182
m3/s W	USG/min	68540
m3/d	E3m3/d	35648
ug/g	mg/kg	23386
m3/d	E6L/d	11060
ug/g wet	mg/kg	9743
m	mm	6166
m3/d	IG/d	5529
g/m2	ug/cm2	4661
m3/s W	L/min	3817
ug/g	ug/L	3187
m3/d	USG/d	2830
ueq/L	mg/L	2745
CFU/100mL	L/min	1819
ug/g wet	mg/kg wet	1731
m3/d	L/d	1585
ppm A	mg/m3	1484
mg/L	mg/dm2/d	1443
ug/g wet	ug/g	1070
m3/s W	m3/min	1038
uS/cm	umho/cm	977
m3/s W	m3/d	768
%(W/W)	%	690
m3/d	m3/min W	684
ug/g	mg/L	582
uS/cm	mho/cm	499
mg/L	ppm W	423
mg/dm2/d	ug/dm2/d	345
mg/dm2/d	ug/dm2/m	343
ug/g	%	330
m3/d	m3/s W	315
mg/L	ug/cm2	230
ug/g	kg/t	230
ug/m3	ug	223
m3/s W	IG/d	219
mg/dm2/d	ug/cm2/d	203
adt/d	t/d	190
mg/L	%	186
m3/d	E6IG/d	176
mg/L	None	166
mg/dm2/d	mg/dm2/m	155
ug/g	ug/g wet	151
m3/s W	L/s	130
mg/L	ng/L	106
m3/min A	m3/s A	99
kPa	mm Hg	84
mg/L	mg	74
mg/L	g/L	70
m3/d	m3/w	60
None	N/A	60
AU/cm	mg/L	57
g/m2	mg/m2	57
%	%(W/W)	40
C	F	31
C	h	31
mg/L	ug/g	31
m	ft	29
ppt	ug/kg	28
CFU/100mL	mg/L	27
t/d	mg/L	25
MPN/g	MPN/100gm	24
ug/g	mg/kg wet	24
mg/L	mg/m2	22
mg/L	mg/kg	21
CFU/100mL	MPN/100mL	20
m3/min A	m3/s W	18
mg/L	% Saturation	18
% (Mortality)	%(V/V)	16
mg/L	Vehicles	15
NTU	JTU	15
pH units	Col. Unit	14
m3/d	uS/cm	13
m3/d	m3/y	12
m3/s W	m3/h	11
mg/L	ug	11
mg/L	m3/d	9
%	% (Mortality)	8
g/m2	ug/L	8
MPN/100mL	mg/L	8
ug/g	kg/adut	8
ug/g wet	mg/L	8
%(V/V)	% (Mortality)	7
ug/g	None	7
%(V/V)	%	6
m3/d	m3/m	6
kg/adt	mg/L	5
mg/dm2/d	mg/L	5
pH units	mg/L	5
ug/m3	mg/dm2/d	5
ug/m3	t/d	5
uS/cm	mg/L	5
C	N/A	4
m	m3/s W	4
m3/d	mV	4
mg/L	N/A	4
None	uS/cm	4
%	ppm (S)	3
MPN/100mL	ug/g	3
% (Mortality)	%	2
%(W/W)	mg/L	2
C	mg/L	2
Col. Unit	SWU	2
mg/L	MPN/100mL	2
mg/L	uS/cm	2
MPN/100mL	CFU/100mL	2
pH units	mg/g	2
pH units	None	2
%(W/W)	% (Mortality)	1
adt/d	Col. Unit	1
C	pH units	1
Col. Unit	pH units	1
kg/adt	% (Recovery)	1
kg/adt	kg/d	1
m	N/A	1
m3/d	mg/L	1
mg/L	C	1
mg/L	kg/d	1
mg/L	ml/L	1
ml/L	ug/L	1
MPN/100gm	F	1
MPN/100mL	F	1
pH units	uS/cm	1
t/d	adt/d	1
t/d	Ton/d	1
ug/m3	mg/L	1

^{Created on 2021-02-10 by the reprex package (v1.0.0)}

joethorley commented 3 years ago

Units that are currently recognised by wqbc and can be used for permitted conversions:

c("ng/L", "ug/L", "mg/L", "g/L", "kg/L", "pH", "degC", "C", "CFU/dL", "MPN/dL", "CFU/100mL", "MPN/100mL", "CFU/g", "MPN/g", "CFU/mL", "MPN/mL", "Col.unit", "Rel", "NTU", "m", "uS/cm")

Most cases are dealt with by mg/L and ug/L.

ateucher commented 3 years ago

In terms of internal package structure and future extensibility (i.e., adding more conversions as needed), would it be more efficient to use a lookup table (i.e., above, with a column of conversion factors)?

joethorley commented 3 years ago

I think it would be more efficient to expand the units recognized by wqbc - the code to do the conversions is relatively elegant.

All we have to do is to add them to lookup_units(), get_unit_multiplier () and get_unit_type().

convert_values <- function(x, from, to, messages) {
  from <- substitute_units(from, messages = messages)
  to <- substitute_units(to, messages = messages)

  x <- x * get_unit_multiplier(from) / get_unit_multiplier(to)

  bol <- from != to & get_unit_type(from) != get_unit_type(to)
  bol <- is.na(bol) | bol

  if (any(bol)) {
    warning(sum(bol), " values have inconvertible units")
    is.na(x[bol]) <- TRUE
  }
  x
}

lookup_units <- function() {
  c(
    "ng/L", "ug/L", "mg/L", "g/L", "kg/L", "pH", "degC", "C",
    "CFU/dL", "MPN/dL", "CFU/100mL", "MPN/100mL", "CFU/g", "MPN/g", "CFU/mL", "MPN/mL",
    "Col.unit", "Rel", "NTU", "m", "uS/cm"
  )
}

get_unit_multiplier <- function(x) {
  units <- c(
    "ng/L" = 10^-9, "ug/L" = 10^-6, "mg/L" = 10^-3,
    "g/L" = 1, "kg/L" = 10^3,
    "pH" = 1,
    "degC" = 1, "C" = 1, "NTU" = 1,
    "Col.unit" = 1, "Rel" = 1,
    "CFU/dL" = 1, "CFU/100mL" = 1, "CFU/mL" = 0.01, "MPN/dL" = 1,
    "MPN/100mL" = 1, "MPN/mL" = 0.01, "CFU/g" = 0.01, "MPN/g" = 0.01,
    "m" = 1,
    "uS/cm" = 1
  )
  x <- units[x]
  names(x) <- NULL
  x
}

get_unit_type <- function(x) {
  type <- list(
    "concentration" = c("ng/L", "ug/L", "mg/L", "g/L", "kg/L"),
    "pH" = "pH",
    "Colour" = c("Col.unit", "Rel"),
    "Temperature" = c("degC", "C"),
    "Turbidity" = "NTU",
    "Coli" = c("CFU/dL", "CFU/100mL", "CFU/mL", "MPN/dL", "MPN/100mL", "MPN/mL", "CFU/g", "MPN/g"),
    "Distance" = "m",
    "Conductivity" = "uS/cm"
  )

  type <- unlist(type)
  names <- sub("\\d$", "", names(type))
  values <- type
  type <- names
  names(type) <- values

  x <- type[x]
  names(x) <- NULL
  x
}

ateucher commented 3 years ago

I was looking at it, and I agree it's very elegant! One thing I wonder is if there are units within a unit type that are incompatible, and how we would treat those

joethorley commented 3 years ago

Can you provide an example?

ateucher commented 3 years ago

I can't provide an example because I'm just not sure - I don't know what a lot of those units even are or what type of quantity they represent... @JessicaPenno @KarHarker can you provide input on what unit combinations in the above table are important and/or don't make sense?

One other question (which I haven't thought through, so this may be a silly question) - Using the multiplier approach currently implemented could get complicated when both the numerator and denominator are different (e.g., ug/g <=> mg/kg).

ateucher commented 3 years ago

I wonder if we can leverage the {units} package?

joethorley commented 3 years ago

{units} may be the way to go but I suspect it will not have all the units used by EMS ie (MPN) plus the multiplier approach should work with different numerators and denominators if the units are simple multiples of each other for example 1 ug/g = 1 mg/kg so in this case the relative multiplier is 1 (we would just need to hard code in all combinations of the numerators and denominators).

ateucher commented 3 years ago

It doesn't have all the units, but it's easy to define new units and set conversion factors. Here's some experimenting I did, I think it looks pretty good. It takes care of the vast majority of them, and the rest are either really specific (e.g. relying on the molar mass of the chemical etc) or errors/nonsensical:

library(rems)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(units)
#> udunits system database from /usr/local/share/udunits

install_symbolic_unit("MPN")
install_conversion_constant("MPN", "CFU", 1)
install_symbolic_unit("NTU")
install_conversion_constant("NTU", "JTU", 1)
install_conversion_constant("US_liquid_gallon", "USG", 1)
install_conversion_constant("UK_liquid_gallon", "IG", 1)
# These petroleum measures are actually only ever used
# as volumes (1m3 = 0.001 E3m3 = 1e-6), but can't use E3m3 
# to install the unit so have to do it as m, units takes 
# care of the rest when it is m3 <=> E3m3 etc.
install_conversion_constant("m", "E3m", .1)
install_conversion_constant("m", "E6m", .01)
install_conversion_constant("L", "E6L", 1e-6)
#> set_units(set_units(1, "m3"), "E3m3")
# 0.001 [E3m3]
#> set_units(set_units(1, "m3"), "E6m3")
# 1e-06 [E6m3]

convert_values2 <- function(x, from, to) {
  to <- clean_unit(to)
  from <- clean_unit(from)
  x <- units::set_units(x, from, mode = "standard")
  as.numeric(units::set_units(x, to, mode = "standard"))
}

clean_unit <- function(x) {
  # Remove trailing A, W, wet etc. as well as percent type (V/V, W/W, Moratlity)
  # Assuming they are not imporant in the unit conversion??
  gsub("\\s*(W|wet|A|\\(W/W\\)|\\(V/V\\)|\\(Mortality\\))\\s*$", "", x)
}

con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)

unit_tbl_hist <- hist_tbl %>% 
  filter(UNIT != MDL_UNIT) %>% 
  count(UNIT, MDL_UNIT) %>% 
  collect()

unit_tbl_2yr <- get_ems_data("2yr", dont_update = TRUE) %>% 
  filter(UNIT != MDL_UNIT) %>% 
  count(UNIT, MDL_UNIT)
#> Fetching data from cache...

unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>% 
  group_by(UNIT, MDL_UNIT) %>% 
  summarise(n = sum(n)) %>% 
  arrange(-n)
#> `summarise()` has grouped output by 'UNIT'. You can override using the `.groups` argument.

unit_tbl$converted_from_1 <- NA_real_

for (i in seq_len(nrow(unit_tbl))) {
  unit_tbl$converted_from_1[i] <- tryCatch(
    convert_values2(1, unit_tbl$UNIT[i], unit_tbl$MDL_UNIT[i]), 
    error = function(e) NA_real_
  )
}
#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()

#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()

#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()

knitr::kable(arrange(unit_tbl, is.na(converted_from_1), desc(n)))

UNIT	MDL_UNIT	n	converted_from_1
mg/L	ug/L	1751998	1.000000e+03
ug/m3	mg/m3	118938	1.000000e-03
t/d	kg/d	78182	1.000000e+03
m3/s W	USG/min	68540	1.585032e+04
m3/d	E3m3/d	35648	1.000000e-03
ug/g	mg/kg	23386	1.000000e+00
m3/d	E6L/d	11060	1.000000e-03
ug/g wet	mg/kg	9743	1.000000e+00
m	mm	6166	1.000000e+03
m3/d	IG/d	5529	2.199692e+02
g/m2	ug/cm2	4661	1.000000e+02
m3/s W	L/min	3817	6.000000e+04
m3/d	USG/d	2830	2.641720e+02
ug/g wet	mg/kg wet	1731	1.000000e+00
m3/d	L/d	1585	1.000000e+03
ug/g wet	ug/g	1070	1.000000e+00
m3/s W	m3/min	1038	6.000000e+01
m3/s W	m3/d	768	8.640000e+04
%(W/W)	%	690	1.000000e+00
m3/d	m3/min W	684	6.944000e-04
mg/dm2/d	ug/dm2/d	345	1.000000e+03
ug/g	%	330	1.000000e-04
m3/d	m3/s W	315	1.160000e-05
ug/g	kg/t	230	1.000000e-03
m3/s W	IG/d	219	1.900534e+07
mg/dm2/d	ug/cm2/d	203	1.000000e+01
ug/g	ug/g wet	151	1.000000e+00
m3/s W	L/s	130	1.000000e+03
mg/L	ng/L	106	1.000000e+06
m3/min A	m3/s A	99	1.666670e-02
kPa	mm Hg	84	7.500616e+00
mg/L	g/L	70	1.000000e-03
g/m2	mg/m2	57	1.000000e+03
%	%(W/W)	40	1.000000e+00
m	ft	29	3.280840e+00
ppt	ug/kg	28	1.000000e-03
ug/g	mg/kg wet	24	1.000000e+00
CFU/100mL	MPN/100mL	20	1.000000e+00
m3/min A	m3/s W	18	1.666670e-02
% (Mortality)	%(V/V)	16	1.000000e+00
NTU	JTU	15	1.000000e+00
m3/s W	m3/h	11	3.600000e+03
%	% (Mortality)	8	1.000000e+00
%(V/V)	% (Mortality)	7	1.000000e+00
%(V/V)	%	6	1.000000e+00
% (Mortality)	%	2	1.000000e+00
MPN/100mL	CFU/100mL	2	1.000000e+00
%(W/W)	% (Mortality)	1	1.000000e+00
t/d	Ton/d	1	1.102311e+00
ug/m3	mg/L	1	1.000000e-06
ug/g	ug/L	3187	NA
ueq/L	mg/L	2745	NA
CFU/100mL	L/min	1819	NA
ppm A	mg/m3	1484	NA
mg/L	mg/dm2/d	1443	NA
uS/cm	umho/cm	977	NA
ug/g	mg/L	582	NA
uS/cm	mho/cm	499	NA
mg/L	ppm W	423	NA
mg/dm2/d	ug/dm2/m	343	NA
mg/L	ug/cm2	230	NA
ug/m3	ug	223	NA
adt/d	t/d	190	NA
mg/L	%	186	NA
m3/d	E6IG/d	176	NA
mg/L	None	166	NA
mg/dm2/d	mg/dm2/m	155	NA
mg/L	mg	74	NA
m3/d	m3/w	60	NA
None	N/A	60	NA
AU/cm	mg/L	57	NA
C	F	31	NA
C	h	31	NA
mg/L	ug/g	31	NA
CFU/100mL	mg/L	27	NA
t/d	mg/L	25	NA
MPN/g	MPN/100gm	24	NA
mg/L	mg/m2	22	NA
mg/L	mg/kg	21	NA
mg/L	% Saturation	18	NA
mg/L	Vehicles	15	NA
pH units	Col. Unit	14	NA
m3/d	uS/cm	13	NA
m3/d	m3/y	12	NA
mg/L	ug	11	NA
mg/L	m3/d	9	NA
g/m2	ug/L	8	NA
MPN/100mL	mg/L	8	NA
ug/g	kg/adut	8	NA
ug/g wet	mg/L	8	NA
ug/g	None	7	NA
m3/d	m3/m	6	NA
kg/adt	mg/L	5	NA
mg/dm2/d	mg/L	5	NA
pH units	mg/L	5	NA
ug/m3	mg/dm2/d	5	NA
ug/m3	t/d	5	NA
uS/cm	mg/L	5	NA
C	N/A	4	NA
m	m3/s W	4	NA
m3/d	mV	4	NA
mg/L	N/A	4	NA
None	uS/cm	4	NA
%	ppm (S)	3	NA
MPN/100mL	ug/g	3	NA
%(W/W)	mg/L	2	NA
C	mg/L	2	NA
Col. Unit	SWU	2	NA
mg/L	MPN/100mL	2	NA
mg/L	uS/cm	2	NA
pH units	mg/g	2	NA
pH units	None	2	NA
adt/d	Col. Unit	1	NA
C	pH units	1	NA
Col. Unit	pH units	1	NA
kg/adt	% (Recovery)	1	NA
kg/adt	kg/d	1	NA
m	N/A	1	NA
m3/d	mg/L	1	NA
mg/L	C	1	NA
mg/L	kg/d	1	NA
mg/L	ml/L	1	NA
ml/L	ug/L	1	NA
MPN/100gm	F	1	NA
MPN/100mL	F	1	NA
pH units	uS/cm	1	NA
t/d	adt/d	1	NA

^{Created on 2021-02-11 by the reprex package (v1.0.0)}

joethorley commented 3 years ago

Wow I didn't know you could set units in {units} - then that is definitely the way forward - in fact it should be used to do all the unit handling - which begs the question should the unit handling reside in rems?

ateucher commented 3 years ago

It certainly can live in rems. I think it would have to be a utility function that the user calls rather than happening automatically, as there are ways that a user can pull the data into R where we wouldn't be able to intervene with the unit conversion (e.g., calling dbGetQuery on the sqlite database). This would result in different behaviour for different ways of accessing the data

joethorley commented 3 years ago

Good point - maybe then its conceptually more consistent with the various package philosophies if it resides in wqbc?

ateucher commented 3 years ago

I've implemented the basics in rems. Unfortunately it's not that fast - about 11 seconds on 5000 rows...

joethorley commented 3 years ago

That's too bad - could it be run as part of the database creation process which is lengthy already anyway?

ateucher commented 3 years ago

It could, since for the SQLite backend the data does traverse through R - though ~2M rows with inconsistent units will need to be fixed, and if that timing scales linearly it will take over an hour! Also, when DuckDB (which is looking more and more promising) matures I do intend to use it as a backend - and it will create the db directly from the csv, so any R processing is then not an option.

joethorley commented 3 years ago

We can see how long it takes within wqbc/shinyrems workflows.

ateucher commented 3 years ago

There is an unexported function in units - ud_convert() - which is way faster than the way I am doing it now:

library(microbenchmark)
library(units)
#> udunits system database from /usr/local/share/udunits

microbenchmark(
  units:::ud_convert(1, "mg/L", "ug/L"), 
  set_units(set_units(1, "mg/L"), "ug/L")
)
#> Unit: microseconds
#>                                     expr      min        lq       mean
#>    units:::ud_convert(1, "mg/L", "ug/L")   25.729   31.9175   37.05395
#>  set_units(set_units(1, "mg/L"), "ug/L") 1995.143 2111.7080 2388.55648
#>     median       uq      max neval
#>    35.0855   41.086  145.008   100
#>  2166.6645 2521.354 6220.956   100

^{Created on 2021-02-16 by the reprex package (v1.0.0)}

I'll open an issue in units to see if Edzer would consider exporting it; otherwise in the meantime I'll use :::.

ateucher commented 3 years ago

https://github.com/r-quantities/units/issues/266

ateucher commented 3 years ago

Ok I think I have a pretty good solution at https://github.com/bcgov/rems/pull/57

joethorley commented 3 years ago

Is it fast enough to apply universally to all the data?

ateucher commented 3 years ago

It does 1M rows in < 1s, but I won't do it at the db creation stage due to the likely future duckdb migration I mentioned above.

joethorley commented 3 years ago

Confirm closed by https://github.com/bcgov/rems/releases/tag/v0.7.0

bcgov / wqbc

wqbc incorrectly assumes that MDL_UNIT is the same as UNIT in rems data #158