Closed DominiqueMakowski closed 3 years ago
No, ^ generally means "start of the line". It only means "except" inside of square brackets--"[^asdf]" means "any character that is not one of a, s, d, or f".
You want ^(?!Sepal)(.*)$. This is a "negative lookahead". It matches anything that doesn't start with Sepal.
To break that down, ^ means start of a line, and the (?! ) construct is the negative lookahead. It looks for whatever pattern is inside the parentheses there--Sepal. The next () is a capture group indicating what we want to retain from the match. The . means any character and the * means zero or more times, so .* means any number of any characters. The $ means end of line. Together this means, "find any strings that don't start with Sepal and return the whole line".
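The pattern can be checked outside of parameters() with base R's grepl(). This is only a minimal sketch that uses the iris column names as stand-ins for parameter names; note that base R needs perl = TRUE for lookaheads, because its default regex engine does not support them.

```r
# Column names of the built-in iris data, used as stand-ins for parameter names
x <- names(iris)

# Negative lookahead: keep everything that does not start with "Sepal".
# perl = TRUE is required; the default engine rejects "(?!...)".
keep <- grepl("^(?!Sepal)(.*)$", x, perl = TRUE)

not_sepal <- x[keep]
not_sepal
#> [1] "Petal.Length" "Petal.Width"  "Species"
```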
This looks like the function is working correctly. It returned the only parameter that matches your regex.
Another thing to note--regex works one character at a time. If you want some modifier to apply to multiple characters, you need to wrap them in ().
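A small illustration of the grouping point, using made-up strings: without parentheses the + quantifier applies only to the single character before it, while with parentheses it applies to the whole group.

```r
x <- c("ab", "abab", "abbb")

# Without a group, + applies only to the preceding character, "b"
ungrouped <- grepl("^ab+$", x)

# With a group, + applies to the whole sequence "ab"
grouped <- grepl("^(ab)+$", x)

ungrouped
#> [1]  TRUE FALSE  TRUE
grouped
#> [1]  TRUE  TRUE FALSE
```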
The secret source of my regex knowledge is https://regex101.com. I test every pattern I write there.
I see, my bad.
we really should come up with some wrapper like easyregex(starts_with = NULL, ends_with = NULL, contains = NULL, except = NULL, ...) for the simple and common use cases ^^
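To make the idea concrete, here is a rough sketch of what such a wrapper might look like. easyregex() does not exist in this form; the body below is purely hypothetical. It assumes that each argument is a character vector of regex fragments (so special characters such as . are not escaped), that except means "does not contain", and that the resulting pattern is used with perl = TRUE.

```r
# Hypothetical helper: builds a PCRE pattern from simple building blocks
easyregex <- function(starts_with = NULL, ends_with = NULL,
                      contains = NULL, except = NULL) {
  # Combine several alternatives into one non-capturing group
  alt <- function(x) paste0("(?:", paste(x, collapse = "|"), ")")

  pattern <- "^"
  # Negative lookahead: the string must not contain any `except` fragment
  if (!is.null(except)) pattern <- paste0(pattern, "(?!.*", alt(except), ")")
  # Positive lookahead: the string must contain a `contains` fragment
  if (!is.null(contains)) pattern <- paste0(pattern, "(?=.*", alt(contains), ")")
  if (!is.null(starts_with)) pattern <- paste0(pattern, alt(starts_with))
  pattern <- paste0(pattern, ".*")
  if (!is.null(ends_with)) pattern <- paste0(pattern, alt(ends_with))
  paste0(pattern, "$")
}

x <- names(iris)

no_petal <- easyregex(except = "Petal")
no_petal
#> [1] "^(?!.*(?:Petal)).*$"

x[grepl(no_petal, x, perl = TRUE)]
#> [1] "Sepal.Length" "Sepal.Width"  "Species"
```

Returning a plain pattern string, rather than filtering directly, would let the result be passed to any function that already accepts a regular expression.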
In practice, the verbosity of the pattern for exclusions is really frustrating; it really raises the question of having some form of except shortcut or some other solution to that.
Turns out part of my frustration was due to the fact that it doesn't work at all with Bayesian models:
library(brms)
model <- lm(Sepal.Length ~ ., data = iris)
parameters::parameters(model, parameters = "^(?!Petal)(.*)$")
#> Parameter | Coefficient | SE | 95% CI | t(144) | p
#> ----------------------------------------------------------------------------
#> (Intercept) | 2.17 | 0.28 | [ 1.62, 2.72] | 7.76 | < .001
#> Sepal.Width | 0.50 | 0.09 | [ 0.33, 0.67] | 5.76 | < .001
#> Species [versicolor] | -0.72 | 0.24 | [-1.20, -0.25] | -3.01 | 0.003
#> Species [virginica] | -1.02 | 0.33 | [-1.68, -0.36] | -3.07 | 0.003
model <- brms::brm(Sepal.Length ~ ., data = iris, refresh = 0)
parameters::parameters(model, parameters = "^(?!Petal)(.*)$")
#> # Fixed effects
#>
#> Parameter | Median | 89% CI | pd | % in ROPE | Rhat | ESS
#> ----------------------------------------------------------------------------------
#> (Intercept) | 2.17 | [ 1.73, 2.62] | 100% | 0% | 1.003 | 3070.00
#> Sepal.Width | 0.50 | [ 0.37, 0.64] | 100% | 0% | 1.004 | 2236.00
#> Petal.Length | 0.83 | [ 0.72, 0.93] | 100% | 0% | 1.001 | 1756.00
#> Petal.Width | -0.32 | [-0.57, -0.08] | 97.38% | 5.95% | 1.002 | 2374.00
#> Speciesversicolor | -0.72 | [-1.12, -0.35] | 99.85% | 0.40% | 1.003 | 1385.00
#> Speciesvirginica | -1.02 | [-1.60, -0.52] | 99.83% | 0.20% | 1.004 | 1378.00
#>
#> # Fixed effects sigma
#>
#> Parameter | Median | 89% CI | pd | % in ROPE | Rhat | ESS
#> ----------------------------------------------------------------------
#> sigma | 0.31 | [0.28, 0.34] | 100% | 0% | 1.000 | 3126.00
#>
#> Using highest density intervals as credible intervals.
Created on 2021-06-06 by the reprex package (v1.0.0)
model <- brms::brm(Sepal.Length ~ ., data = iris, refresh = 0)
#> Compiling Stan program...
#> Start sampling
parameters::parameters(model, parameters = "^(?!b_Petal)(.*)$")
#> # Fixed effects
#>
#> Parameter | Median | 89% CI | pd | % in ROPE | Rhat | ESS
#> ----------------------------------------------------------------------------------
#> (Intercept) | 2.17 | [ 1.73, 2.60] | 100% | 0% | 1.000 | 2763.00
#> Sepal.Width | 0.50 | [ 0.37, 0.64] | 100% | 0.02% | 1.002 | 2092.00
#> Speciesversicolor | -0.71 | [-1.10, -0.33] | 99.92% | 0.32% | 1.007 | 1211.00
#> Speciesvirginica | -1.00 | [-1.51, -0.43] | 99.95% | 0.25% | 1.007 | 1243.00
#>
#> # Fixed effects sigma
#>
#> Parameter | Median | 89% CI | pd | % in ROPE | Rhat | ESS
#> ----------------------------------------------------------------------
#> sigma | 0.31 | [0.28, 0.34] | 100% | 0% | 1.001 | 3172.00
#>
#> Using highest density intervals as credible intervals.
Created on 2021-06-06 by the reprex package (v2.0.0)
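The second reprex suggests that, for brms models, the pattern is matched against the internal parameter names (which carry a b_ prefix) rather than the cleaned names shown in the printed table. Assuming the internal names look like the hand-typed vector below (it is not extracted from a fitted model), the difference between the two patterns can be reproduced with grepl():

```r
# Assumed internal parameter names of the brms model (hand-typed for illustration)
pars <- c("b_Intercept", "b_Sepal.Width", "b_Petal.Length", "b_Petal.Width",
          "b_Speciesversicolor", "b_Speciesvirginica", "sigma")

# No name starts with "Petal", so nothing is excluded
without_prefix <- pars[grepl("^(?!Petal)(.*)$", pars, perl = TRUE)]

# Including the prefix excludes both Petal parameters
with_prefix <- pars[grepl("^(?!b_Petal)(.*)$", pars, perl = TRUE)]

length(without_prefix)
#> [1] 7
with_prefix
```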
I think the easiest thing would be to add two arguments, one for matching a pattern and one for not matching the pattern (or one pattern argument and an argument saying match/anti). That would cover this case well enough.
Yes, that'd be good.
Which makes me think that in line with this https://github.com/easystats/datawizard/issues/4, we should probably settle for a consistent naming of these selection/filtering arguments.
Throughout easystats (e.g., standardize()), we currently use select/exclude for filtering columns, and I think it's good. So for rows, we should have something different, like... keep/except? The problem of filter is that it's a bit ambiguous as to whether it is filtering out or "in". Or subset/except to get closer to base R? Even though subset is ambiguous too imo.
We have had the parameters argument for quite a long time in insight now - though restricted to Bayesian models.
yeah but we could in principle add a parameters_except one, no?
I think if we decide on regular expression patterns, we should not have an "anti" or "except" argument, because that might lead to confusion. If I can already skip parameters using a regexp, what would the regexp for excluding parameters look like? Would it select parameters, because the argument itself is an exclude argument?
This pattern: "^(?!b_Petal)(.*)$" excludes all parameters starting with "b_Petal". However, if we have an exclude argument, does the pattern then need to be "selecting" the to-be-"excluded" parameters, like "^b_Petal(.*)$"?
If we have a logical "anti" argument, we must check for exclusion patterns, like "^(?!b_Petal)(.*)$", and then create a kind of reverse pattern that will be "^b_Petal(.*)$".
I'm open to the idea of having something similar to the select/exclude pairs, but I cannot imagine right now what this would look like if we allow regular expressions.
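For reference, the two patterns mentioned above are exact complements of each other when applied to a vector of names. A quick check, assuming brms-style internal names typed in by hand:

```r
# Hand-typed example names (assumed, not taken from a fitted model)
pars <- c("b_Intercept", "b_Sepal.Width", "b_Petal.Length", "b_Petal.Width", "sigma")

# "Excluding" pattern with a negative lookahead (needs perl = TRUE)
excluding <- grepl("^(?!b_Petal)(.*)$", pars, perl = TRUE)

# "Selecting" pattern for the same parameters
selecting <- grepl("^b_Petal(.*)$", pars)

# Negating the selecting match gives the same result as the lookahead
identical(excluding, !selecting)
#> [1] TRUE
```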
mmh right. If we keep the pair parameters/except, then it seems like parameters would be sort of "prioritized", so we could imagine the following logic? Wrap except in the negative pattern, something like paste0("(?!pattern)"), and assemble the two patterns with OR, so it's either the pattern in parameters OR not what is in except.
I think you're overthinking it @strengejacke. The argument would just be "all matching the pattern" or "all not matching the pattern". So on our end it would just be to ! the result of the pattern. Whether the user uses some of the more elaborate regex structures isn't really an issue.
it would just be to ! the result of the pattern
Right yes we don't even need to wrap in the negative regex pattern!
True! So we have something like parameters_match = c("select", "exclude")?
Yeah, exactly.
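A minimal sketch of that toggle, under the assumption that filtering boils down to grepl() on a character vector of parameter names. The helper name filter_parameters() is made up for illustration; only the parameters_match argument comes from the discussion.

```r
# Hypothetical helper: match a pattern, then optionally negate the result
filter_parameters <- function(x, pattern,
                              parameters_match = c("select", "exclude")) {
  parameters_match <- match.arg(parameters_match)
  hit <- grepl(pattern, x, perl = TRUE)
  # "exclude" simply flips the logical match vector
  if (parameters_match == "exclude") hit <- !hit
  x[hit]
}

x <- names(iris)

selected <- filter_parameters(x, "^Petal")
excluded <- filter_parameters(x, "^Petal", parameters_match = "exclude")

selected
#> [1] "Petal.Length" "Petal.Width"
excluded
#> [1] "Sepal.Length" "Sepal.Width"  "Species"
```

With this approach the user never has to write a negative lookahead; the same plain pattern serves both purposes.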
but instead of a toggle-argument, having two separate arguments parameters/parameters_except is more elegant imo
Hmmm. I disagree. I think "pattern" and "match" (include/exclude) is more intuitive. Especially because I don't think we would want two different patterns operating at once.
but this would require two lengthy arguments to exclude a pattern (parameters = "something", parameters_match = "exclude"), which is very verbose. I think we could deal with dual patterns in parameters and parameters_except by either erroring, warning ("You specified both a pattern to match and to exclude. 'parameters' will be prioritized and 'parameters_except' will be omitted.") or by simply selecting according to parameters first and then excluding according to the exception (which would allow, for instance, to easily exclude patterns from the subset that is selected - e.g., include only those starting with "Petal" and exclude all ending with "Length").
I think this issue is still not thought through to the end.
parameters and a switch argument (that negates the matched pattern, like parameters_match = c("select", "exclude")).
parameters and parameters_except: it's close to select/exclude for other functions - so why do we need two new argument names then? To have a consistent API, we should use the same argument names.
select/exclude usually refer to columns, while in this case, it refers to rows. This could be a reason against using select/exclude.
Maybe we should for now just leave it as it is.
select/exclude usually refer to columns, while in this case, it refers to rows. This could be a reason against using select/exclude.
Yes this was my argument with thinking of easystats/datawizard#4, that we should have distinct names for rows and columns. For the latter we have select/exclude, so keep/except sounded like a good alternative for rows.
I find parameters(model, except = "Petal.*") and parameters(model, keep = "Sepal.*") pretty neat. But then in order to avoid a breaking change we could keep parameters for now (as an alias for keep).
The logic with two arguments is pretty straightforward: first, you keep all the parameters that match keep, and then you keep all parameters that negate (!) except. So there is no conflict if the two are specified at the same time.
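The two-step logic could be sketched as follows. This is only an illustration on a plain character vector; the function name keep_except() is hypothetical, and only the keep and except argument names come from the proposal above.

```r
# Hypothetical helper: first keep what matches `keep`, then drop what matches `except`
keep_except <- function(x, keep = NULL, except = NULL) {
  out <- rep(TRUE, length(x))
  if (!is.null(keep)) out <- out & grepl(keep, x, perl = TRUE)
  if (!is.null(except)) out <- out & !grepl(except, x, perl = TRUE)
  x[out]
}

x <- names(iris)

# Only one of the two arguments
only_except <- keep_except(x, except = "Petal.*")
only_except
#> [1] "Sepal.Length" "Sepal.Width"  "Species"

# Both at once: starts with "Petal", but does not end with "Length"
both <- keep_except(x, keep = "^Petal", except = "Length$")
both
#> [1] "Petal.Width"
```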
I just made a commit with a documentation specification but it turns out that it doesn't work, which leaves me confused:
Created on 2021-06-06 by the reprex package (v1.0.0)
I thought that "^something" is the regex code for everything except, but it doesn't seem to be that. Can you skip specific parameters?