easystats / parameters

:bar_chart: Computation and processing of models' parameters
https://easystats.github.io/parameters/
GNU General Public License v3.0

parameters argument: skip specific parameters from output #522

Closed DominiqueMakowski closed 3 years ago

DominiqueMakowski commented 3 years ago

I just made a commit with a documentation specification, but it turns out that it doesn't work 😬, which leaves me confused:

model <- lm(Sepal.Length ~ ., data = iris)
parameters::parameters(model)
#> Parameter            | Coefficient |   SE |         95% CI | t(144) |      p
#> ----------------------------------------------------------------------------
#> (Intercept)          |        2.17 | 0.28 | [ 1.62,  2.72] |   7.76 | < .001
#> Sepal.Width          |        0.50 | 0.09 | [ 0.33,  0.67] |   5.76 | < .001
#> Petal.Length         |        0.83 | 0.07 | [ 0.69,  0.96] |  12.10 | < .001
#> Petal.Width          |       -0.32 | 0.15 | [-0.61, -0.02] |  -2.08 | 0.039 
#> Species [versicolor] |       -0.72 | 0.24 | [-1.20, -0.25] |  -3.01 | 0.003 
#> Species [virginica]  |       -1.02 | 0.33 | [-1.68, -0.36] |  -3.07 | 0.003

parameters::parameters(model, parameters = "^Sepal")
#> Parameter   | Coefficient |   SE |       95% CI | t(144) |      p
#> -----------------------------------------------------------------
#> Sepal.Width |        0.50 | 0.09 | [0.33, 0.67] |   5.76 | < .001

Created on 2021-06-06 by the reprex package (v1.0.0)

I thought that "^something" was the regex for "everything except something", but that doesn't seem to be the case. Can you skip specific parameters?

bwiernik commented 3 years ago

No, ^ generally means "start of the line". It only means "except" inside square brackets: "[^asdf]" means "any character that is not one of a, s, d, or f".

You want ^(?!Sepal)(.*)$. This is a "negative lookahead". It matches anything that doesn't start with Sepal.

To break that down: ^ means start of a line, and the (?! ) construct is the negative lookahead. It checks that the pattern inside those parentheses (here, Sepal) does not occur at that position. The next () is a capture group indicating what we want to retain from the match. The . means any character and the * means zero or more times, so .* means any number of any characters. The $ means end of line. Together this means "find any strings that don't start with Sepal and return the whole line".
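For anyone who wants to try the patterns above outside of parameters(), here is a minimal base R sketch. The vector of names is copied by hand from the lm() output earlier in this thread, and the variable names are just for illustration. Note that base R's grepl() only understands lookaheads with perl = TRUE.

```r
# Parameter names copied from the lm() output earlier in this thread
params <- c("(Intercept)", "Sepal.Width", "Petal.Length", "Petal.Width",
            "Species [versicolor]", "Species [virginica]")

# "^Sepal" anchors at the start: keeps only names that begin with "Sepal"
starts_with_sepal <- params[grepl("^Sepal", params)]

# Negative lookahead: keeps names that do NOT begin with "Sepal".
# Lookaheads require the PCRE engine, hence perl = TRUE.
not_sepal <- params[grepl("^(?!Sepal)(.*)$", params, perl = TRUE)]

# Inside square brackets ^ negates: first character is anything but "S"
not_s_initial <- params[grepl("^[^S]", params)]

print(starts_with_sepal)  # "Sepal.Width"
print(not_sepal)          # everything except "Sepal.Width"
print(not_s_initial)      # "(Intercept)" "Petal.Length" "Petal.Width"
```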

bwiernik commented 3 years ago

This looks like the function is working correctly. It returned the only parameter that matches your regex.

bwiernik commented 3 years ago

Another thing to note: regex modifiers apply to one character at a time. If you want a modifier (such as * or +) to apply to multiple characters, you need to wrap them in ().
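A small illustration of that point, using made-up strings rather than anything from the model:

```r
x <- c("ab", "abbb", "abab")

# Without parentheses, "+" applies only to the preceding "b"
single <- grepl("^ab+$", x)

# With parentheses, "+" applies to the whole "ab" group
grouped <- grepl("^(ab)+$", x)

print(single)   # TRUE  TRUE FALSE
print(grouped)  # TRUE FALSE  TRUE
```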

bwiernik commented 3 years ago

The secret source of my regex knowledge is https://regex101.com. I test every pattern I write there.

DominiqueMakowski commented 3 years ago

I see, my bad.

We really should come up with some wrapper like easyregex(starts_with = NULL, ends_with = NULL, contains = NULL, except = NULL, ...) for the simple and common use cases ^^
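To make the idea concrete, such a wrapper could be sketched roughly as follows. This is only an illustration under several assumptions: easyregex() does not exist in easystats, the arguments are pasted in as raw regex fragments without escaping, `except` is taken to mean "does not contain", and the `...` from the signature above is left out.

```r
# Hypothetical helper: builds one PCRE pattern from simple building blocks.
# Not part of any easystats package.
easyregex <- function(starts_with = NULL, ends_with = NULL,
                      contains = NULL, except = NULL) {
  pattern <- ""
  # Negative lookahead: the string must not contain `except` anywhere
  if (!is.null(except)) pattern <- paste0(pattern, "(?!.*", except, ")")
  # Positive lookahead: the string must contain `contains` somewhere
  if (!is.null(contains)) pattern <- paste0(pattern, "(?=.*", contains, ")")
  if (!is.null(starts_with)) pattern <- paste0(pattern, starts_with)
  pattern <- paste0("^", pattern, ".*")
  if (!is.null(ends_with)) pattern <- paste0(pattern, ends_with)
  paste0(pattern, "$")
}

params <- c("(Intercept)", "Sepal.Width", "Petal.Length", "Petal.Width",
            "Species [versicolor]", "Species [virginica]")

no_petal <- params[grepl(easyregex(except = "Petal"), params, perl = TRUE)]
petal_not_length <- params[grepl(easyregex(starts_with = "Petal", except = "Length"),
                                 params, perl = TRUE)]

print(easyregex(except = "Petal"))  # "^(?!.*Petal).*$"
print(no_petal)
print(petal_not_length)             # "Petal.Width"
```

Because the result is an ordinary pattern string, it could be passed to any argument that already accepts a regex, as long as the matching is done with a Perl-compatible engine.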

DominiqueMakowski commented 3 years ago

In practice, the verbosity of the pattern for exclusions is really frustrating. It raises the question of whether we should have some form of "except" shortcut or some other solution.

DominiqueMakowski commented 3 years ago

Turns out part of my frustration was because it doesn't work at all with Bayesian models 😅:

library(brms)

model <- lm(Sepal.Length ~ ., data = iris)
parameters::parameters(model, parameters = "^(?!Petal)(.*)$")
#> Parameter            | Coefficient |   SE |         95% CI | t(144) |      p
#> ----------------------------------------------------------------------------
#> (Intercept)          |        2.17 | 0.28 | [ 1.62,  2.72] |   7.76 | < .001
#> Sepal.Width          |        0.50 | 0.09 | [ 0.33,  0.67] |   5.76 | < .001
#> Species [versicolor] |       -0.72 | 0.24 | [-1.20, -0.25] |  -3.01 | 0.003 
#> Species [virginica]  |       -1.02 | 0.33 | [-1.68, -0.36] |  -3.07 | 0.003

model <- brms::brm(Sepal.Length ~ ., data = iris, refresh = 0)
parameters::parameters(model, parameters = "^(?!Petal)(.*)$")
#> # Fixed effects
#> 
#> Parameter         | Median |         89% CI |     pd | % in ROPE |  Rhat |     ESS
#> ----------------------------------------------------------------------------------
#> (Intercept)       |   2.17 | [ 1.73,  2.62] |   100% |        0% | 1.003 | 3070.00
#> Sepal.Width       |   0.50 | [ 0.37,  0.64] |   100% |        0% | 1.004 | 2236.00
#> Petal.Length      |   0.83 | [ 0.72,  0.93] |   100% |        0% | 1.001 | 1756.00
#> Petal.Width       |  -0.32 | [-0.57, -0.08] | 97.38% |     5.95% | 1.002 | 2374.00
#> Speciesversicolor |  -0.72 | [-1.12, -0.35] | 99.85% |     0.40% | 1.003 | 1385.00
#> Speciesvirginica  |  -1.02 | [-1.60, -0.52] | 99.83% |     0.20% | 1.004 | 1378.00
#> 
#> # Fixed effects sigma
#> 
#> Parameter | Median |       89% CI |   pd | % in ROPE |  Rhat |     ESS
#> ----------------------------------------------------------------------
#> sigma     |   0.31 | [0.28, 0.34] | 100% |        0% | 1.000 | 3126.00
#> 
#> Using highest density intervals as credible intervals.

Created on 2021-06-06 by the reprex package (v1.0.0)

strengejacke commented 3 years ago

model <- brms::brm(Sepal.Length ~ ., data = iris, refresh = 0)
#> Compiling Stan program...
#> Start sampling
parameters::parameters(model, parameters = "^(?!b_Petal)(.*)$")
#> # Fixed effects
#> 
#> Parameter         | Median |         89% CI |     pd | % in ROPE |  Rhat |     ESS
#> ----------------------------------------------------------------------------------
#> (Intercept)       |   2.17 | [ 1.73,  2.60] |   100% |        0% | 1.000 | 2763.00
#> Sepal.Width       |   0.50 | [ 0.37,  0.64] |   100% |     0.02% | 1.002 | 2092.00
#> Speciesversicolor |  -0.71 | [-1.10, -0.33] | 99.92% |     0.32% | 1.007 | 1211.00
#> Speciesvirginica  |  -1.00 | [-1.51, -0.43] | 99.95% |     0.25% | 1.007 | 1243.00
#> 
#> # Fixed effects sigma
#> 
#> Parameter | Median |       89% CI |   pd | % in ROPE |  Rhat |     ESS
#> ----------------------------------------------------------------------
#> sigma     |   0.31 | [0.28, 0.34] | 100% |        0% | 1.001 | 3172.00
#> 
#> Using highest density intervals as credible intervals.

Created on 2021-06-06 by the reprex package (v2.0.0)
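The only change from the previous attempt is the b_ prefix in the pattern. The output above suggests that, for brms models, the pattern is matched against the raw parameter names rather than the cleaned names that get printed. The sketch below mimics that with a hand-written vector of names (an assumption for illustration, not pulled from a fitted model):

```r
# Hand-written names imitating brms's raw naming of fixed effects (b_ prefix)
raw_names <- c("b_Intercept", "b_Sepal.Width", "b_Petal.Length", "b_Petal.Width",
               "b_Speciesversicolor", "b_Speciesvirginica", "sigma")

# No raw name starts with "Petal", so nothing gets excluded
without_prefix <- raw_names[grepl("^(?!Petal)(.*)$", raw_names, perl = TRUE)]

# With the prefix, the two Petal terms are dropped
with_prefix <- raw_names[grepl("^(?!b_Petal)(.*)$", raw_names, perl = TRUE)]

print(length(without_prefix))  # 7
print(with_prefix)
```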

bwiernik commented 3 years ago

I think the easiest thing would be to add two arguments, one for matching a pattern and one for not matching the pattern (or one pattern argument and an argument saying match/anti). That would cover this case well enough.

DominiqueMakowski commented 3 years ago

Yes, that'd be good.

Which makes me think that, in line with https://github.com/easystats/datawizard/issues/4, we should probably settle on a consistent naming of these selection/filtering arguments.

Throughout easystats (e.g., standardize()), we currently use select/exclude for filtering columns, and I think it's good. So for rows, we should have something different, like... keep/except? The problem with filter is that it's a bit ambiguous as to whether it is filtering "out" or "in". Or subset/except, to get closer to base R? Even though subset is ambiguous too imo.

strengejacke commented 3 years ago

We have had the parameters argument for quite a long time in insight now - though restricted to Bayesian models.

DominiqueMakowski commented 3 years ago

Yeah, but we could in principle add a parameters_except one, no?

strengejacke commented 3 years ago

I think if we go with regular expression patterns, we should not have an "anti" or "except" argument, because that might lead to confusion. If I can already skip parameters using a regexp, what would the regexp for an exclude argument look like? Would it select parameters, because the argument itself does the excluding?

This pattern: "^(?!b_Petal)(.*)$" excludes all parameters starting with "b_Petal". However, if we have an exclude argument, does the pattern then need to "select" the parameters to be "excluded", like "^b_Petal(.*)$"?

If we have a logical "anti" argument, we must check for exclusion patterns, like "^(?!b_Petal)(.*)$", and then create a kind of reverse pattern that will be "^b_Petal(.*)$".

I'm open to the idea of having something similar to the select/exclude pairs, but I cannot imagine right now what this would look like if we allow regular expressions.

DominiqueMakowski commented 3 years ago

mmh right. If we keep the pair parameters/except, then it seems like parameters would be sort of "prioritized", so we could imagine the following logic?

bwiernik commented 3 years ago

I think you're overthinking it @strengejacke. The argument would just be "all matching the pattern" or "all not matching the pattern". So on our end it would just be to ! the result of the pattern. Whether the user uses some of the more elaborate regex structures isn't really an issue

DominiqueMakowski commented 3 years ago

it would just be to ! the result of the pattern

Right yes we don't even need to wrap in the negative regex pattern!

strengejacke commented 3 years ago

I think you're overthinking it @strengejacke. The argument would just be "all matching the pattern" or "all not matching the pattern". So on our end it would just be to ! the result of the pattern. Whether the user uses some of the more elaborate regex structures isn't really an issue

True! So we have something like parameters_match = c("select", "exclude")?
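A rough sketch of what such a toggle could do internally, assuming it simply negates the match as described above. The helper name filter_by_pattern() is made up for illustration; only the parameters_match argument name comes from this thread.

```r
# Hypothetical helper, not part of parameters or insight
filter_by_pattern <- function(params, pattern,
                              parameters_match = c("select", "exclude")) {
  parameters_match <- match.arg(parameters_match)
  hit <- grepl(pattern, params, perl = TRUE)
  # "exclude" just flips the logical vector; the pattern itself is untouched
  if (parameters_match == "exclude") hit <- !hit
  params[hit]
}

params <- c("(Intercept)", "Sepal.Width", "Petal.Length", "Petal.Width",
            "Species [versicolor]", "Species [virginica]")

selected <- filter_by_pattern(params, "^Petal")
excluded <- filter_by_pattern(params, "^Petal", parameters_match = "exclude")

print(selected)  # "Petal.Length" "Petal.Width"
print(excluded)  # the other four parameters
```

With this approach the user never has to write a negative lookahead for the common case, although nothing stops them from passing one.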

bwiernik commented 3 years ago

Yeah, exactly.

DominiqueMakowski commented 3 years ago

but instead of a toggle-argument, having two separate arguments parameters/parameters_except is more elegant imo

bwiernik commented 3 years ago

Hmmm. I disagree. I think "pattern" and "match" (include/exclude) is more intuitive. Especially because I don't think we would want two different patterns operating at once.

DominiqueMakowski commented 3 years ago

But this would require two lengthy arguments to exclude a pattern (parameters = "something", parameters_match = "exclude"), which is very verbose. I think we could deal with dual patterns in parameters and parameters_except by erroring, by warning ("You specified both a pattern to match and to exclude. 'parameters' will be prioritized and 'parameters_except' will be omitted."), or by simply selecting according to parameters first and then excluding according to parameters_except. The last option would make it easy to exclude patterns from the subset that is selected, e.g., include only parameters starting with "Petal" and exclude all ending with "Length".

strengejacke commented 3 years ago

I think this issue has still not been thought through to the end.

Maybe we should for now just leave it as it is.

DominiqueMakowski commented 3 years ago

select/exclude usually refer to columns, while in this case, it refers to rows. This could be a reason against using select/exclude.

Yes, this was my argument when thinking of easystats/datawizard#4: we should have distinct names for rows and columns. For the latter we have select/exclude, so keep/except sounded like a good alternative for rows.

I find parameters(model, except = "Petal.*") and parameters(model, keep = "Sepal.*") pretty neat. But then in order to avoid a breaking change we could keep parameters for now (as an alias for keep).

The logic with two arguments is pretty straightforward: first, you keep all the parameters that match keep, and then, among those, you keep all parameters that do not match except (i.e., you negate the match with !). So there is no conflict if the two are specified at the same time.
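As a minimal sketch of that two-step logic (filter_rows() is a hypothetical name, and keep/except are only the argument names proposed here, not an existing API):

```r
# Hypothetical helper illustrating the proposed keep/except logic
filter_rows <- function(params, keep = NULL, except = NULL) {
  selected <- rep(TRUE, length(params))
  # Step 1: restrict to parameters matching `keep` (if given)
  if (!is.null(keep)) selected <- selected & grepl(keep, params, perl = TRUE)
  # Step 2: drop parameters matching `except` (if given)
  if (!is.null(except)) selected <- selected & !grepl(except, params, perl = TRUE)
  params[selected]
}

params <- c("(Intercept)", "Sepal.Width", "Petal.Length", "Petal.Width",
            "Species [versicolor]", "Species [virginica]")

only_except <- filter_rows(params, except = "^Petal")
both <- filter_rows(params, keep = "^Petal", except = "Length$")

print(only_except)  # everything that does not start with "Petal"
print(both)         # "Petal.Width"
```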