lionel- opened this issue 9 years ago
Vitalie, do you think the current framework could be arranged to add such new sources of completion?
Everything could be done. It's not particularly difficult. I was holding off on this so far because I was planning to implement a wisent parser for R directly in Emacs, then build such functionality on top of that parser.
It could be done without that, of course. For instance, `ess--funname.start` finds the outer call, and eldoc, ac, and company use its side effect of setting the `ess--funname.start` variable.
Is RStudio's completion system exposed in a package somewhere? If so, we can simply piggyback on that, just as we are relying on R's internal completion right now.
Is RStudio's completion system exposed in a package somewhere? If so, we can simply piggyback on that, just as we are relying on R's internal completion right now.
I was wondering the same. I'll investigate.
I was planning to implement a wisent parser for R directly in Emacs.
Sounds interesting! How does this work? Would it rely on R's gram.y?
What do you plan to accomplish with this? Would this system overlap with all the syntax traveller functions I've implemented for indentation and code filling?
While I was thinking about how to implement this, I thought it'd probably be easier to work at the R level for the piping stuff. For example, evaluate the following code in a special env with our own definitions of ggplot and co:
```r
df %>%
  ggplot(aes(x = , y = )) +
  geom_point(aes(colour = ))
```
This way we don't have to care about how the data comes into contact with the relevant function. I.e., this would automatically handle weird corner cases like this:
```r
aes <- aes(x = var, y = var2)
aes %>%
  ggplot(data = df) +
  geom_point(aes(colour = ))
```
With some safeguards in place to make sure we don't evaluate potentially resource-consuming functions.
For example, evaluate the following code in a special env with our own definitions of ggplot and co:
Sounds interesting, but wouldn't that be too much work? I am not sure it would work. I commonly have a ggplot construct encapsulated into a function `my_plot` and then do `my_plot() + facet_grid(..)`. In such a case, due to lexical scoping, re-defining `ggplot` will not work.
Our task is to infer what is available at some level of computation, right? We can have a small function `cp()` (standing for "current point") which, when called, will climb the dynamic call stack and collect all available variables there. For example:
```r
df %>%
  ggplot(aes(x = , y = )) +
  geom_point(aes(colour = |
```
Here `|` is the current point, which we replace with `cp()`. Now `cp` must be called, and that means you need to close all the parentheses and brackets in order to make it a valid call.
Completing calls into something valid is quite tricky because of lazy evaluation in R. If you place `cp()` into an argument that is not evaluated before the first error occurs, you won't get anything.
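For concreteness, a minimal sketch of what such a `cp()` could look like (illustrative code, not actual ESS internals), along with the lazy-evaluation caveat:

```r
## Sketch only: climb the dynamic call stack and collect the variables
## visible in each calling frame.
cp <- function() {
  vars <- character()
  for (i in seq_len(sys.nframe() - 1)) {
    vars <- union(vars, ls(envir = sys.frame(i)))
  }
  vars
}

f <- function(x) g()   # `x` lives in f's frame
g <- function() cp()   # cp() climbs back through g and f
f(42)                  # returns "x" among the collected names

## The lazy-evaluation caveat: cp() placed in an argument that is never
## forced is never called at all.
h <- function(a, b) a  # `b` is never evaluated
h(1, cp())             # cp() never runs, so no completions
```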
BTW, I still don't understand how RStudio solved the evaluation problem. Assume `df` above is a function `df()` which takes 30 minutes. What do the RStudio folks do about that? Evaluate `df()` and wait for it to complete?
I commonly have a ggplot construct encapsulated into a function `my_plot`. In such a case, due to lexical scoping, re-defining `ggplot` will not work.
For this situation we can detect a ggplot continuation, evaluate the first expression if it's a function call, and get the top-level data from the built plot.
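A built ggplot object carries its top-level data with it, so a sketch of that idea could be (assuming ggplot2; `my_plot` stands in for the user's function):

```r
library(ggplot2)

## Stand-in for a user-defined plotting function
my_plot <- function() ggplot(mtcars, aes(x = disp, y = drat))

p <- my_plot()
names(p$data)  # column names to offer as completions: "mpg", "cyl", ...
```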
We can have a small function `cp()` (standing for "current point") which, when called, will climb the dynamic call stack
Hmm, I think a simpler approach should work.
Assume `df` above is a function `df()` which takes 30 minutes.
I guess the safest course is never to eval functions. Or could we check the time taken by a function and kill it if it's too long? Actually... I'd also be concerned with potential side effects. So I think we should never evaluate unknown functions.
We probably should limit ourselves to inferring completions from top-level data.
BTW, I still don't understand how RStudio solved the evaluation problem.
My guess is that they don't evaluate functions, only objects. I'm looking into how they do it. A good place to start is the following GitHub search: `is:pr completion`.
Their R completion tools are in this file: https://github.com/rstudio/rstudio/blob/master/src/cpp/session/modules/SessionRCompletions.R
And here is a PR providing a bit of context: https://github.com/rstudio/rstudio/pull/191
My guess is that they don't evaluate functions, only objects.
Indeed, objects are looked up with `getAnywhere()`, which only works with names; see https://github.com/rstudio/rstudio/blob/master/src/cpp/session/modules/SessionCodeTools.R
Also, they don't try to resolve the scope for the object and just get it from the top level. I think we should do the same: it's impossible to know where and how a function is called, so we shouldn't try.
Another important file is https://github.com/rstudio/rstudio/blob/master/src/gwt/src/org/rstudio/studio/client/workbench/views/console/shell/assist/RCompletionManager.java
Notable for the `getAutoCompletionContext()` method, which decides which part of the R code is relevant to the completion at hand.
So they are not using R internal completion at all?
In RCompletionManager they are doing the parsing from scratch. If we had a wisent parser we would have all of that out of the box.
Also, they don't try to resolve the scope for the object and just get it from the top level. I think we should do the same: it's impossible to know where and how a function is called, so we shouldn't try.
We need to look in the current environment, which could be a debugger. I think `getAnywhere` doesn't do that. This is why our current completion doesn't work that well in the debugger.
So they are not using R internal completion at all?
No, they completely replaced it in last year's rewrite.
If we had a wisent parser we would have all of that out of the box.
Yes but the syntax motions give us the same kind of functionality. For example it should be easy to substring the continuation at point, check which operators and which functions are used in the continuations, etc. It'd probably be great to have both, but if it's a lot of work we can use functions based on the syntax travellers for the time being.
We need to look in the current environment which could be a debugger.
Indeed.
I think getAnywhere doesn't do that.
I saw somewhere in the code that they try to be aware of the "current frame". I think it's about the debugger. `getAnywhere()` uses the calling frame, which could be the debugging context or a child environment.
As far as I can see, their SessionRCompletions.R is almost self-contained, with the exception of Tools.R.
I think it would be wise to find a way to shamelessly re-use their SessionRCompletions.R. The problem is that it depends on a bunch of C++ routines:
```
tSourceIndexCompletions, token)
SessionRCompletions.R:1342: .Call(.rs.routines$rs_getNAMESPACEImportedSymbols, documentId)
SessionRCompletions.R:1435: .Call(.rs.routines$rs_finishExpression, as.character(string))
SessionRCompletions.R:1470: .Call(.rs.routines$rs_isBrowserActive)
SessionRCompletions.R:1494: .Call(.rs.routines$rs_getActiveFrame, as.integer(n) + offset)
SessionRCompletions.R:1550: .Call(.rs.routines$rs_getKnitParamsForDocument, documentId)
SessionRCompletions.R:2841: .Call(.rs.routines$rs_listInferredPackages, documentId)
SessionRCompletions.R:2847: result <- .Call(.rs.routines$rs_getInferredCompletions, as.character(packages))
```
These in turn depend on the entire RStudio ecosystem. Using compiled code is out of the question anyway; we want ESS to be usable with any R process, locally or remotely.
There are two approaches. We can start modifying that file and effectively hard-fork it. Or, alternatively, automatically pre-process it, keeping only what works; then we could pull new versions of their file before each ESS release.
The actual dispatch happens through the requester's `getCompletions`, but I cannot see how all that stuff is instantiated. In particular, how is the context around point inspected and passed to the underlying R routines? Have you figured that out?
Another dependency is SessionCodeTools.R
The actual dispatch happens through the requester's `getCompletions`, but I cannot see how all that stuff is instantiated. In particular, how is the context around point inspected and passed to the underlying R routines? Have you figured that out?
I think they just pass the surrounding code as a string. In addition they pass the current frame, and that's that. Some of the code to determine the context is written in JavaScript, btw: https://github.com/rstudio/rstudio/blob/master/src/gwt/acesupport/acemode/r_code_model.js
I think it would be wise to find a way to shamelessly re-use their SessionRCompletions.R
I think that'll be difficult. To begin with, we'd need to reimplement precisely how their Java and JavaScript functions snip R code as a string. That'd be a nightmare to maintain.
And I feel my idea of writing dummy versions of `lm()`, `ggplot()` etc. would lead to a simpler design than RStudio's (though I may be completely wrong about that; maybe it won't work).
And what if we want to support something that's not handled by RStudio? For instance they don't have ggplot completions yet.
I think that'll be difficult. To begin with, we'd need to reimplement precisely how their Java and JavaScript functions snip R code as a string. That'd be a nightmare to maintain.
It depends on what they are passing to the underlying functions. If the R functions handle the string then we are done, but it looks like they are parsing the string at the Java level.
Actually, a lot of their code is not needed, like completion for file names. But they do handle knitr and shiny completion.
That'd be a nightmare to maintain.
Yeah. Their code base is huge, too much stuff to keep track of. Let's roll our own but keep an eye on their code. The dplyr thing is here.
And I feel my idea of writing dummy versions of lm(), ggplot()
You want to write dummy versions for all possible model functions?
We can do something much simpler. We can figure out at the elisp level the symbol which provides the evaluation frame (like `df` in `df %>% ...`, `dt` in `dt[...]`, `my_data` in `foocall(..., data = my_data)`, what else?) and pass it along to our R-level completion engine.
I think this is approximate but super simple, and should get 99% of the cases right.
I think this is approximate but super simple, and should get 99% of the cases right.
Yes, let's start with that and see if we need more later. You're right that we shouldn't go overboard and implement something complex to handle the last 1% of cases.
We can't have accurate completions inside functions anyway so the whole stuff is necessarily approximate.
We can also complete the string surrounding point into a valid expression by appending `)]}`. Then pass that to R and parse it to get the tree. Then call `all.vars` on it to get all elementary symbols within that expression. Then inspect whether any of those are defined and are containers, and propose completions from all those containers. This will likely give more completions in some situations, but it will always give at least all the necessary completions. It's also very easy to implement and 99% accurate, with 0 false negatives.
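A rough sketch of that pipeline (the helper below is illustrative, not ESS code; a real version would close delimiters in the right order rather than trying fixed suffixes):

```r
## Try appending closing delimiters until the string parses, then return
## the parsed expression.
complete_expression <- function(code) {
  for (suffix in c("", ")", "))", ")))", "]", "]]", "}")) {
    expr <- tryCatch(parse(text = paste0(code, suffix))[[1]],
                     error = function(e) NULL)
    if (!is.null(expr)) return(expr)
  }
  NULL
}

expr <- complete_expression("mtcars %>% lm(formula = disp ~ dra")
all.vars(expr)  # "mtcars" "disp" "dra"; mtcars is a container, so its
                # column names become completion candidates
```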
This will likely be quite simple and robust, because you can inspect the call tree in R, with things like `%>%` or `[` being uniformly translated into calls:
```
> as.list(quote(dd %>% group_by(...)))
[[1]]
`%>%`

[[2]]
dd

[[3]]
group_by(...)

> as.list(quote(df[, subset]))
[[1]]
`[`

[[2]]
df

[[3]]


[[4]]
subset
```
I think evaluating code (such as `%>%` pipes) is out of the question. It will lead to nightmarish corner cases. I think the most we should do is evaluate the first argument of a function call (but only if it's a symbol) and then provide arguments for that method.
I think evaluating code (such as `%>%` pipes) is out of the question. It will lead to nightmarish corner cases. I think the most we should do is evaluate the first argument of a function call (but only if it's a symbol) and then provide arguments for that method.
I was thinking the same, with additional arrangements for dplyr and ggplot. If you can get it working for the base cases I can implement the latter two.
If you can get it working for the base cases I can implement the latter two.
Ok. I have written the current completion system so let me give it a first stab. Will streamline the code on this occasion.
Can we re-use some of your pieces from indentation to get a reasonably complete expression surrounding point?
You want to write dummy versions for all possible model functions? We can do something much simpler.
Also note that completions should depend on which argument is active. E.g., we only want dataset completions if we are writing a formula, not when we are specifying a glm family etc. And they should be available in subcalls like `I()`.
Can we re-use some of your pieces from indentation to get a reasonably complete expression surrounding point?
Something like `(ess-continuations-bounds)` should be a good starting point (note that `end` is returned as a marker, so you may need a slightly different version; we could add an argument to get it as a position instead).
Let me know if you need something more specific.
we only want dataset completions if we are writing a formula,
It will not work unless you have already typed `~`, but I guess we can live with that.
Something like (ess-continuations-bounds)
Perfect!
Perfect!
Don't speak too soon :)
I added `(ess-climb-outside-call)` for that purpose, and made the `(ess-continuations-bounds)` function return a position by default for `end`, and a marker optionally. Here you go.
With `ess-climb-outside-calls` it works much better indeed, but it still fails with incomplete forms. For example, with `mtcars %>% dplyr::select(mp |` the cursor won't move.
Yes, this was intended. Why should we support broken calls?
Should we rearrange the source a bit?
All syntax travellers and predicates could be moved to `ess-r-syntax.el`, and indentation / code filling to `ess-r-editing.el` (the latter may need a better name).
Yes, this was intended.
Yep. This makes sense for indentation, but for completion we need to give completions even if the trailing `)` is missing. I probably won't need any further adjustments, as I can jump to the previous open delimiter myself.
Should we rearrange the source a bit?
Good idea. But I think indentation should be part of R-mode.el, which in ESS jargon is ess-r-d.el.
Oh, it would be great to rename it to `R-mode.el` or maybe `ess-r-mode.el` as part of the future refactoring of ESS. Those `ess-x-y.el` files are confusing.
Yep. This makes sense for indentation, but for completion we need to give completions even if the trailing `)` is missing.
The problem is with finding the bounds. Where do we stop? If you step outside a complete call this is fine, but what if the top level call is incomplete?
Right. That's the problem. Here is a concrete example: `dt[, { | }]`
This is definitely tricky, because there could be arbitrary nesting inside a data.table call. I think we need to step out of all `{` until we hit a `){` construct, which indicates a function definition.
What kind of use cases do you have in mind?
I can write an `(ess-climb-outside-defun)`, then you can use:

```elisp
(ess-while (or (ess-climb-outside-call)
               (ess-climb-outside-defun)))
```
`(ess-while)` is like `(while)` except it returns `t` when the condition was non-nil at least once. Edit: though in this case you don't need this return value and can use the normal `(while)`.
I added `(ess-climb-outside-defun)`.
I am working on this right now. Fixed the climbing part for now. Will come back on this once I am done with the first phase.
Here is the locally working prototype of the new completion API. It already works for `[`, `~`, and `with` functions. Some rough R-level examples are in the examples file; those should end up as formal tests eventually. If you want to add ideas on what should be completed and when, use this temporary file for that. No connection with Emacs as yet.
As much as I want to finish this, I probably won't have time for it till Monday.
Looks great! I'm going away until Monday, so I'll check it out properly next week. Have a good weekend!
Maybe supporting `within` (similarly to `with`) would make sense. Less sure about `subset()` and `transform()`.
I'm trying to get up to date with the ESS completion code, but I still have some way to go. In the meantime, Vitalie, here is a prototype of what I had in mind, in case you find some ideas useful:
```r
library(magrittr)  # the examples below use `%>%`

is_active_arg <- function(expr) {
  if (!is.symbol(expr)) {
    return(FALSE)
  }
  expr <- as.character(expr)
  grepl("._.$", expr)
}

find_active_sel <- function(expr) {
  if (is_active_arg(expr)) {
    gsub("._.$", "", as.character(expr))
  } else if (is.call(expr)) {
    unlist(lapply(expr[-1], find_active_sel))
  }
}

narrow_completions <- function(active, data) {
  candidates <- names(data)
  matches <- grepl(paste0("^", active), candidates)
  candidates[matches]
}

# Really rudimentary skeleton, but the idea is to avoid arbitrary
# evaluations
check_call <- function(expr) {
  if (expr[[1]] == quote(`%>%`)) {
    is.symbol(expr[[2]])
  } else {
    TRUE
  }
}

eval_completions <- function(expr) {
  if (check_call(expr)) {
    eval(expr, .ess_compl_env)
  }
}

# Environment containing custom functions: lm(), ggplot() etc
.ess_compl_env <- new.env(parent = globalenv())

# Should be generated with a function factory that would handle all
# known modelling functions
.ess_compl_env$lm <- function(formula, data = NULL, ...) {
  formula <- substitute(formula)
  active_sel <- find_active_sel(formula)
  if (!is.null(active_sel) && !is.null(data)) {
    narrow_completions(active_sel, data)
  }
}

### Examples:

# Returns c("disp", "drat")
quote(lm(I(d._.), mtcars)) %>%
  eval_completions()

# Returns c("disp", "drat")
quote(mtcars %>% lm(formula = I(d._.))) %>%
  eval_completions()

# Returns NULL: no arbitrary evaluation
quote(fun_call() %>% lm(formula = I(d._.))) %>%
  eval_completions()

# Returns NULL: no data
quote(lm(I(d._.))) %>%
  eval_completions()

# Returns c("disp", "drat")
quote(lm(disp ~ I(d._.) * cyl, mtcars)) %>%
  eval_completions()
```
I only implemented `lm()`. It checks that the piped-in expression is a plain symbol (so there is no arbitrary evaluation) and that a `data` argument is available. This works with all subcalls such as `I()`.
How about completions from `I()`? Or from `I(foo(boo(data, ....|))`?
I will have a look once I start working on it again, thanks. Unfortunately I am super busy this week but will try to have a glance.
Would be good not to do double work. Either way we will be converging towards the same thing, some kind of parsing of sub-expressions.
I think using an lm handler with the same arguments as lm is not flexible enough. I have in mind something more general. The handler must have access to the entire parse tree and also to the completions generated by handlers further down the parse tree. This allows full control over how sub-expressions are combined and which parts of the expression are evaluated.
How about completions from I()? Or from I(foo(boo(data, ....|))?
They're both handled.
I think using an lm handler with the same arguments as lm is not flexible enough.
Conceptually we just need to indicate which argument gets the additional data and for which arguments these additional completions should be activated. These should be parameters of a factory function.
The handler must have access to the entire parse tree and also to the completions generated by handlers further down the parse tree. This allows full control over how sub-expressions are combined and which parts of the expression are evaluated.
You're right. That would allow chained subsetting to narrow down the completions, right? Good idea. I'll try to see if my proposal can handle that approach.
Would be good not to do double work.
Indeed. But it's good to test different designs early so we can take what works best and is most flexible.
also to the completions generated by handlers further down the parse tree.
We should have this information as a structured object. It would contain the origin of each completion (global env, function bindings, additional data, etc.), so that each handler can decide how to prioritize each type of completion. E.g., `lm()` would display data completions first.
We should have this information as a structured object. It would contain the origin of each completion (global env, function bindings, additional data, etc.), so that each handler can decide how to prioritize each type of completion. E.g., `lm()` would display data completions first.
I think it's a bit premature to discuss implementation details such as these. We don't even have a set of concrete targets; that completion-ideas.R file is all we have. If you have time to spare, enhancing that file would be a great way to speed up the whole process.
In any case, my version is 99% finished on the R side. It's 250 lines of code, as compared to thousands in RStudio, and is probably more powerful because it does not do any ad-hoc parsing. I will finish it, and if you think it could be improved or rewritten, be my guest. Let's do things step by step. I am extremely short on time these days, so I cannot get into many pros/cons discussions.
Should be generated with a function factory that would handle all known modelling functions
This will not fly. There are probably thousands of modelling functions. You propose to generate handlers for all of them? Dispatching on `~` is the only meaningful way to go here.
Let's do things step by step. I am extremely short on time these days, so I cannot get into many pros/cons discussions.
I understand.
This will not fly. There are probably thousands of modelling functions. You propose to generate handlers for all of them? Dispatching on `~` is the only meaningful way to go here.
I disagree. We don't need to handle 1000 modelling functions, just a handful. And dispatching on `~` generally does not make sense: `~` is used for many things besides model formulas. It's a quoting mechanism, just like `'` in Lisp. For example, it's used in purrr to create lambda functions, or in dplyr to create non-standard evaluation calls.
Edit:
You propose to generate handlers for all of them?
You're right that we should have only one model handler function. This is easily achieved: we just need a list of all handled modelling functions and their respective relevant arguments. This list would be consulted by an expression walker to change all model calls to our model handler, as in the sketch below.
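Something like the following (all names are hypothetical, not ESS code):

```r
## Table of handled modelling functions and their relevant argument
model_specs <- list(
  lm  = "formula",
  glm = "formula"
)

## Walk an expression and rewrite calls to known modelling functions into
## a single generic handler, recording which argument matters.
rewrite_model_calls <- function(expr) {
  if (!is.call(expr)) return(expr)
  args <- lapply(as.list(expr)[-1], rewrite_model_calls)
  expr <- as.call(c(expr[[1]], args))
  fun <- as.character(expr[[1]])[1]
  if (fun %in% names(model_specs)) {
    expr[[1]] <- quote(.ess_model_handler)
    expr$.active_arg <- model_specs[[fun]]
  }
  expr
}

rewrite_model_calls(quote(lm(disp ~ d._., mtcars)))
#> .ess_model_handler(disp ~ d._., mtcars, .active_arg = "formula")
```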
I disagree. We don't need to handle 1000 modelling functions, just a handful.
?? So you will be implementing completion for your favorite modelling functions? We need to handle not just core packages but many others. What if I do `lm1 <- lm`? Why would I lose completion because of that?
And dispatching on `~` generally does not make sense: `~` is used for many things besides model formulas.
Right now, on `~` dispatch, I check for a "data" parameter; if it's there, the names are picked up, if not, nothing is completed. This will work in most cases. In some extreme cases (if there are any) the next function up the stack will eliminate the completions from `~`. Or we can introduce a generic that is called inside the `~` handler and dispatched on the first argument of the enclosing function.
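A minimal sketch of that `data` check (the helper name is illustrative; it is restricted to plain symbols here, so the lookup stays safe):

```r
## If the enclosing call has a `data` argument that is a plain symbol,
## complete from the names of the object it refers to.
tilde_completions <- function(enclosing_call, env = parent.frame()) {
  data_arg <- enclosing_call$data
  if (is.null(data_arg) || !is.symbol(data_arg)) return(character())
  data <- get(as.character(data_arg), envir = env)
  names(data)
}

tilde_completions(quote(lm(disp ~ d, data = mtcars)))
#> the column names of mtcars
```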
What I have already implemented is a uniformly more general approach than what you propose. Everything that you can do with explicit "rebinding" I can already do with the handler stack. I can add a handler for any function just like in your approach above, but I can also do much more than that. Handlers can communicate with each other, exchange messages, modify each other's completions, inhibit each other, or even modify the expression that other handlers will see.
An additional advantage of handlers is that they provide a uniform API. In your approach each function will have different arguments. It's not immediately clear what you will do with generics that allow different arguments.
We need to handle not just core packages but many others.
The way I see it, it's not a lot of work to compile a list of function names and arguments.
What I have already implemented is a uniformly more general approach than what you propose.
I think all the features you mention can be accomplished. It seems to me that the approach of using R's internal evaluation makes it easier to do all that, including communication between handlers etc. (though I'm not 100% certain). It's also easy to implement a model handler that works like yours: just have the expression walker check for the `data` argument and replace the call. We have full control as well.
But your approach is also good. It seems a bit more complicated to me, but maybe that's just because I'm not used to handling R expressions this way. Also, maybe I'm not fully appreciating the generality of the use cases you want to handle. And the corner cases of having something other than data piped into a modelling call are not so important, so that shouldn't determine which approach we use.
I'm just a bit disappointed that with your approach data completions won't work without having typed `~` first. I know in advance this will bug me each time I write a formula ;). Do you think you could make it work?
Do you think you could make it work?
Nevermind, you're busy. I'll try to find a way to make this work once you've finished your implementation.
Well, that's a valid concern indeed. But at the same time, you need to complete inside formulas only; in other argument positions completion is not needed. Of course we can always handle the most common cases as you propose. Or we can figure this out automatically somehow. Maybe parse the examples section of the documentation and infer from there whether a function accepts a formula. Or, if a user already had a completion with a formula in the first argument, we start completing on that first argument on subsequent occurrences.
I'm trying to understand your code right now. Can you confirm that in your approach we can use arbitrary combinations of handlers, such as:

```r
# Data subsetted with `[` and subset()
lm(y ~ d(._.), subset(mtcars[1:10], select = c(disp, drat)))
```

It's not immediately clear to me.
Yes. This currently works because the whole `data` argument is evaluated. It's here. I am still not sure if we need to evaluate it, but I guess most users would want it to be evaluated.
The function factory `.ess_comp_make_indata_handler` is similar but more complex, because it allows the argument to be named something other than "data".
The general idea is that you start from the point position and ascend the function calls lexically (`d -> ~ -> lm`), calling the corresponding handler at each step. Each handler knows its position in the parse tree and also receives all the completions generated so far. Thus it can add new completions, modify existing ones, or do nothing. Currently you will get completions from `d`, `~`, and `lm`, in that order, because I didn't bother to restrict any completions so far. Obviously, in the final version, `~` must be the terminal completion.
Ideally the stack should be a composition of handlers. An example of this is Ring middleware, a pattern which is ubiquitous in functional programming. I was planning to do that once the basic handlers are done. You jumped in much too early ;)
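For illustration, a middleware-style composition could look like this (a sketch with made-up handler names, not the actual implementation):

```r
## Each handler takes the expression and the completions accumulated so
## far, and returns a possibly modified set.
compose_handlers <- function(...) {
  handlers <- list(...)
  function(expr, completions = character()) {
    for (h in handlers) {
      completions <- h(expr, completions)
    }
    completions
  }
}

## A `[` handler might add candidates; a terminal `~` handler might
## replace everything that came before it.
bracket_handler <- function(expr, completions) union(completions, "from_bracket")
tilde_handler   <- function(expr, completions) "from_tilde"

stack <- compose_handlers(bracket_handler, tilde_handler)
stack(quote(dummy_expr))
#> [1] "from_tilde"
```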
It's clearly not 99% finished as I was bragging, but you get the idea. Since I've spent so much time on it, let me finish it my way, and then you are free to rewrite it or scrap it altogether if you want.
This currently works because the whole `data` argument is evaluated. It's here. I am still not sure if we need to evaluate it, but I guess most users would want it to be evaluated.
I think we should never evaluate arbitrary calls, or at least make it optional with a default of no evaluation (if it were up to me, I wouldn't even make it an option), so that we avoid evaluating calls that take a long time or have side effects.
Indeed. We will do that. The only safe evaluation is when it comes to evaluating a symbol.
It'd be great to extend the places where auto-completion works in ESS.

We could start by looking for `data = x` arguments and look for the names in `x`. This way, auto-completion would be triggered while writing formulas in `lm()`. In the following example ESS would propose "disp" and "drat":
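For instance, something like this, with `|` marking the point:

```r
lm(d|, data = mtcars)
```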
I think this kind of autocompletion should be triggered inside all subcalls, so that it'd work inside `I()` with `lm()`, or inside `aes()` with `ggplot()`.

To avoid undesired side effects, we could list the arguments where this kind of completion should take place, for example the "formula" argument of `lm()` and `glm()`, or the "mapping" argument of `ggplot()`, etc. This should be defined in a user-customisable variable.

Then we should think about pipelines. ggplot and dplyr are used to such an extent in the R community that they deserve special attention. We should recognise ggplot pipelines and enable `aes()` completions based on the layer-level data if it exists, and on the top-level data otherwise.

With dplyr we have the additional difficulty that the contents may change down the pipeline, so we should check the results at every step. RStudio developers achieved this with good performance by making sure that each dplyr verb works with zero-row data frames. This way we can eval the pipeline with empty data and check the output names at each step with low cost.
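A sketch of that zero-row trick (assuming dplyr; the derived column is illustrative):

```r
library(dplyr)

empty <- mtcars[0, ]                          # zero rows, same columns
step1 <- empty %>% mutate(kpl = 0.425 * mpg)  # cheap: no actual data
names(step1)                                  # includes "kpl": the names
                                              # available after this step
```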
Vitalie, do you think the current framework could be arranged to add such new sources of completion?