Open · danyaPostfactum opened 11 years ago
I've noticed this too, but I'm not sure which is the best way to do it. Changing tokens to be {value, type, start} increases memory usage on build/src/ace.js from 5.6MB to 6.1MB, which is not much, but it would prevent the big memory reduction from https://github.com/ajaxorg/ace/issues/1188. Most of the information in the state is also in the type, so maybe making types more detailed can help:
<language>.<sublanguage>?.<type>
e.g. for js embedded in html:
>var a="xx\n"
html.tag.end
js.keyword
js
js.identifier
js.operator
js.string2.start
js.string2
js.string2.escape
js.string2.end
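For illustration, a minimal sketch of what those tokens could look like if they also carried a start column (the {value, type, start} shape mentioned above); this is an assumed layout, not the current Ace token format:
var tokens = [
    { value: ">",    type: "html.tag.end",      start: 0 },
    { value: "var",  type: "js.keyword",        start: 1 },
    { value: " ",    type: "js",                start: 4 },
    { value: "a",    type: "js.identifier",     start: 5 },
    { value: "=",    type: "js.operator",       start: 6 },
    { value: "\"",   type: "js.string2.start",  start: 7 },
    { value: "xx",   type: "js.string2",        start: 8 },
    { value: "\\n",  type: "js.string2.escape", start: 10 },
    { value: "\"",   type: "js.string2.end",    start: 12 }
];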
The desire for state information is caused by the fact that a tokenizer designed for the task of highlighting (which is what Ace wants) will return tokens that are ambiguous when it comes to performing other tasks.
The issue that brought me here is that when I have a string that spans multiple lines, I need to know where it starts and ends, so that the behavior code knows where a multi-line string like the following begins and ends:
some stuff; a="This is a multi line
string
that ends here."; more stuff
The basic way highlighters work, they'll set the type of everything between quotes to string and be done with it. I've started refining this in my own code by setting the type to string.multiline, which is not expected by the stylesheets but allows my behavior code, and a worker which also uses the tokenizer, to know that it is looking at a multi-line string. In the case of my worker, I also need to know where a multi-line string ends, so I'm about to create a string.multiline.tail type to take care of that case.
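For illustration, a rough sketch of what such rules might look like; the regexes and the qqstring state name are assumptions for the example, not the actual mode code:
// opening rule in the "start" state: a quote that is not closed on this line
{
    token: "string.multiline",
    regex: /"(?:[^"\\]|\\.)*$/,
    next: "qqstring"
},
// rule in the "qqstring" state: everything up to the closing quote
{
    token: "string.multiline.tail",
    regex: /(?:[^"\\]|\\.)*?"/,
    next: "start"
}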
But ultimately, refining types that are designed for highlighting so that they serve other purposes amounts to shoehorning, which is exactly what I've been doing so far with string.multiline. What would really be needed is not necessarily a reference to a state but an additional piece of data, private to the mode and unambiguous. (This private data could even be completely integer-based, if that happens to make a significant difference in memory usage.)
I agree that state information is needed, but how best to store it is unclear. TextMate/Sublime Text have the same problem, and they have chosen to use really long scope selectors carrying every kind of information.
Some of this information might be useful for highlighting too. There are some TextMate themes that use different colors for PHP and JS keywords in the same file, or assign a special color to string.start and string.end. It is also useful for implementing some behaviors in a way independent from modes (e.g. double clicking on string.start to select the whole string).
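For example, a rough, hypothetical sketch of such a behavior, assuming the string.start/string.end token types above and a string that fits on one row:
var Range = require("ace/range").Range;
// Walk the row's tokens and select from the string.start token preceding the
// clicked column up to the matching string.end token.
function selectWholeString(session, row, column) {
    var tokens = session.getTokens(row);
    var col = 0, startCol = null;
    for (var i = 0; i < tokens.length; i++) {
        var type = tokens[i].type;
        if (/string\.start/.test(type))
            startCol = col;
        col += tokens[i].value.length;
        if (/string\.end/.test(type) && startCol !== null) {
            if (startCol <= column && column <= col)
                return new Range(row, startCol, row, col);
            startCol = null;
        }
    }
    return null;
}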
So it might be useful to move some frequently needed information into tokens, and add a slower getState(row, column) method which recomputes the state at that point, starting from the state at the end of the previous line.
Also a related question: some constructs are impossible to parse without keeping a stack of states (e.g. heredocs, string interpolation, Lua comments). Should getState return the whole stack or only the innermost state?
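For illustration, a minimal sketch of such a method built only from existing EditSession/Tokenizer calls (the name getStateAt is made up); the returned value is a plain string today, or an array acting as the stack for modes that push/pop:
// Recompute the state at (row, column) by retokenizing only the requested row,
// starting from the state stored for the end of the previous row.
function getStateAt(session, row, column) {
    var startState = row > 0 ? session.getState(row - 1) : "start";
    var line = session.getLine(row).substring(0, column);
    var tokenizer = session.getMode().getTokenizer();
    var result = tokenizer.getLineTokens(line, startState);
    return result.state; // string, or an array when the mode keeps a stack
}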
Regarding stacks, my specific case would not benefit from merely having access to a state stack. To benefit from a stack, it would also have to be possible to pass custom state objects to the tokenizer, because my states are more complicated than what the tokenizer currently expects.
Regarding the larger issue, it would suffice for my purposes to be able to encode a rule as follows:
{
    token : "string", // Purely for styling
    regex : '".*',
    mode_data : "multiline string", // Only for the mode and its associated code (behavior, worker, etc.)
    next : state + "_qqstring"
}
And tokens created from that rule would be { type: "string", mode_data: "multiline string", value: [... whatever ...] }. mode_data would be additional data that no one would care about except the mode itself. In this way, the token field of the rule stops doing double (or triple) duty.
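Hypothetical usage, assuming the tokenizer were extended to copy mode_data onto the tokens it emits:
// e.g. in a behavior or worker: look at the token under the caret and act on
// the mode-private data instead of parsing the styling-oriented type string.
var token = session.getTokenAt(row, column);
if (token && token.mode_data === "multiline string") {
    // handle the multi-line string case
}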
Textmate/Sublime build a token tree, so tokens are nested. This is very useful for css and html modes but does not seem to be useful for other modes. I think a nested tokens implementation is too complex and has no use in programming language modes. It is also unclear whether we should nest tokens in the renderer DOM or just use this feature internally (without renderer modification).
The parseTree idea is interesting (https://github.com/zenparsing/codearea/blob/master/src/ParseTree.js) but I agree it's too complex. Maybe we should just recompute state when needed, similar to what CodeMirror does, something like https://github.com/ajaxorg/ace/tree/state_at_column?
@danyaPostfactum what type of nested token tree did you have in mind?
When answering I thought you meant something like an AST, where tokens ({type, value} pairs) are kept in a tree instead of an array.
But now I realize there is another way to build a token tree: by making token.type values objects, and adding parent/child links to them.
Something like
function Scope(parent, name, state) {
    this.name = name;
    this.state = state || "start";
    this.children = Object.create(null);
    this.parent = parent;
}
(function() {
    this.getChild = function(name) {
        return this.children[name] || (this.children[name] = new Scope(this, name));
    };
    this.getMode = function() {
        // return the closest ancestor that has a mode
    };
    this.getClass = function() {
        // return a string used for styling
    };
    this.is = function(name) {
        // return true if name is this scope or one of its ancestors
        var scope = this;
        while (scope) {
            if (scope.name == name)
                return true;
            scope = scope.parent;
        }
        return false;
    };
}).call(Scope.prototype);

var root = new Scope(null, "_");
var xml = root.getChild("xml");
var tag = xml.getChild("tag");
tag.getChild("attribute").is("tag"); // true
so that we get a nested structure for token types:
_ {
    languageName {
        start {
            identifier {}
            keyword {}
            number {}
        }
        string {
            escape {}
            interpolation {
                identifier {}
                keyword {}
                number {}
                start {
                    identifier {}
                }
            }
        }
    }
}
but tokens retain the form [{value: string, type: Scope}, ...].
This will make the logic for push/pop in modes much simpler, and should add very small overhead, since only a few Scope objects are created for each mode. The only way to generate lots of Scopes is to have deeply nested string interpolation, but that's not very common, and even that can't create more objects than there are tokens in the document.
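For illustration (hypothetical, building on the Scope sketch above): push becomes getChild and pop becomes following the parent link, so no state strings or arrays need to be assembled:
var js = root.getChild("js");
var string = js.getChild("string");                   // "push" on an opening quote
var interpolation = string.getChild("interpolation"); // "push" on interpolation start
interpolation.parent === string;                      // "pop" just follows the link
interpolation.is("string");                           // true, handy for behaviors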
My use case for state information would be line and column numbers for each token; I already have an AST parser à la JavaCC, and I have the full AST node graph in memory, complete with line/col numbers.
Rather than reinvent and maintain my highlighter with duplicate regexes for every language feature, I want to be able to just tokenize all text and punctuation, then ask my in-memory AST what type each token is.
No amount of extra state information in type tokens would help w/ this particular case.
@JamesXNelson if you do not want to create an Ace highlighter, how do you want to be able to "just tokenize all text and punctuation"?
I already have it all tokenized via a JavaCC compiler running in GWT.
I already have a compiled AST, so I can transform, query and work with my language AST... I have a BNF for the language, as well as a compiler that gives compile-time type warnings. Those type warnings do not break the compile, but the parser will puke as soon as the syntax is invalid. Since the parser will fail if the source contains a grammar error, it's actually better that I had to do the work of inspecting the source twice, because the syntax highlighting would otherwise error out after the invalid section.
So, when I get the time to make the parser fault-tolerant, so it can still parse all remaining well-formatted source code, I will likely want to figure out how to replace the tokenizer with one that just asks my parser/compiler "what are the token types and ranges for the source on line 152".
Until that time, I will stick with my highlighting parser, since I finally made an API for it that handles push and pop in a way that did not erode my sanity (sorry, I was using an old version of Ace, then upgraded, and had a hell of a time getting state stacks to work). I now have something that is functional, readable and maintainable by someone other than myself, so there is no pressing need at the moment.
If there is any documentation or recommendations you might have on how to sub out the tokenizer, I would greatly appreciate it. I intend to run the compiler in a shared worker as well, so a compiled program can live longer than a single tab, but that is a fair way down the road from now.
This would be awesome to have. Another use case is embedded highlighters. An embedded highlighter might use the same token name as the outer highlighter: for example, the SQL highlighter uses the token name text, which is fairly ambiguous. If state were available, one could give one set of autocompletions for a text token when in the embedded SQL highlighter and another when outside of it.
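A rough sketch of the idea using what is available today at row granularity via session.getState(row); the "sql" state-name prefix is an assumption about how a mixed mode might name its embedded states, and sqlCompletions/outerCompletions are placeholders:
var completer = {
    getCompletions: function(editor, session, pos, prefix, callback) {
        var state = session.getState(pos.row);
        // assumed naming convention for the embedded SQL states
        var inSql = typeof state == "string" && state.indexOf("sql") === 0;
        callback(null, inSql ? sqlCompletions : outerCompletions);
    }
};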
This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.
It is very useful to have a state for each token, for example inside behaviours. We often need to know the state instead of the token type, because we can have state-dependent behaviour. Would it be too expensive?