Open vmarkovtsev opened 6 years ago
Comments should have the character used, prefix and suffix in the semantic UAST "Comment" object. For strings, at least in the Python and Ruby drivers, unfortunately the native AST doesn't provide the string type so this won't be possible for all drivers unless we parse the source code ourselves.
I'll leave this open just in case we find a workable solution in the future.
The current workaround is simple: I look at the difference between `file_contents[start_position.offset:end_position.offset]` and `Token`, and record the prefixes and suffixes.
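A minimal sketch of that workaround, assuming the offsets index directly into the string that holds the file contents. The function name `split_token` and the sample offsets are hypothetical, chosen only for illustration:

```python
def split_token(file_contents, start_offset, end_offset, token):
    """Return (prefix, suffix) around `token` inside the node's source span.

    Returns None when the token is not found verbatim in the span, e.g.
    because the driver normalized escape sequences.
    """
    raw = file_contents[start_offset:end_offset]
    index = raw.find(token)
    if index < 0:
        return None
    return raw[:index], raw[index + len(token):]


source = 'x = "hello"  # greet\n'
# Hypothetical positions for the string literal and the comment.
string_parts = split_token(source, 4, 11, "hello")
comment_parts = split_token(source, 13, 20, "greet")
print(string_parts, comment_parts)
```

If the driver reports byte offsets instead, the same slicing would have to be done on the encoded bytes.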
Token as a concept won't work in the long run, so I think we should provide a helper that selects source file content based on the positions of nodes, as @vmarkovtsev mentioned.
For example, what is the token of `do ... while`? This will get more and more complex once we start working with semantic concepts for classes.
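Such a helper could look roughly like the sketch below. The `Position` and `Node` classes are hypothetical stand-ins rather than the real UAST types, and the offsets are assumed to be byte offsets into the UTF-8 encoded file:

```python
from dataclasses import dataclass


@dataclass
class Position:
    offset: int  # assumption: byte offset into the encoded file


@dataclass
class Node:
    start_position: Position
    end_position: Position


def node_source(content: bytes, node: Node) -> str:
    """Return the exact source text covered by `node`."""
    start = node.start_position.offset
    end = node.end_position.offset
    return content[start:end].decode("utf-8")


code = "int i = 0;\ndo { i++; } while (i < 3);\n".encode("utf-8")
# Compute the span of the loop statement instead of hard-coding offsets.
start = code.index(b"do {")
end = code.rindex(b";") + 1
loop = Node(Position(start), Position(end))
print(node_source(code, loop))
```

With this approach a `do ... while` node needs no token at all: its text is whatever the positions cover.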
They work pretty well... for identifiers and literals. For statements and reserved words, as you proved, they're problematic (the same happens with "from x import y" in Python, which is a single node with children).
Maybe we should make a distinction between a token and a representation.
The token is something that exists in the source code. Egor mentioned a few times that he expects tokens to be valid for all node types, which cannot be the case with the current model.
I would rather go with semantic concepts, so Comments have text, prefix, etc., and String (literal) has a value and quotes. Tokens can be provided with positional info. Since UAST v2 allows more than 2 positional fields, we can define a few more to represent the start/end positions of different keywords in the statement.
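To make the idea concrete, here is a rough sketch of what such semantic objects might carry. The class and field names are hypothetical and not the actual UAST v2 schema, and escaping inside string literals is ignored:

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str         # inner text, without comment characters
    prefix: str       # e.g. "#", "//" or "/*"
    suffix: str = ""  # e.g. "*/" for block comments

    def source(self) -> str:
        return f"{self.prefix}{self.text}{self.suffix}"


@dataclass
class String:
    value: str  # inner value, without quotes
    quote: str  # quote characters used in the source

    def source(self) -> str:
        return f"{self.quote}{self.value}{self.quote}"


print(Comment(" note ", "/*", "*/").source())
print(String("hello", '"').source())
```

Keeping the prefix, suffix and quote characters as separate fields preserves the inner value while still letting the original text be reassembled.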
Even with semantic objects, it would be nice to keep the concept, either as a single unified name or as some kind of field metadata, so that XPath queries don't have to match every semantic object and retrieve a differently named field from each. That is what happens now, as @smacker said the other day.
Strings, comments, etc. have their `Token` set to the inner value of the token. E.g. in Python `"hello"` has Token `hello` (no quotes). This is all good and logical.
However, we discard information about the real token - quote characters, comment characters, etc. It is needed to reproduce the original source code from a UAST. I have two possible solution proposals: