[feature] Include the information about full tokens

bblfsh / sdk

Babelfish driver SDK

GNU General Public License v3.0

[feature] Include the information about full tokens #291

Open vmarkovtsev opened 6 years ago

vmarkovtsev commented 6 years ago

Strings, comments, etc. have their Token set to the inner value of the token. E.g. in Python "hello" has Token hello (no quotes). This is all good and logical.

However, we discard information about the real token - quote characters, comment characters, etc. It is needed to reproduce the original source code from a UAST. I have two possible solution proposals:

Add "FullToken" for those nodes which need it.
Add "TokenPrefix" and "TokenSuffix".

juanjux commented 6 years ago

Comments should have the character used, prefix and suffix in the semantic UAST "Comment" object. For strings, unfortunately, the native AST (at least in the Python and Ruby drivers) doesn't provide the string type, so this won't be possible for all drivers unless we parse the source code ourselves.

I'll leave this open just in case we find a workable solution in the future.

vmarkovtsev commented 6 years ago

The current workaround is simple: I look at the difference between file_contents[start_position.offset:end_position.offset] and Token and record prefixes and suffixes.
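A minimal sketch of that workaround, for illustration only. The function name split_token is hypothetical, and it assumes the offsets are character offsets into the already-decoded source string (a driver may report byte offsets instead) and that Token appears verbatim inside the slice:

```python
from typing import Optional, Tuple


def split_token(source: str, start: int, end: int, token: str) -> Optional[Tuple[str, str]]:
    """Return the (prefix, suffix) surrounding `token` inside source[start:end].

    Hypothetical helper: `start` and `end` are assumed to be character offsets.
    """
    full = source[start:end]
    index = full.find(token)
    if index < 0:
        # Token is not a verbatim substring, e.g. escape sequences were decoded.
        return None
    return full[:index], full[index + len(token):]


source = 'greeting = "hello"  # say hi\n'

# String literal: the node spans the quotes, the Token is the inner value.
str_start = source.index('"')
str_end = str_start + len('"hello"')
print(split_token(source, str_start, str_end, "hello"))

# Comment: the node spans "# say hi", the Token is assumed to be "say hi".
com_start = source.index("#")
com_end = com_start + len("# say hi")
print(split_token(source, com_start, com_end, "say hi"))
```

When the Token has been unescaped or otherwise normalized, the lookup fails and the helper returns None, so the caller has to fall back to the raw slice.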

dennwc commented 6 years ago

Token as a concept won't work in the long run, so I think we should provide a helper that selects the source file content based on node positions, as @vmarkovtsev mentioned.

For example, what is the token of do ... while? This will get more and more complex once we start working with semantic concepts for classes.

juanjux commented 6 years ago

They work pretty well... for identifiers and literals. For statements and reserved words, as you proved, they're problematic (the same happens with "from x import y" in Python, which is a single node with children).

Maybe we should make a distinction between a token and a representation.

dennwc commented 6 years ago

A token is something that exists in the source code. Egor mentioned a few times that he expects tokens to be valid for all node types, which cannot be the case with the current model.

I would rather go with semantic concepts, so Comments have text, prefix, etc. and String (literal) has a value and quotes. Tokens can be provided with positional info. Since UAST v2 allows more than 2 positional fields, we can define a few more to represent the start/end positions of the different keywords in a statement.
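To make the idea concrete, here is a rough sketch of what such semantic objects could look like. Every class and field name below (Span, Comment, StringLiteral, DoWhile, do_keyword, while_keyword) is made up for illustration and is not the actual UAST v2 schema; offsets are assumed to be character offsets, and string escaping is ignored:

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical positional field: a start/end offset pair."""
    start: int
    end: int

    def text(self, source: str) -> str:
        return source[self.start:self.end]


@dataclass
class Comment:
    text: str
    prefix: str
    suffix: str

    def full_token(self) -> str:
        return self.prefix + self.text + self.suffix


@dataclass
class StringLiteral:
    value: str
    quote: str

    def full_token(self) -> str:
        # Ignores escaping; a real implementation would have to re-escape the value.
        return self.quote + self.value + self.quote


@dataclass
class DoWhile:
    span: Span           # the whole statement
    do_keyword: Span     # extra positional field for "do"
    while_keyword: Span  # extra positional field for "while"


source = "do { i++; } while (i < 10);"
while_at = source.index("while")
node = DoWhile(
    span=Span(0, len(source)),
    do_keyword=Span(0, len("do")),
    while_keyword=Span(while_at, while_at + len("while")),
)

print(Comment(text=" say hi", prefix="#", suffix="").full_token())
print(StringLiteral(value="hello", quote='"').full_token())
print(node.do_keyword.text(source), node.while_keyword.text(source))
```

With this shape, the statement has no single Token at all; each keyword is recovered from the source through its own positional field.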

juanjux commented 6 years ago

Even with semantic objects, it would be nice to keep the concept, either as a single unified name or as some kind of field metadata, so that XPath queries don't have to match every semantic object and retrieve a different field from each, which is what happens now, as @smacker said the other day.
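One possible reading of the "field metadata" option, sketched with plain dicts standing in for nodes. The TOKEN_FIELD table, the type names and the token_of function are all hypothetical and do not correspond to anything in the SDK:

```python
from typing import Optional

# Hypothetical metadata: which field of each semantic type plays the role of "token".
TOKEN_FIELD = {
    "Comment": "text",
    "String": "value",
    "Identifier": "name",
}


def token_of(node: dict) -> Optional[str]:
    """Return the token-like field of a node, or None if its type has none."""
    field = TOKEN_FIELD.get(node.get("type"))
    return node.get(field) if field else None


nodes = [
    {"type": "Comment", "text": " say hi", "prefix": "#"},
    {"type": "String", "value": "hello", "quote": '"'},
    {"type": "Identifier", "name": "greeting"},
    {"type": "DoWhile"},  # a statement: no single token applies
]

print([token_of(n) for n in nodes])
```

A query layer could consult the same table, so a single expression would cover all semantic types instead of one match per type.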