Open vmarkovtsev opened 5 years ago
@vmarkovtsev Can you please elaborate what you propose exactly?
Because if you want to find imports, you don't need roles in Semantic. As you may know, you can just search for a node type. Same for an import name - if you don't need a string for a name but any subtree that corresponds to it, you can use a Name
field of an Import
.
But if you need to extract an import name as a string, things get more complicated regardless of the language and the roles won't help you.
I'm not sure why you refer to Java specifically, but maybe it's less typical for Java to use import aliases? It's not the case for Python, JS and a few others, however. So instead of a simple string import name, you will get two names - an alias. And we on Babelfish side cannot know if you want one string (references resolution) or another (dependency resolution). And it will be basically the same mess with roles.
Instead, we provide a helper that extracts the import name and concatenates it to a single string. For Python client, for example, Konst had the same kind of function. But we can add it to the client itself. Would it solve your issue?
I propose adding roles to the Python driver output, just like it already happens with Java. They are much easier to work with than doing alchemy with parent types, etc. That alchemy is not consistent across languages and is structure-dependent, so I end up splitting the logic between different languages, which kills many benefits of UASTs for me. On the other hand, doing a linear scan over the roles always worked before, and I don't see the reasons why it shouldn't for 99.(9)% of our mining use-cases.
You can use a
Name
field of anImport
Yeah, but for some languages, it is name
instead of Name
, for some, it is RuntimeImport
instead of Import
or even some clever indication that the file is included and not imported. Comments carry the value in Text
and sometimes text
, and there are also keywords which carry the value in @token
. Oh, and there are ofc Value
and value
for literals, if I remember correctly.
I have to normalize all of that zoo to the format that we can actually consume: the type of the entity (roles), the value of the entity and the span in the file. That format may be (very) lossy, but we don't care in 99.(9)% of our mining tasks.
Hence the Java driver's output is way better for me than Python's.
The helpers would not help here because I am currently working on a project that requires the unified data conversion for all the languages supported by Babelfish and serializing the results for unknown future tasks. E.g., semantic code search.
As you see, the problem is not with imports specifically. It is quite broad and covers all the entities we want to handle:
but for some languages, it is
name
instead ofName
This is either not true or is a severe bug. Can you please give a link to a UAST in Semantic mode or a source file that produced this?
... it is
RuntimeImport
instead ofImport
or even some clever indication that the file is included and not imported.
Right, for this specific case we can add roles. The intention is to have a schema that describes that RuntimeImport
is a special case of an Import
, but it's not that easy to describe it that way in a tree structure. We will add a role on all Import
-derived nodes for now.
Comments carry the value in
Text
and sometimestext
Again, this cannot happen in Semantic. You either work with a mix of Annotated and Semantic, or you found a bug where we don't convert a native AST subtree properly to Semantic UAST. Property names must always be the same in Semantic for the same type of a node (even derived ones).
the type of the entity (roles)
The type of an entity is a @type
for Semantic. Yes, for other nodes it makes sense to use @roles
since those nodes are not yet normalized. But in general @roles
are pretty vague, and it's a bit strange for me that you want to use a normalized structure, but insist on using roles. They don't have a good definition and may mean different things across languages (e.g. imports act differently). For me, it defeats the purpose of UAST, since you cannot associate the type with a specific implementation of imports. But if you okay with imprecise information for this project, I wonder why do you prefer Semantic over Annotated? Only because of the structural normalization?
the value of the entity
You mean the token, right? Or the unescaped/normalized value (e.g. import name, string literal value, etc)? Because those are different and have different assumptions.
E.g., semantic code search.
You can't rely on roles for semantic code search. We need to talk more about this specific use case. Specifically: why the current structure is not good enough. Assuming there is no "zoo" with name
vs Name
(it's probably a bug), I guess the question is only related to the lack of "RuntimeImport
is a subset of Import
" information, correct?
covers all the entities we want to handle
We need to have a proper call to discuss this. Specifically, this list includes a list of semantic concepts (e.g. functions) and syntactic concepts (e.g. keywords). I wonder why you need a mix of those. Not saying that it useless, but a good description of the use case may help us a lot.
This is either not true or is a severe bug. Can you please give a link to a UAST in Semantic mode or a source file that produced this?
I keep hitting these bugs over and over again. I don't have the bandwidth to file issues in each particular case, because their count is huge. May I suggest that you extract 10,000 random files from, say, PGA.v2, in each language, then put all the dictionary keys in a set and compare them with each other. Actually, the UASTs are already extracted, they are in /spark_storage/pga.v2/*/*.parquet
on the ML cluster. There is also a copy in Google Cloud.
Right, for this specific case we can add roles.
:+1: Thanks. However, my issue is about having roles in Python at all. Currently, the only place where I see the roles is the root node TBH.
The type of an entity is a @type for Semantic.
Nope. Example: uast:Identifier
inside uast:Import
in Go. My own, mining-oriented type of the entity is a "part of the import". The same in Python: the leaf has type uast:Identifier
, so I have to climb up to understand that we are inside an import statement.
I wonder why do you prefer Semantic over Annotated?
My current goal is reaching exactly the same structure for all the languages, exactly the same types, roles, etc. for all my simplistic targets. I don't want to specify the language when I query for imports, for example. I want to scan over the roles and get the results. I want to be lossy, none of my goals involve the backwards transform. "Destructive normalization". Deleting everything which cannot be expressed in a non-standard way, deleting outlying language features, reaching a state at which all the languages look exactly the same.
We do have another task with reconstructing the code in style-analyzer, but it has always been very language-dependent anyway. We used Annotated for that task and it worked.
You mean the token, right? Or the unescaped/normalized value (e.g. import name, string literal value, etc)? Because those are different and have different assumptions.
I mean everything at once. Any text or number associated with the node. The imports consist of several parts; in my ideal world, the parts are concatenated with a ".". Even if the language uses "/" or any other way to join the parts together, the normalized strings would be dotted.
But that's my ideal world. Currently, I am happy to get the raw token value without the quotes which is language-dependent.
The comments would be left-deindented, and stripped from left and right.
The literals would be stripped from quotes. Integers may be left unnormalized, just like they exist in the code.
Identifiers would be also merged by a dot. Ideally, I mean. In C++ you may have "::", in other languages, other ways exist. Normalizing everything would be so damn great. Even if it becomes harder to distinguish what is a namespace and what is a class name. I believe there can be workarounds with ignoreable precise tree augmentation.
You can't rely on roles for semantic code search.
In 99% of the use-cases, I can, and they would work perfectly. I am talking about simple, short queries. I built the schema, and I tried to use it, and it would work perfectly if the roles were assigned everywhere. That being said, I found the workaround with tracking nested @type
-s, so I am not blocked or anything.
We need to have a proper call to discuss this.
Yep, that's on my list!
Thanks for an elaborate reply, it really helps with a context of the issue!
my issue is about having roles in Python at all
Hmm, it is surprising that it has ~none. We will double-check it.
"Destructive normalization". Deleting everything which cannot be expressed in a non-standard way
Thanks, I now understand the use case you have. Yes, we don't even support that kind of normalization currently. I think it's worth adding a new transformation target for it. We definitely can do that in a short-mid term, but we need to discuss the details. Devil is always in the details.
But that's my ideal world.
We are not that far away from it, actually :) Yes, we currently may omit some nodes because we don't have a schema for it in Semantic. But, assuming the destructive normalization, we may ignore those inconsistencies and force the transformation anyway. Most of the normalization you mentioned (comments, string literals, qualified identifiers) is already performed in all drivers. So defining a new transformation target should solve this.
I built the schema
It would be great to see the schema as well during the call - this way we will know what is missing from our own schema or where we need a bit more abstraction.
You can't rely on roles for semantic code search.
In 99% of the use-cases, I can, and they would work perfectly.
I guess it depends on what you mean by semantic code search. But tracking nested types is definitely a better approach than roles. At least we can guarantee that those types won't collide as roles do. It may not be ideal still (e.g. missing information about RuntimeImport
s), but I think it's the way forward.
If you think about alternatives, the schema provided by Semantic (assuming destructive normalization) is the same as roles: e.g. role=Import > role=Name
is the same as type=Import -Name->
. But the approach we use with Semantic will work with arbitrary JSON-like objects regardless of where they come from. It may be important in the long run when we'll start pulling data from different sources.
Hmm, it is surprising that it has ~none.
Well, not really 0. Alex came to me 3 hours ago, confused, and proved that some nodes do have roles :smile: Imports and functions don't. E.g. the fizzbuzz example on play.bblf.sh.
It would be great to see the schema
That's no secret anyway.
CREATE TABLE uasts (
id Int32,
repo String,
lang String,
file String,
line Int32,
parents Array(Int32),
pkey String,
roles Array(Int16),
type String,
uptypes Array(String),
value String
) ENGINE = MergeTree() ORDER BY (repo, file, id);
id
- unique index in depth-first traversal order.parents
- the indexes of the parent nodes, in-order.pkey
- the key in the parent node. E.g. an index 0-1-2 if it is an array or "Path", "Names" if it is an object.uptypes
- the types of the parents, in-order. This competes with roles and serves the same goal.value
- the logical "value" of the node, just as I described before.I throw away all the nodes which have an empty value
or do not have a line number. So parents
references only the "visible" nodes. Alex says that this is equivalent to leaving the tokens only. That's not far from the truth.
@vmarkovtsev Thanks for sharing the schema! To be honest I'm a bit confused about the details.
Why parents
is an array? The node has a single parent only, assuming a tree. Also, if it does have multiple parents (assuming a graph) why pkey
is a single entry instead of an array as the parent
field is?
Also, is there a specific reason to copy the v1 XML-like structure (parent -> child, field
) instead of v2 JSON-like structure (parent, field -> child
)? Remember you promised us to use the new structure in new projects? ;)
parents
is the sequence of parents: parent, parent of the parent, parent of the parent of the parent, etc. pkey
is the key in the immediate parent. However, I don't save all the parents! Only those which are visible.
I need this structure to answer queries like "calls to foo inside function bar" fast. There are already 200 billion rows, so it has to be very simple. Imagine that ML generates ~100 function names which I scan on the first pass, then I do the second pass. Thus I avoid traversing the tree for the business use-cases. Ofc every simple linear schema is not good for storing graphs losslessly, but it works really well for what we want.
Ah, I see. Thanks for clarifying it.
Thanks for posting the schema, that's really helpful for context. One key thing I notice in this schema is that you don't really need (or want) the whole UAST, only certain parts of it. That's good, because efficiently querying a fully-squashed graph structure from a flat table requires knowing not just what parent/child relationships there are, but what their "meaning" is. Otherwise you have to resort to repeated scans to discover the nested structure.
Once you know which relationships matter, you can denormalize those so that they don't need repeated scans to recover the substructure. The schema above pretty clearly relies on the fact that some tool has done this flattening ahead of time (I assume something written using the library or XPath or some combination?)
(As an aside: This is the reason why generic graph databases are so complicated: To make efficient queries over graph structures without a priori denormalization takes some cleverness, and it will ~never be as fast as a custom-designed flat store for a query shape we know in advance).
Even with normalizations applied, UASTs do not have exactly the same structure across languages. So I think if we want to turn this prototype into a real product, we're going to need to be more explicit about which tree structures we want to flatten. I know, for example, that imports are interesting, and you've mentioned things like "call sites occurring in a function body". To make that reproducible across languages, we can't rely on naïvely grabbing parent-child relationships from the raw tree (even after transformations).
In some cases we might be able to hack it with XPath queries, but we've all seen that XPath is very sensitive to small changes in structure. I'd like to get away from having tools rely directly on the tree structure unless they really need nested syntax: For flat statistical aggregations like imports the answer is probably library functionality.
Since this is experimental I'm not too worried about the brittleness of the XPath mapping yet. But I'd like to be clear that XPath isn't a sustainable way to build this for a real tool, unless we commit to never changing anything about the exact structure each driver emits. Some combination of https://github.com/src-d/feature-idea/issues/153 and library functions for specifically-defined denormalizations seems like the right path forward.
The most of the semantic nodes do not have the assigned roles. This is very inconvenient for the majority of our mining tasks, because we do not want to write complex XPaths which take the structure into account. E.g. to get the imports, we need to take
uast:Identifier
-s which areName
-s underuast:RuntimeImport
, We want the way it works for Java.