Question on graph node labels (python dataset version)

CoderPat / structured-neural-summarization

A repository with the code for the paper with the same title

MIT License

74 stars 26 forks source link

Question on graph node labels (python dataset version) #4

Closed ioana-blue closed 5 years ago

ioana-blue commented 5 years ago

I know the python dataset was not used for the paper. Still, thanks for sharing that code as well. I have a question about node labels. Looking at the AST visitor, for the binary operator Not there is a node with label (i.e., single space) being created.

First, why is that node necessary/useful?

Second, the label will be stripped down and it becomes nothing, and then later on when the vocabulary for the nodes is loaded, it complains because it finds (an unexpected) empty line.

Third, is the code prepared to use a node label that is (i.e., one space)?

Thanks!

CoderPat commented 5 years ago

These code is quite old and it was used with earlier versions of code that didnt complain about empty. You are right that it does nothing, so it's theoretically a bug. However, besides the vocabulary building phase, it's just like any other string, so code should work. However will fix it.

ioana-blue commented 5 years ago

I still see some empty node labels even after these fixes. I'm going to see if I can find the culprit and let you know. I think they are coming from traversing node with. I'll look into it soon.

ioana-blue commented 5 years ago

I think I found the problem. This line: https://github.com/CoderPat/structured-neural-summarization/blob/master/parsers/sourcecode/barone/ast_graph_generator.py#L275

there is an empty node being created unnecessarily. I think it should read:

if idx:
      self.terminal(', ')

From your experience, are these separator commas necessary/useful? Wondering if they add anything to the graph...

Thanks!

CoderPat commented 5 years ago

Thanks for the code review, I'll fix it. Intuitivelly they shouldn't matter too much. However, like most ML models, we cannot tell what it's doing and it might be that the comma separating variables carries some semantic meaning to the model, but I don't have experimental results to back this up. I'll would just leave it there since it then produces a more faithful representation of the code, but feel free to experiment