Closed. ioana-blue closed this issue 5 years ago.
Poking more: index 8 above may still be OK, since I see there is an unknown edge type being added. Still, the other issues with missing nodes remain.
Is it possible this is still a case-sensitivity issue? In that case NextToken would be treated as an unknown type (since it's different from nexttoken), and that's why it gets id 8?
Then the edge [ 0, 8, 22, 51] would correspond to the subtoken edge. Still, the subtoken edge from 22->52 is missing. I think this might be partially what's going on.
I'm pretty confident that's partially what's going on since I printed the following lookup:
lookup = self.edge_vocabulary.lookup(tf.constant(['child', 'nexttoken', 'last_use', 'last_lexical', 'subtoken', 'last_write', 'computed_from', 'return_to', 'NextToken']))
And I get - as expected -
[array([0, 1, 2, 3, 4, 5, 6, 7, 8])]
So NextToken goes to unknown, and the same for Subtoken. Still, there are some missing edges. Also, I'm already running with case_sensitive set to False, so I'll have to poke around to see why the graphs are not transformed to lower case.
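The lookup behaviour above can be illustrated with a minimal pure-Python sketch (this is not the actual tf.lookup table, just its observable semantics under the assumption that ids follow the vocab-file order and that any unseen label, e.g. one with the wrong case, falls into a single out-of-vocabulary bucket with id 8):

```python
# Hypothetical model of the edge-type vocabulary lookup in this thread.
# Labels are assumed to be stored lower-cased in the vocab file; any label
# not found (such as the mixed-case 'NextToken') maps to the OOV id.
EDGE_VOCAB = ['child', 'nexttoken', 'last_use', 'last_lexical',
              'subtoken', 'last_write', 'computed_from', 'return_to']
OOV_ID = len(EDGE_VOCAB)  # 8, matching the lookup output above

def lookup(label, case_sensitive=True):
    key = label if case_sensitive else label.lower()
    try:
        return EDGE_VOCAB.index(key)
    except ValueError:
        return OOV_ID

print(lookup('NextToken'))                        # mixed case -> OOV -> 8
print(lookup('NextToken', case_sensitive=False))  # lower-cased first -> 1
```

This matches the printed result `[array([0, 1, 2, 3, 4, 5, 6, 7, 8])]`: the eight known labels get their file-order ids and the ninth, mixed-case `NextToken`, lands in the OOV bucket.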
I'm making progress. I lower-cased my graph, and now this is what I get for features:
{'primary_path': array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 18, 19, 22,
23, 25, 26, 28, 31, 32, 34, 35, 37, 38, 39, 41, 42, 43, 46, 47,
49, 50]]), 'primary_path_length': array([34], dtype=int32), 'graph': SparseTensorValue(indices=array([[ 0, 0, 0, 1],
[ 0, 0, 0, 2],
[ 0, 0, 0, 3],
[ 0, 0, 0, 4],
[ 0, 0, 0, 5],
[ 0, 0, 0, 6],
[ 0, 0, 0, 7],
[ 0, 0, 0, 8],
[ 0, 0, 0, 9],
[ 0, 0, 0, 10],
[ 0, 0, 0, 11],
[ 0, 0, 0, 12],
[ 0, 0, 0, 16],
[ 0, 0, 0, 27],
[ 0, 0, 14, 15],
[ 0, 0, 16, 17],
[ 0, 0, 16, 19],
[ 0, 0, 16, 20],
[ 0, 0, 17, 18],
[ 0, 0, 20, 24],
[ 0, 0, 20, 26],
[ 0, 0, 20, 21],
[ 0, 0, 20, 23],
[ 0, 0, 21, 22],
[ 0, 0, 24, 25],
[ 0, 0, 27, 28],
[ 0, 0, 27, 29],
[ 0, 0, 29, 32],
[ 0, 0, 29, 33],
[ 0, 0, 29, 35],
[ 0, 0, 29, 36],
[ 0, 0, 29, 38],
[ 0, 0, 29, 39],
[ 0, 0, 29, 40],
[ 0, 0, 29, 42],
[ 0, 0, 29, 43],
[ 0, 0, 29, 44],
[ 0, 0, 29, 50],
[ 0, 0, 29, 30],
[ 0, 0, 30, 31],
[ 0, 0, 33, 34],
[ 0, 0, 36, 37],
[ 0, 0, 40, 41],
[ 0, 0, 44, 48],
[ 0, 0, 44, 45],
[ 0, 0, 44, 47],
[ 0, 0, 45, 46],
[ 0, 0, 48, 49],
[ 0, 1, 1, 2],
[ 0, 1, 2, 3],
[ 0, 1, 3, 4],
[ 0, 1, 4, 5],
[ 0, 1, 5, 6],
[ 0, 1, 6, 7],
[ 0, 1, 7, 8],
[ 0, 1, 8, 9],
[ 0, 1, 9, 10],
[ 0, 1, 10, 11],
[ 0, 1, 11, 12],
[ 0, 1, 12, 15],
[ 0, 1, 15, 18],
[ 0, 1, 18, 19],
[ 0, 1, 19, 51],
[ 0, 1, 19, 22],
[ 0, 1, 22, 23],
[ 0, 1, 23, 25],
[ 0, 1, 25, 26],
[ 0, 1, 26, 28],
[ 0, 1, 28, 31],
[ 0, 1, 31, 32],
[ 0, 1, 32, 34],
[ 0, 1, 34, 35],
[ 0, 1, 35, 37],
[ 0, 1, 37, 38],
[ 0, 1, 38, 39],
[ 0, 1, 39, 41],
[ 0, 1, 41, 42],
[ 0, 1, 42, 43],
[ 0, 1, 43, 46],
[ 0, 1, 46, 47],
[ 0, 1, 47, 49],
[ 0, 1, 49, 50],
[ 0, 3, 25, 18],
[ 0, 3, 46, 25],
[ 0, 3, 49, 41],
[ 0, 2, 18, 25],
[ 0, 2, 46, 18],
[ 0, 2, 49, 41],
[ 0, 6, 18, 25],
[ 0, 6, 18, 22],
[ 0, 7, 28, 0],
[ 0, 5, 46, 18],
[ 0, 4, 22, 51]]), values=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1], dtype=int32), dense_shape=array([ 1, 9, 52, 52])), 'out_ids': array([[5714, 3003, 5715, 12, 975, 5716, 430, 5716, 110, 5716, 650,
13, 45, 472, 57, 1484, 168, 650, 64, 209, 168, 5717,
12, 168, 650, 13, 10, 10, 209, 168, 5718, 12, 168,
975, 5716, 168, 430, 5716, 5719, 168, 110, 5716, 5720, 5721,
168, 650, 510, 168, 110, 13, 350, 349]]), 'pointer_map': array([[b'functiondef', b'wminkowski', b',', b'_validate_weights',
b'minkowski', b'p=', b'w=', b'binop']], dtype=object), 'length': array([52], dtype=int32), 'features': array([[ 31, 30, 13324, 2, 146, 4, 93, 4, 81,
4, 114, 3, 10, 13, 14, 9, 1, 114,
8, 7, 1, 3974, 2, 1, 114, 3, 19,
19, 7, 1, 3253, 2, 1, 146, 4, 1,
93, 4, 2462, 1, 81, 4, 2784, 16, 1,
114, 76, 1, 81, 3, 327, 159]])}
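The sparse `graph` tensor above encodes each edge as a `[batch, edge_type, source, target]` row. As a hedged sketch for reading that dump (the edge-type order is taken from the vocabulary lookup earlier in the thread, with id 8 assumed to be the unknown bucket; `decode_edges` is a hypothetical helper, not part of the codebase):

```python
# Decode [batch, edge_type, source, target] rows into readable edges.
# The type order is assumed to follow the edge vocab file, plus '<unk>' at 8.
EDGE_TYPES = ['child', 'nexttoken', 'last_use', 'last_lexical',
              'subtoken', 'last_write', 'computed_from', 'return_to', '<unk>']

def decode_edges(indices):
    return [(EDGE_TYPES[t], src, dst) for _, t, src, dst in indices]

# Two rows from the dump above:
print(decode_edges([[0, 4, 22, 51], [0, 7, 28, 0]]))
# -> [('subtoken', 22, 51), ('return_to', 28, 0)]
```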
So I managed to get rid of the unknown edge types, but the problem with missing edges remains: the child edge 0->13 and the subtoken edge 22->52.
There are 5 missing edges; I'm trying to find them all: [0, 0, 0, 13], [0, 0, 13, 14], [0, 4, 22, 52], [0, 1, 51, 52], [0, 1, 52, 23]
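One way to track such edges down systematically (a hypothetical helper, not part of OpenGNN) is to diff the edge set of the original graph against the rows of the sparse tensor:

```python
def missing_edges(expected, actual):
    # Each edge is a [batch, edge_type, source, target] row; compare as
    # tuples of ints so the row order in the feature dump does not matter.
    return sorted(set(map(tuple, expected)) - set(map(tuple, actual)))

# Toy illustration: five edges present in the source graph but absent from
# the sparse indices (the sixth edge appears in both, so it is not reported).
expected = [(0, 0, 0, 13), (0, 0, 13, 14), (0, 4, 22, 52),
            (0, 1, 51, 52), (0, 1, 52, 23), (0, 0, 0, 1)]
actual = [(0, 0, 0, 1)]
print(missing_edges(expected, actual))
```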
My understanding is that there is a cutoff for the number of nodes at 500. Are there any other cutoffs in place?
Yes it is. I dug into this. There was in fact a "bug" in the sequenced graph inputter: I was pruning nodes that weren't connected to the remaining nodes in the main sequence, but this was removing nodes that shouldn't be removed. I fixed it by now running a full BFS. This might make the pre-processing a bit slower; let me know if it's too impactful. The change was in OpenGNN, so you need to install the latest version.
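The full-BFS fix can be sketched as follows (a minimal standalone reachability pass, assuming the goal is to keep every node connected, even indirectly, to the main-sequence nodes; this is not the actual OpenGNN code):

```python
from collections import deque

def reachable_nodes(num_nodes, edges, roots):
    # Build an undirected adjacency view so connectivity in either
    # direction keeps a node alive.
    adj = {i: [] for i in range(num_nodes)}
    for src, dst in edges:
        adj[src].append(dst)
        adj[dst].append(src)
    # Full BFS from every main-sequence node; anything unreached is pruned.
    seen, queue = set(roots), deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Nodes 3 and 4 form a component disconnected from the root, so only
# 0, 1, 2 survive the pruning pass.
print(reachable_nodes(5, [(0, 1), (1, 2), (3, 4)], [0]))
```

A one-hop connectivity check (the original behaviour described above) would incorrectly drop node 2 here, since it touches the root only through node 1; the full BFS keeps it.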
Got it. Thanks for looking into this! I'll get a chance to retry tomorrow. I'll let you know how it goes.
For the graph above I can confirm that the correct graph is now being input, so hopefully this bug is fixed.
I've started debugging and I'm looking at only one single sample in my inference file and printing some info on it.
This is my graph:
This is my edge vocab:
I'm printing the features that are received by the
__call__
function in the sequenced graph-to-sequence model, to understand whether the representation of the incoming graph is correct, and I think it's not. I'm still trying to figure out what all this means, so I could use some help. This is what gets printed:
Some problems that I noticed in the representation of the graph:
["child", 0, 13]
seems to be missing, assuming index 0 corresponds to the type child as in the edge vocab file.
The edge [ 0, 8, 22, 51] uses type index 8, which does not appear in the edge vocab file.
I would appreciate if you take a look to clarify some of this. Meanwhile, I'm poking more at it. Thanks!