Potential bug in graph inputter(?)

ioana-blue commented 5 years ago

I've started debugging and I'm looking at only one single sample in my inference file and printing some info on it.

This is my graph:

{"edges": [["child", 0, 1], ["child", 0, 2], ["child", 0, 3], ["child", 0, 4], ["child", 0, 5], ["child", 0, 6], ["child", 0, 7], ["child", 0, 8], ["child", 0, 9], ["child", 0, 10], ["child", 0, 11], ["child", 0, 12], ["child", 0, 13], ["child", 0, 16], ["child", 0, 27], ["child", 13, 14], ["child", 14, 15], ["child", 16, 17], ["child", 16, 19], ["child", 16, 20], ["child", 17, 18], ["child", 20, 24], ["child", 20, 26], ["child", 20, 21], ["child", 20, 23], ["child", 21, 22], ["child", 24, 25], ["child", 27, 28], ["child", 27, 29], ["child", 29, 32], ["child", 29, 33], ["child", 29, 35], ["child", 29, 36], ["child", 29, 38], ["child", 29, 39], ["child", 29, 40], ["child", 29, 42], ["child", 29, 43], ["child", 29, 44], ["child", 29, 50], ["child", 29, 30], ["child", 30, 31], ["child", 33, 34], ["child", 36, 37], ["child", 40, 41], ["child", 44, 48], ["child", 44, 45], ["child", 44, 47], ["child", 45, 46], ["child", 48, 49], ["NextToken", 1, 2], ["NextToken", 2, 3], ["NextToken", 3, 4], ["NextToken", 4, 5], ["NextToken", 5, 6], ["NextToken", 6, 7], ["NextToken", 7, 8], ["NextToken", 8, 9], ["NextToken", 9, 10], ["NextToken", 10, 11], ["NextToken", 11, 12], ["NextToken", 12, 15], ["NextToken", 15, 18], ["NextToken", 18, 19], ["NextToken", 19, 51], ["NextToken", 19, 22], ["NextToken", 22, 23], ["NextToken", 23, 25], ["NextToken", 25, 26], ["NextToken", 26, 28], ["NextToken", 28, 31], ["NextToken", 31, 32], ["NextToken", 32, 34], ["NextToken", 34, 35], ["NextToken", 35, 37], ["NextToken", 37, 38], ["NextToken", 38, 39], ["NextToken", 39, 41], ["NextToken", 41, 42], ["NextToken", 42, 43], ["NextToken", 43, 46], ["NextToken", 46, 47], ["NextToken", 47, 49], ["NextToken", 49, 50], ["NextToken", 51, 52], ["NextToken", 52, 23], ["last_lexical", 25, 18], ["last_lexical", 46, 25], ["last_lexical", 49, 41], ["last_use", 18, 25], ["last_use", 46, 18], ["last_use", 49, 41], ["computed_from", 18, 25], ["computed_from", 18, 22], ["return_to", 28, 0], ["last_write", 46, 18], 
["Subtoken", 22, 51], ["Subtoken", 22, 52]], "node_labels": ["FunctionDef", "def", "wminkowski", "(", "u", ",", "v", ",", "p", ",", "w", ")", ":", "Expr", "Str", "string", "Assign", "Name", "w", "=", "Call", "Name", "_validate_weights", "(", "Name", "w", ")", "Return", "return", "Call", "Name", "minkowski", "(", "Name", "u", ",", "Name", "v", ",", "p=", "Name", "p", ",", "w=", "BinOp", "Name", "w", "**", "Name", "p", ")", "validate", "weights"], "backbone_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 18, 19, 22, 23, 25, 26, 28, 31, 32, 34, 35, 37, 38, 39, 41, 42, 43, 46, 47, 49, 50]}

This is my edge vocab:

child
nexttoken
last_use
last_lexical
subtoken
last_write
computed_from
return_to

I'm printing the features received by the __call__ function of the sequence graph to sequence model to check whether the incoming graph is represented correctly, and I think it is not. I'm still trying to figure out what all this means, so I could use some help.

This is what gets printed:

{'primary_path': array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 15, 18, 19, 22,
        23, 25, 26, 28, 31, 32, 34, 35, 37, 38, 39, 41, 42, 43, 46, 47,
        49, 50]]), 'primary_path_length': array([34], dtype=int32), 'graph': SparseTensorValue(indices=array([[ 0,  0,  0,  1],
       [ 0,  0,  0,  2],
       [ 0,  0,  0,  3],
       [ 0,  0,  0,  4],
       [ 0,  0,  0,  5],
       [ 0,  0,  0,  6],
       [ 0,  0,  0,  7],
       [ 0,  0,  0,  8],
       [ 0,  0,  0,  9],
       [ 0,  0,  0, 10],
       [ 0,  0,  0, 11],
       [ 0,  0,  0, 12],
       [ 0,  0,  0, 16],
       [ 0,  0,  0, 27],
       [ 0,  0, 14, 15],
       [ 0,  0, 16, 17],
       [ 0,  0, 16, 19],
       [ 0,  0, 16, 20],
       [ 0,  0, 17, 18],
       [ 0,  0, 20, 24],
       [ 0,  0, 20, 26],
       [ 0,  0, 20, 21],
       [ 0,  0, 20, 23],
       [ 0,  0, 21, 22],
       [ 0,  0, 24, 25],
       [ 0,  0, 27, 28],
       [ 0,  0, 27, 29],
       [ 0,  0, 29, 32],
       [ 0,  0, 29, 33],
       [ 0,  0, 29, 35],
       [ 0,  0, 29, 36],
       [ 0,  0, 29, 38],
       [ 0,  0, 29, 39],
       [ 0,  0, 29, 40],
       [ 0,  0, 29, 42],
       [ 0,  0, 29, 43],
       [ 0,  0, 29, 44],
       [ 0,  0, 29, 50],
       [ 0,  0, 29, 30],
       [ 0,  0, 30, 31],
       [ 0,  0, 33, 34],
       [ 0,  0, 36, 37],
       [ 0,  0, 40, 41],
       [ 0,  0, 44, 48],
       [ 0,  0, 44, 45],
       [ 0,  0, 44, 47],
       [ 0,  0, 45, 46],
       [ 0,  0, 48, 49],
       [ 0,  8,  1,  2],
       [ 0,  8,  2,  3],
       [ 0,  8,  3,  4],
       [ 0,  8,  4,  5],
       [ 0,  8,  5,  6],
       [ 0,  8,  6,  7],
       [ 0,  8,  7,  8],
       [ 0,  8,  8,  9],
       [ 0,  8,  9, 10],
       [ 0,  8, 10, 11],
       [ 0,  8, 11, 12],
       [ 0,  8, 12, 15],
       [ 0,  8, 15, 18],
       [ 0,  8, 18, 19],
       [ 0,  8, 19, 51],
       [ 0,  8, 19, 22],
       [ 0,  8, 22, 23],
       [ 0,  8, 23, 25],
       [ 0,  8, 25, 26],
       [ 0,  8, 26, 28],
       [ 0,  8, 28, 31],
       [ 0,  8, 31, 32],
       [ 0,  8, 32, 34],
       [ 0,  8, 34, 35],
       [ 0,  8, 35, 37],
       [ 0,  8, 37, 38],
       [ 0,  8, 38, 39],
       [ 0,  8, 39, 41],
       [ 0,  8, 41, 42],
       [ 0,  8, 42, 43],
       [ 0,  8, 43, 46],
       [ 0,  8, 46, 47],
       [ 0,  8, 47, 49],
       [ 0,  8, 49, 50],
       [ 0,  3, 25, 18],
       [ 0,  3, 46, 25],
       [ 0,  3, 49, 41],
       [ 0,  2, 18, 25],
       [ 0,  2, 46, 18],
       [ 0,  2, 49, 41],
       [ 0,  6, 18, 25],
       [ 0,  6, 18, 22],
       [ 0,  7, 28,  0],
       [ 0,  5, 46, 18],
       [ 0,  8, 22, 51]]), values=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1], dtype=int32), dense_shape=array([ 1,  9, 52, 52])), 'out_ids': array([[5714, 3003, 5715,   12,  975, 5716,  430, 5716,  110, 5716,  650,
          13,   45,  472,   57, 1484,  168,  650,   64,  209,  168, 5717,
          12,  168,  650,   13,   10,   10,  209,  168, 5718,   12,  168,
         975, 5716,  168,  430, 5716, 5719,  168,  110, 5716, 5720, 5721,
         168,  650,  510,  168,  110,   13,  350,  349]]), 'pointer_map': array([[b'functiondef', b'wminkowski', b',', b'_validate_weights',
        b'minkowski', b'p=', b'w=', b'binop']], dtype=object), 'length': array([52], dtype=int32), 'features': array([[   31,    30, 13324,     2,   146,     4,    93,     4,    81,
            4,   114,     3,    10,    13,    14,     9,     1,   114,
            8,     7,     1,  3974,     2,     1,   114,     3,    19,
           19,     7,     1,  3253,     2,     1,   146,     4,     1,
           93,     4,  2462,     1,    81,     4,  2784,    16,     1,
          114,    76,     1,    81,     3,   327,   159]])}

Some problems that I noticed in the representation of the graph:

["child", 0, 13] seems to be missing, assuming index 0 corresponds to the child type as in the edge vocab file.
All the NextToken edges have type 8, but my vocab only has entries 0-7, and NextToken should be index 1 in my vocab. That seems wrong to me.
The Subtoken edges (index 4 in the vocab) don't seem to be there.
I don't know what this one corresponds to: [ 0, 8, 22, 51]
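To make rows like these easier to read, here is a minimal sketch that turns index rows back into labelled edges. It assumes each row is laid out as [batch, edge type, source, target], which is what the dense_shape of [1, 9, 52, 52] and the rows above suggest, and that the id one past the end of the vocab is the unknown type. `decode_rows` and the `<unknown>` label are hypothetical names for illustration, not part of OpenGNN.

```python
# Edge vocab in the same order as the vocab file above.
EDGE_VOCAB = ["child", "nexttoken", "last_use", "last_lexical",
              "subtoken", "last_write", "computed_from", "return_to"]


def decode_rows(rows, vocab=EDGE_VOCAB):
    """Map [batch, edge_type, src, dst] rows to (label, src, dst) tuples.

    Assumption: the id len(vocab) stands for the unknown edge type.
    """
    labels = list(vocab) + ["<unknown>"]
    return [(labels[edge_type], src, dst)
            for _batch, edge_type, src, dst in rows]


# A few rows copied from the printed SparseTensorValue.
rows = [[0, 0, 0, 1], [0, 8, 1, 2], [0, 3, 25, 18]]
decoded = decode_rows(rows)
print(decoded)
# [('child', 0, 1), ('<unknown>', 1, 2), ('last_lexical', 25, 18)]
```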

I would appreciate it if you could take a look and clarify some of this. Meanwhile, I'm poking more at it. Thanks!

ioana-blue commented 5 years ago

After poking at it more: index 8 above may still be OK, as I see there is an unknown edge type being added. Still, the other issues with missing nodes remain.

ioana-blue commented 5 years ago

Could this still be a case sensitivity issue? In that case NextToken would be treated as an unknown type (since it differs from nexttoken), which is why it gets id 8. Then edge [ 0, 8, 22, 51] would correspond to the Subtoken edge. Still, the Subtoken edge 22->52 is missing. I think this might be part of what is going on.

ioana-blue commented 5 years ago

I'm pretty confident that's part of what's going on, since I printed the following lookup: lookup = self.edge_vocabulary.lookup(tf.constant(['child' , 'nexttoken', 'last_use','last_lexical', 'subtoken', 'last_write', 'computed_from', 'return_to', 'NextToken'])) And I get, as expected:

[array([0, 1, 2, 3, 4, 5, 6, 7, 8])]

So NextToken goes to unknown. Same for Subtoken. There are still some missing edges, though. Also, I'm already running with case_sensitive False, so I'll have to poke around to see why the graphs are not converted to lower case.
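The behaviour described here can be reproduced without TensorFlow. This is only a plain-Python sketch of a vocabulary lookup with a single out-of-vocabulary id, assuming (as the printed lookup suggests) that unknown labels map to the vocab size, 8. The `lookup` function and its `lowercase` flag are hypothetical and do not mirror the OpenGNN API.

```python
EDGE_VOCAB = ["child", "nexttoken", "last_use", "last_lexical",
              "subtoken", "last_write", "computed_from", "return_to"]


def lookup(labels, vocab=EDGE_VOCAB, lowercase=False):
    """Return the vocab id of each label; unknown labels get len(vocab)."""
    index = {label: i for i, label in enumerate(vocab)}
    unknown_id = len(vocab)  # assumption: one shared bucket for unknowns
    ids = []
    for label in labels:
        key = label.lower() if lowercase else label
        ids.append(index.get(key, unknown_id))
    return ids


# Mixed-case labels miss the lower-case vocab and fall into the unknown id.
print(lookup(["child", "NextToken", "Subtoken"]))                  # [0, 8, 8]
# Lower-casing before the lookup recovers the intended ids.
print(lookup(["child", "NextToken", "Subtoken"], lowercase=True))  # [0, 1, 4]
```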

ioana-blue commented 5 years ago

I'm making progress. I lower-cased my graph, and now this is what I get for features:

{'primary_path': array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 15, 18, 19, 22,
        23, 25, 26, 28, 31, 32, 34, 35, 37, 38, 39, 41, 42, 43, 46, 47,
        49, 50]]), 'primary_path_length': array([34], dtype=int32), 'graph': SparseTensorValue(indices=array([[ 0,  0,  0,  1],
       [ 0,  0,  0,  2],
       [ 0,  0,  0,  3],
       [ 0,  0,  0,  4],
       [ 0,  0,  0,  5],
       [ 0,  0,  0,  6],
       [ 0,  0,  0,  7],
       [ 0,  0,  0,  8],
       [ 0,  0,  0,  9],
       [ 0,  0,  0, 10],
       [ 0,  0,  0, 11],
       [ 0,  0,  0, 12],
       [ 0,  0,  0, 16],
       [ 0,  0,  0, 27],
       [ 0,  0, 14, 15],
       [ 0,  0, 16, 17],
       [ 0,  0, 16, 19],
       [ 0,  0, 16, 20],
       [ 0,  0, 17, 18],
       [ 0,  0, 20, 24],
       [ 0,  0, 20, 26],
       [ 0,  0, 20, 21],
       [ 0,  0, 20, 23],
       [ 0,  0, 21, 22],
       [ 0,  0, 24, 25],
       [ 0,  0, 27, 28],
       [ 0,  0, 27, 29],
       [ 0,  0, 29, 32],
       [ 0,  0, 29, 33],
       [ 0,  0, 29, 35],
       [ 0,  0, 29, 36],
       [ 0,  0, 29, 38],
       [ 0,  0, 29, 39],
       [ 0,  0, 29, 40],
       [ 0,  0, 29, 42],
       [ 0,  0, 29, 43],
       [ 0,  0, 29, 44],
       [ 0,  0, 29, 50],
       [ 0,  0, 29, 30],
       [ 0,  0, 30, 31],
       [ 0,  0, 33, 34],
       [ 0,  0, 36, 37],
       [ 0,  0, 40, 41],
       [ 0,  0, 44, 48],
       [ 0,  0, 44, 45],
       [ 0,  0, 44, 47],
       [ 0,  0, 45, 46],
       [ 0,  0, 48, 49],
       [ 0,  1,  1,  2],
       [ 0,  1,  2,  3],
       [ 0,  1,  3,  4],
       [ 0,  1,  4,  5],
       [ 0,  1,  5,  6],
       [ 0,  1,  6,  7],
       [ 0,  1,  7,  8],
       [ 0,  1,  8,  9],
       [ 0,  1,  9, 10],
       [ 0,  1, 10, 11],
       [ 0,  1, 11, 12],
       [ 0,  1, 12, 15],
       [ 0,  1, 15, 18],
       [ 0,  1, 18, 19],
       [ 0,  1, 19, 51],
       [ 0,  1, 19, 22],
       [ 0,  1, 22, 23],
       [ 0,  1, 23, 25],
       [ 0,  1, 25, 26],
       [ 0,  1, 26, 28],
       [ 0,  1, 28, 31],
       [ 0,  1, 31, 32],
       [ 0,  1, 32, 34],
       [ 0,  1, 34, 35],
       [ 0,  1, 35, 37],
       [ 0,  1, 37, 38],
       [ 0,  1, 38, 39],
       [ 0,  1, 39, 41],
       [ 0,  1, 41, 42],
       [ 0,  1, 42, 43],
       [ 0,  1, 43, 46],
       [ 0,  1, 46, 47],
       [ 0,  1, 47, 49],
       [ 0,  1, 49, 50],
       [ 0,  3, 25, 18],
       [ 0,  3, 46, 25],
       [ 0,  3, 49, 41],
       [ 0,  2, 18, 25],
       [ 0,  2, 46, 18],
       [ 0,  2, 49, 41],
       [ 0,  6, 18, 25],
       [ 0,  6, 18, 22],
       [ 0,  7, 28,  0],
       [ 0,  5, 46, 18],
       [ 0,  4, 22, 51]]), values=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1], dtype=int32), dense_shape=array([ 1,  9, 52, 52])), 'out_ids': array([[5714, 3003, 5715,   12,  975, 5716,  430, 5716,  110, 5716,  650,
          13,   45,  472,   57, 1484,  168,  650,   64,  209,  168, 5717,
          12,  168,  650,   13,   10,   10,  209,  168, 5718,   12,  168,
         975, 5716,  168,  430, 5716, 5719,  168,  110, 5716, 5720, 5721,
         168,  650,  510,  168,  110,   13,  350,  349]]), 'pointer_map': array([[b'functiondef', b'wminkowski', b',', b'_validate_weights',
        b'minkowski', b'p=', b'w=', b'binop']], dtype=object), 'length': array([52], dtype=int32), 'features': array([[   31,    30, 13324,     2,   146,     4,    93,     4,    81,
            4,   114,     3,    10,    13,    14,     9,     1,   114,
            8,     7,     1,  3974,     2,     1,   114,     3,    19,
           19,     7,     1,  3253,     2,     1,   146,     4,     1,
           93,     4,  2462,     1,    81,     4,  2784,    16,     1,
          114,    76,     1,    81,     3,   327,   159]])}

So I managed to get rid of the unknown edge types, but the problem with missing edges remains: child 0-13 and subtoken 22-52.
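For reference, lower-casing the edge labels of one sample can be sketched as below. It assumes one JSON object per line with an "edges" list of [label, source, target] triples, as in the graph shown earlier; `lowercase_edge_labels` is a hypothetical helper name, and node labels are deliberately left untouched.

```python
import json


def lowercase_edge_labels(line):
    """Lower-case only the edge labels of a single JSON-encoded sample."""
    sample = json.loads(line)
    sample["edges"] = [[label.lower(), src, dst]
                       for label, src, dst in sample["edges"]]
    return json.dumps(sample)


line = '{"edges": [["NextToken", 1, 2], ["Subtoken", 22, 51]], "node_labels": ["FunctionDef", "def"]}'
fixed = json.loads(lowercase_edge_labels(line))
print(fixed["edges"])  # [['nexttoken', 1, 2], ['subtoken', 22, 51]]
```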

ioana-blue commented 5 years ago

There are 5 missing edges; I tried to find them all: [0, 0, 0, 13], [0, 0, 13, 14], [0, 4, 22, 52], [0, 1, 51, 52], [0, 1, 52, 23]
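A comparison like this does not have to be done by eye. The sketch below diffs the edges from the JSON sample against the index rows of the sparse tensor, assuming rows are [batch, edge type, source, target] and a single sample at batch index 0. `missing_edges` is a hypothetical helper, and the inputs are a small excerpt of the data in this thread rather than the full graph.

```python
EDGE_VOCAB = ["child", "nexttoken", "last_use", "last_lexical",
              "subtoken", "last_write", "computed_from", "return_to"]


def missing_edges(json_edges, sparse_indices, vocab=EDGE_VOCAB):
    """Return expected [batch, type, src, dst] rows absent from the tensor."""
    type_id = {label: i for i, label in enumerate(vocab)}
    observed = {tuple(row) for row in sparse_indices}
    missing = []
    for label, src, dst in json_edges:
        # Batch index 0: this sketch assumes one sample per batch.
        row = (0, type_id[label.lower()], src, dst)
        if row not in observed:
            missing.append(row)
    return missing


# Excerpt of the JSON edges and of the printed indices.
json_edges = [["child", 0, 12], ["child", 0, 13], ["child", 13, 14],
              ["child", 14, 15], ["Subtoken", 22, 51], ["Subtoken", 22, 52],
              ["NextToken", 51, 52], ["NextToken", 52, 23]]
sparse_indices = [[0, 0, 0, 12], [0, 0, 14, 15], [0, 4, 22, 51]]

result = missing_edges(json_edges, sparse_indices)
print(result)
# [(0, 0, 0, 13), (0, 0, 13, 14), (0, 4, 22, 52), (0, 1, 51, 52), (0, 1, 52, 23)]
```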

ioana-blue commented 5 years ago

My understanding is that there is a cutoff for the number of nodes at 500. Are there any other cutoffs in place?

CoderPat commented 5 years ago

Yes, it is. I dug into this. There was in fact a "bug" in the sequenced graph inputter. I was pruning nodes that weren't connected to the remaining nodes in the main sequence, but this was removing nodes that shouldn't have been removed. I fixed this by running a full BFS instead. This might make the pre-processing a bit slower; let me know if it's too impactful. The change was in OpenGNN, so you need to install the latest version.
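As a rough illustration of the idea (not the actual OpenGNN code), a full BFS from the main sequence keeps every node that is reachable through any chain of edges, rather than only direct neighbours. The sketch treats edges as undirected, which is an assumption, and `reachable_nodes` is a hypothetical name.

```python
from collections import deque


def reachable_nodes(edges, start_nodes):
    """Breadth-first search over [label, src, dst] edges from start_nodes."""
    adjacency = {}
    for _label, src, dst in edges:
        # Assumption: connectivity ignores edge direction and edge type.
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Excerpt of the graph above, plus one made-up disconnected edge (60, 61).
edges = [["child", 0, 13], ["child", 13, 14], ["child", 14, 15],
         ["subtoken", 22, 52], ["nexttoken", 51, 52], ["child", 60, 61]]
backbone = [15, 22]

kept = reachable_nodes(edges, backbone)
print(sorted(kept))  # [0, 13, 14, 15, 22, 51, 52]
```

Nodes 13 and 52 are kept because they are reachable from the backbone in more than zero hops, while 60 and 61 are pruned since no path connects them to it.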

ioana-blue commented 5 years ago

Got it. Thanks for looking into this! I'll get a chance to retry tomorrow. I'll let you know how it goes.

ioana-blue commented 5 years ago

For the graph above, I can confirm that the correct graph is now inputted, so hopefully this bug is fixed.

CoderPat / structured-neural-summarization

Potential bug in graph inputter(?) #13