CoderPat / structured-neural-summarization

A repository with the code for the paper with the same title

Graphs produced by AST visitor compatible with the rest of the code? #10

Closed ioana-blue closed 5 years ago

ioana-blue commented 5 years ago

Just wondering if the graphs produced by the AST visitor for Python source code are compatible with the rest of the code (in particular the inputters). I'm training for 5k steps and I'm getting pure garbage. I'll debug tomorrow, but it could save me lots of time to know whether the subtoken graphs are supposed to be compatible and, if not, what I need to do to make them so. Thanks!

ioana-blue commented 5 years ago

In particular, all predicted sequences are the same irrespective of the input (which I hope will be easier to debug than random garbage). Any hunch what could be wrong? Wondering if you've come across this while working with the model. Thanks!

CoderPat commented 5 years ago

I think I know what's going on. Originally, the code was supposed to run with a subtokenizer inputter, where tokens were subtokenized and their embeddings averaged (see this class). However, we ended up using a different option, where we unroll the subtokens and make them a chain (see this script). I would recommend the second approach since it is the most tested and doesn't require changing the model. Let me know if this doesn't make sense; I'll update the docs to reflect this when I have time.
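
To illustrate the unrolling idea, here is a minimal sketch (the helper names are hypothetical and the actual script does more, e.g. wiring the subtokens into the backbone NextToken sequence): each identifier node keeps its original label, new nodes are appended for its subtokens, and they are connected back with Subtoken edges plus a NextToken chain among themselves.

```python
import re

def split_identifier(name):
    # Split snake_case / camelCase identifiers into lower-cased subtokens,
    # e.g. "_estimate_weighted_log_prob" -> ["estimate", "weighted", "log", "prob"].
    subtokens = []
    for part in re.split(r"[_\s]+", name):
        subtokens.extend(
            s.lower() for s in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
        )
    return subtokens

def unroll_subtokens(node_labels, edges, identifier_node_ids):
    # Append one new node per subtoken and link it to its identifier node
    # with a "Subtoken" edge; consecutive subtokens of the same identifier
    # are chained together with "NextToken" edges.
    labels, new_edges = list(node_labels), list(edges)
    for node_id in identifier_node_ids:
        subs = split_identifier(labels[node_id])
        if len(subs) <= 1:
            continue
        prev_id = None
        for sub in subs:
            sub_id = len(labels)
            labels.append(sub)
            new_edges.append(["Subtoken", node_id, sub_id])
            if prev_id is not None:
                new_edges.append(["NextToken", prev_id, sub_id])
            prev_id = sub_id
    return labels, new_edges
```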

ioana-blue commented 5 years ago

What you're saying makes sense, but I'm already using the subtoken graphs. Also, when creating the vocab files, I'm using the subtoken graphs (I assume the labels/subtokens are the new vocab words). I still see garbage.

Let me paste a small graph from my input, if you have time to take a look. Thank you for your help, much appreciated!

{"edges": [["child", 0, 1], ["child", 0, 2], ["child", 0, 3], ["child", 0, 4], ["child", 0, 5], ["child", 0, 6], ["child", 0, 7], ["child", 0, 8], ["child", 0, 9], ["child", 0, 12], ["child", 9, 10], ["child", 10, 11], ["child", 12, 13], ["child", 12, 14], ["child", 14, 25], ["child", 14, 26], ["child", 14, 15], ["child", 15, 16], ["child", 15, 24], ["child", 15, 21], ["child", 15, 22], ["child", 16, 17], ["child", 16, 19], ["child", 16, 20], ["child", 17, 18], ["child", 22, 23], ["child", 26, 32], ["child", 26, 33], ["child", 26, 27], ["child", 27, 28], ["child", 27, 30], ["child", 27, 31], ["child", 28, 29], ["NextToken", 1, 2], ["NextToken", 1, 34], ["NextToken", 2, 3], ["NextToken", 3, 4], ["NextToken", 4, 5], ["NextToken", 5, 6], ["NextToken", 6, 7], ["NextToken", 7, 8], ["NextToken", 8, 11], ["NextToken", 11, 13], ["NextToken", 13, 18], ["NextToken", 18, 19], ["NextToken", 19, 20], ["NextToken", 19, 38], ["NextToken", 20, 21], ["NextToken", 21, 23], ["NextToken", 23, 24], ["NextToken", 24, 25], ["NextToken", 25, 29], ["NextToken", 29, 30], ["NextToken", 30, 41], ["NextToken", 30, 31], ["NextToken", 31, 32], ["NextToken", 32, 33], ["NextToken", 34, 35], ["NextToken", 35, 36], ["NextToken", 36, 37], ["NextToken", 37, 3], ["NextToken", 38, 39], ["NextToken", 39, 40], ["NextToken", 40, 21], ["NextToken", 41, 42], ["NextToken", 42, 43], ["NextToken", 43, 32], ["return_to", 13, 0], ["last_lexical", 29, 18], ["last_use", 29, 18], ["Subtoken", 2, 34], ["Subtoken", 2, 35], ["Subtoken", 2, 36], ["Subtoken", 2, 37], ["Subtoken", 20, 40], ["Subtoken", 20, 38], ["Subtoken", 20, 39], ["Subtoken", 31, 41], ["Subtoken", 31, 42], ["Subtoken", 31, 43]], "node_labels": ["FunctionDef", "def", "_estimate_weighted_log_prob", "(", "self", ",", "X", ")", ":", "Expr", "Str", "string", "Return", "return", "BinOp", "Call", "Attribute", "Name", "self", ".", "_estimate_log_prob", "(", "Name", "X", ")", "+", "Call", "Attribute", "Name", "self", ".", "_estimate_log_weights", "(", ")", "estimate", "weighted", "log", "prob", "estimate", "log", "prob", "estimate", "log", "weights"], "backbone_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 11, 13, 18, 19, 20, 21, 23, 24, 25, 29, 30, 31, 32, 33]}

PS: I'll start printing tensor values to stare at the network today. I wanted to get it to work since I'm off for vacation tomorrow, but it doesn't look promising :) I'll be back to bug you in a week :)

ioana-blue commented 5 years ago

The subtoken graph looked ok to me.

CoderPat commented 5 years ago

Sorry for this "non-reproducible" code. It could be some case-sensitivity problem (although I'm pretty sure I made everything lower-cased by default). I'll try to find time to run it on Python code eventually. Just out of curiosity, what dataset are you using? Also, double check that the vocabs look OK.

ioana-blue commented 5 years ago

I'm using my own dataset; I hope to show some results in a future publication and make it public at some point. I can't share it until I get all the approvals in place, etc. :/

So here is what I'm thinking: it must be an "easy" fix, because if it were something like case sensitivity, I'm guessing it wouldn't produce the same garbage result irrespective of the input. After 5K training steps it produces "return the" for any input I give it.

My dataset is small; one thing I should check is that it's not too small to train the network, but I didn't get the impression that the network is huge.

ioana-blue commented 5 years ago

BTW, I really appreciate that you put out the code. I understood I was taking a risk by using the Python part that was not included in the paper. I hope that, even with these days of poking and debugging, it will save me time in the end. Thanks for your help!

CoderPat commented 5 years ago

No worries about the dataset.

Case-sensitivity / vocab problems would match exactly the scenario where the output is always the same, since all the tokens would be mapped to UNK or something like that, so it's good to rule those problems out. Also, when looking at tensor values, look at the token indexes right after the vocabulary lookup and see if they differ across inputs.

Regarding the model: while it is simple, code tokens tend to be quite sparse, and given that documentation is usually quite diverse, this might lead to problems. Also, GNNs (in particular the GGNN) can be a bit finicky to train. My advice is to make sure you pre-process the output docs enough (normalize/stem, remove weird stuff, etc.) to help the model.
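
An offline way to check the vocab side of this (a hedged sketch: the one-token-per-line vocab format and the file paths are assumptions, adjust to your setup) is to measure how many node labels fall outside the vocabulary. If the rate is near 100% under case-sensitive lookup, every token becomes UNK and the model will emit the same output regardless of input.

```python
import json

def unk_rate(graphs_path, vocab_path, lowercase=False):
    # Load a one-token-per-line vocabulary file (format is an assumption).
    with open(vocab_path) as f:
        vocab = {line.strip() for line in f if line.strip()}
    total = unknown = 0
    with open(graphs_path) as f:
        for line in f:
            if not line.strip():
                continue
            for label in json.loads(line)["node_labels"]:
                key = label.lower() if lowercase else label
                total += 1
                unknown += key not in vocab
    return unknown / max(total, 1)

# Hypothetical paths: compare case-sensitive vs lower-cased lookup.
print(unk_rate("train_graphs.jsonl", "vocab.src.txt"))
print(unk_rate("train_graphs.jsonl", "vocab.src.txt", lowercase=True))
```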

ioana-blue commented 5 years ago

BTW, the scripts do use case_sensitive=True, but everything is lower case, so this shouldn't matter? Shall I call train and eval with case_sensitive=False?

CoderPat commented 5 years ago

Yes, false! F*ck, that might be your problem, since the word "Function" doesn't exist in the vocab, only "function". Sorry about that.

ioana-blue commented 5 years ago

Well, I see your frustration but I don't understand it, because all my vocab is already lower case; even if I say case sensitive, everything is already lower case, so it shouldn't matter. But I'll run with it set to false. And I'll open an issue so you remember to fix it in the scripts for other enthusiastic users :)

CoderPat commented 5 years ago

Well, when I say case sensitive, I mean the model will distinguish between lower and upper case.

ioana-blue commented 5 years ago

I understand that. BUT all the processing you've done in the AST produces lower case, so there is no distinction to be made. Don't get me wrong, I'd love for this to be the problem, but I have zero hope it actually causes any grief in this case.

CoderPat commented 5 years ago

The problem isn't the terminal nodes, it's the non-terminal ones; the snippet you sent me has upper-case stuff, for example.

CoderPat commented 5 years ago

You can also build a case-sensitive vocabulary and try to run the model on that. It will just be less sample-efficient.
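
A hedged sketch of what building such a case-sensitive vocabulary could look like (file names and the one-token-per-line output format are assumptions; the repository's own vocab scripts may handle special tokens and thresholds differently):

```python
import json
from collections import Counter

def build_vocab(graphs_path, out_path, min_count=2):
    # Count every node label as-is (case preserved) across all graphs.
    counts = Counter()
    with open(graphs_path) as f:
        for line in f:
            if line.strip():
                counts.update(json.loads(line)["node_labels"])
    # Write one token per line, most frequent first.
    with open(out_path, "w") as f:
        for token, count in counts.most_common():
            if count >= min_count:
                f.write(token + "\n")

build_vocab("train_graphs.jsonl", "vocab.src.case_sensitive.txt")
```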

ioana-blue commented 5 years ago

You mean NextToken and such? If that's the problem, I'll buy you a beer when I get to meet you :))))

ioana-blue commented 5 years ago

You're right, the AST nodes are case sensitive. Somehow I missed that.

ioana-blue commented 5 years ago

And to add insult to injury, the generated vocab is lower-cased. OK, this is a potential problem; I hope your intuition is right and it's as big a problem as you think. I'd be thrilled!

ioana-blue commented 5 years ago

I'm starting to understand how big a problem this would be: pretty much all AST nodes will be UNK. OK, I'll keep you posted. Thank you for your help!

ioana-blue commented 5 years ago

I don't think I got rid of the problem. I think I got rid of a problem, but not all of them. I see the same thing: after 5K training iterations, I run inference and all I get is "return the". BUT there is another weird thing going on: I have only 2 graphs in my inference file, yet the result contains 3K+ predictions. This smells like some inputter problem.
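
As a first check before digging into the inputter (a hedged sketch, file names are hypothetical), I can confirm how many records the inference file actually contains, that each line parses as one complete graph, and compare that against the number of predicted lines:

```python
import json

def count_graphs(path):
    # Each non-empty line should parse as one complete graph;
    # json.loads raises if a graph is split across several lines.
    n = 0
    with open(path) as f:
        for line in f:
            if line.strip():
                json.loads(line)
                n += 1
    return n

def count_predictions(path):
    with open(path) as f:
        return sum(1 for line in f if line.strip())

print(count_graphs("test_graphs.jsonl"), "input graphs")
print(count_predictions("predictions.txt"), "predicted sequences")
```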

The good news is that I have it running in a notebook, so now I can start poking at it. I won't have much time today, so I'll probably be back in 10 days or so.