ChunchuanLv / AMR_AS_GRAPH_PREDICTION

Bugs in the code #3

Open jcyk opened 5 years ago

jcyk commented 5 years ago

The code contains bugs and cannot produce reasonable results. For example, the code tries to sort all training instances in a batch by length, but the roots are NOT reordered accordingly; see https://github.com/ChunchuanLv/AMR_AS_GRAPH_PREDICTION/blob/master/parser/DataIterator.py#L278.
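A minimal sketch of the mismatch, with hypothetical variable names rather than the repository's actual code:

```python
# Hypothetical sketch of the bug: instances are sorted by length,
# but the parallel list of roots is not permuted in the same way.
batch = [[1, 2], [3, 4, 5], [6]]      # stand-ins for training instances
roots = ["r0", "r1", "r2"]            # roots[i] belongs to batch[i]

order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)
batch = [batch[i] for i in order]     # instances reordered by length
# BUG: if roots is left untouched, roots[i] no longer matches batch[i].
# Fix: apply the same permutation to the roots as well:
roots = [roots[i] for i in order]
print(list(zip(batch, roots)))        # pairs stay aligned after the fix
```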

The authors should release reliable, or at least runnable, code if they want to make their work seriously open source.

ChunchuanLv commented 5 years ago

Oh man!! Thanks for spotting this. The good news is that this bug probably only affects my training-time pseudo-evaluation. For the actual evaluation, I write the results to a file and use external tools. Did you hit any other bugs while trying to run my code? (At least one group has been able to re-train the model.)

jcyk commented 5 years ago

Thanks for your reply! I ran your released code on the LDC2017T10 dataset. The result using amr-evaluation-tool-enhanced (https://github.com/ChunchuanLv/amr-evaluation-tool-enhanced) is: Smatch -> P: 0.739, R: 0.713, F: 0.725, which is not that strong compared to the parser reported in your paper, though we are using slightly different datasets.
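As a quick sanity check, the reported F follows from P and R up to rounding:

```python
# Sanity check: Smatch F1 derived from the reported precision and recall.
p, r = 0.739, 0.713
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.726 -- consistent with the reported 0.725; the last
                     # digit differs only because P and R are themselves
                     # rounded to three places.
```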

ChunchuanLv commented 5 years ago

I am not actually sure how different those two datasets are. A major part of the performance drop comes from tokenization: my paper version starts from the tokenized version of the AMR files, so the pre-processing is not the same. (Although I was getting around 0.740 F1, and another group got 0.740 and 0.738 over two runs.)

I released it this way because quite a lot of people are interested in applying an AMR parser to other problems, and the tokenized AMR version relies on some other tools that I find hard to integrate into a Python system.

If you want, I can give you a Java-based pre-processing script, but you would need to start from the tokenized AMR files and probably figure out how the pipeline should work.

jcyk commented 5 years ago

Thanks for your quick reply. Yes, I would like to try it. Thanks for sharing. My email: thisisjcykcd@gmail.com

ChunchuanLv commented 5 years ago

I emailed you the file. BTW, is your result from a model you trained yourself, or from running my pre-trained model?

jcyk commented 5 years ago

I trained a new model on LDC2017T10 following the instructions in the README, all with default settings.

StalVars commented 5 years ago

Is this Java-based pre-processing script the same as the one uploaded in the "AMR_FEATURE" folder?

ChunchuanLv commented 5 years ago

Yes.

sheng-z commented 5 years ago

Hi Chunchuan. Do I understand the conversation here correctly? If I train your model on the original AMR data (not the tokenized version, which you didn't make public), the model will only achieve 72.5 on LDC2017T10. Even if you reran on your tokenized version, you would only get 74.0 or 73.8 (slightly lower than the 74.4 reported in the paper).

Any chance I can get your tokenized AMR data? I have access to LDC2017T10 if you need proof of license. I'm curious what kinds of words are tokenized differently such that the score improves from 72.5 to 74.0.

ChunchuanLv commented 5 years ago

Hi Sheng,

It's here: https://drive.google.com/open?id=1P5fksyX8klfHaPWRyIIlRAF2WE3T7tgm. I think it is r2, but maybe you can double-check. Also, I put the Java preprocessing code here: https://github.com/ChunchuanLv/AMR_AS_GRAPH_PREDICTION/tree/master/AMR_FEATURE

Chunchuan

sheng-z commented 5 years ago

Great! Thanks for putting them together!

ghost commented 5 years ago

Did you or anybody else, by any chance, find out the reason for this tokenization phenomenon?

I ran the model multiple times and it always gets me slightly above 0.72 Smatch F1. It is a strong parser, but the 0.744 number in the paper seems a tiny bit optimistic to me.

ChunchuanLv commented 5 years ago

0.744 is obtained with the Java-based preprocessing starting from the tokenized version; the Python preprocessing starting from raw text gives around 0.72.
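For intuition, here is a purely hypothetical illustration (the sentence and token splits are invented, not taken from the actual data) of how automatic tokenization of raw text can diverge from a gold-tokenized AMR release:

```python
# Hypothetical illustration: automatic vs. gold tokenization of one sentence.
raw = "The U.S.-based company's CEO resigned."

# An automatic tokenizer run on raw text might keep the hyphenated compound:
auto_tokens = ["The", "U.S.-based", "company", "'s", "CEO", "resigned", "."]

# A gold-tokenized release might split it instead:
gold_tokens = ["The", "U.S.", "-", "based", "company", "'s", "CEO", "resigned", "."]

# Mismatches like this change token-to-concept alignments, which is one
# plausible source of the ~0.02 Smatch gap discussed in this thread.
print(auto_tokens == gold_tokens)  # False
```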

Chunchuan

ghost commented 5 years ago

Thank you. That's interesting. I am new to AMR parsing and thought that at test time you have to convert a sentence (with no gold tokenisation) to the AMR graph. But if people generally compute their AMR test scores with gold tokenisation, then it makes sense to put the optimistic number in the paper for comparison purposes. So to sum up, is it correct to say your parser gets 0.744 with gold tokenisation plus additional custom tokenisation, between 0.72 and 0.73 with gold tokenisation but no additional custom tokenisation, and around 0.72 with Stanford NLP predicted tokenisation? Thanks for your patience.

KiroSummer commented 5 years ago

Hi Chunchuan, I downloaded your data from Google Drive and then preprocessed it with the Java script. However, the model only gets a 73.7 F1 score on the dev data, trained with the default settings on the Java-preprocessed data. My questions: what is the "tokenized version" you mentioned above? Is the data you released on Google Drive the "tokenized version"? Or should I run the Java preprocessing script with some arguments?

Thanks for your patience.