AskNowQA / LC-QuAD

A data set of natural language queries with corresponding SPARQL queries
GNU General Public License v3.0

Corrections and basic grammar changes #9

Closed: ram-g-athreya closed this issue 4 years ago

ram-g-athreya commented 5 years ago

Hi

I have been using the LC-QuAD dataset as part of my thesis. While using it, I made some corrections based on grammar or on the intermediate question template.

I was hoping these changes could be incorporated into the official dataset to improve its overall quality.

Feel free to reach out regarding any issues or concerns in this regard.

Thanks, Ram G Athreya

RicardoUsbeck commented 5 years ago

Hi @geraltofrivia, what do you think?

geraltofrivia commented 5 years ago

Thank you, @ram-g-athreya, for all the effort you've put into this. However, I discussed this internally and we're unsure about making the suggested changes to the dataset, for two major reasons:

  1. There were some syntactic and lexical errors left in the dataset intentionally, to provide an additional challenge to systems.
  2. The current version has been used by multiple systems (soon to be incorporated into the leaderboard) which have benchmarked their performance on the dataset as it stands right now.

What do you think? @ram-g-athreya @RicardoUsbeck

RicardoUsbeck commented 5 years ago

Hi, I think the syntactic and lexical errors are not of that much value, since they are normally fixed in a preprocessing step. However, your comment is worthwhile and makes sense, especially w.r.t. the leaderboard http://lc-quad.sda.tech/ (shameless plug of a GERBIL QA http://gerbil-qa.aksw.org/gerbil/config integration here).
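
For illustration, a minimal sketch of what such a preprocessing step could look like. This assumes the pyspellchecker package and the `corrected_question` field of the JSON dumps; neither is part of this repo's tooling, and a real pipeline would need to protect entity mentions from being rewritten:

```python
# Hypothetical preprocessing sketch: normalise whitespace and spell-correct
# tokens in the questions before a QA system parses them.
# Assumes: pip install pyspellchecker, and a "corrected_question" field.
import json
import re

from spellchecker import SpellChecker

spell = SpellChecker()

def clean_question(text):
    text = re.sub(r"\s+", " ", text).strip()   # collapse stray whitespace
    tokens = text.split(" ")
    # keep the original token when the checker returns no suggestion
    fixed = [spell.correction(t) or t for t in tokens]
    return " ".join(fixed)

with open("train-data.json") as f:
    data = json.load(f)

for item in data:
    item["corrected_question"] = clean_question(item["corrected_question"])
```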

Thus, I suggest releasing both a 1.0 and a 1.fwe (fixed-writing-errors, or something like that) version here on GitHub and in the leaderboard, indicating that 1.0 is the version from your original ISWC paper and is used in the leaderboard, while 1.fwe contains the fixes for anyone who wants to train their system on clean data.

By the way, for LC-QuAD 2.0 (which I am sure will come with way more questions, templates and challenges 🥇) one should also think about "data quality".

saist1993 commented 5 years ago

Releasing a fixed version for the train split does make sense. The test split can remain the same, but one can train on the fixed train version.
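
As a rough sketch of that split strategy, one might load a corrected train file alongside the untouched test file. The name "train-data-fixed.json" is hypothetical (no such release exists yet); the other file name assumes the JSON dumps shipped with the dataset:

```python
# Train on a corrected split, evaluate on the original noisy test split.
import json

def load_split(path):
    with open(path) as f:
        return json.load(f)

train = load_split("train-data-fixed.json")  # corrected questions (hypothetical file)
test = load_split("test-data.json")          # original, noisy questions

print(len(train), "train questions /", len(test), "test questions")
```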

GERBIL sounds like a nice idea for evaluating over LC-QuAD. We had been planning to integrate it for quite some time but never found the time to get it done. Maybe we can have a short discussion about it sometime in the future.

LC-QuAD 2.0 is in the pipeline. We have started the initial experiments, but there is still a long way to go :angel:

geraltofrivia commented 5 years ago

Personally, I think this is the right way to go. If @ram-g-athreya and @RicardoUsbeck think this is prudent, shall I decline this PR and instead wait for @ram-g-athreya to send another one with changes to the train data only?

RicardoUsbeck commented 5 years ago

Thanks for your replies.

Looking forward to your thoughts!

RicardoUsbeck commented 5 years ago

Excuse my continuous mumbling: I just remembered that we introduced syntactic and other user-driven mistakes in QALD-8, and the general response from the participants and the audience was that this was senseless. I have the results somewhere from a survey of 20 QALD participants (QALD 1 through QALD 8) that covers this. Not sure if that is helpful or points in a particular direction.

gychant commented 4 years ago

Hi, I am wondering what the status of the corrections mentioned above is. Since more and more papers are using LC-QuAD 1.0 for evaluation, and the intentionally added mistakes make the data preprocessing stage tricky and less transparent, a cleaner version would help the community know what performance those systems can really achieve in terms of understanding the semantics of natural language queries. Thanks!

geraltofrivia commented 4 years ago

Hi @gychant, the authors of the paper (and dataset) believe that any approach tackling the challenge should be able to handle misspellings and minor noise in the questions. True, this makes the task more challenging, but we believe it's a step in the right direction.

Based on the above, we have decided to let these syntactic mistakes remain in the dataset, in both train and test instances. The former enables statistical models to train on noisy data, which is generally thought to make them more robust to noise when deployed. Likewise, noisy test instances ensure that performance on the dataset is more representative of how these approaches would fare if they were used by the general public.

Thus, we will not be merging this PR into the repo.

> what real performance those systems can achieve

As mentioned above, real performance shouldn't be thought of as an approach's ability to transform perfectly grammatical natural language into formal language, but rather as an estimate of how these approaches would perform when used by real users. We cannot expect real users to write lexically and semantically correct questions, and thus we believe that performance on the dataset as it stands is a closer estimate of real performance than it would be if we merged this PR.