Sempre Model Training - Githubissues

GindaChen commented 3 years ago

Sempre does not seem to provide a pre-trained model (I will ask them soon) that uses their Freebase copy directly. Instead, we will need to train on the dataset ourselves.

A training process is currently running on the CloudLab machine: 6 iterations in total, each over 512 examples from their provided dataset (called free917).
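As a rough back-of-the-envelope check on how long this run will take, the sketch below multiplies out the 6 iterations of 512 examples and divides by a throughput of about 50 examples per hour (the rate observed in the first problem listed below). It assumes that rate stays constant across iterations, which may well not hold. The function name is made up for illustration.

```python
def estimate_training_hours(iterations, examples_per_iteration, examples_per_hour):
    """Total wall-clock hours, assuming a constant processing rate (an assumption)."""
    total_examples = iterations * examples_per_iteration
    return total_examples / examples_per_hour


# 6 iterations x 512 examples at ~50 examples/hour
total_hours = estimate_training_hours(6, 512, 50)
print(f"{total_hours:.1f} hours in total")
```

With one hour already elapsed, this is in line with the "50+ hours" still to wait.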

There are a few problems:

- Training is very slow. They seem to provide only a single-threaded trainer (I could be wrong, as there are just too many parameters to set). It has been running for 1 hour and has processed only 50 examples. We will see if there is a way to accelerate training; otherwise, we will have to wait another 50+ hours for the model to be ready.
- Parsing will be hard. For example, suppose we have the utterance
```
Which characters are playable in sonic rush
```
Our expected target result (i.e. the answer in the training set) is the following:
```
(!fb:cvg.game_performance.character ((lambda x (fb:cvg.game_performance.game (var x))) fb:en.sonic_rush))
```
But Sempre will send us back some logical derivations as follows:

```
[score=87.654, prob=1.000, comp=1] (derivation (formula ((lambda x (!fb:cvg.game_performance.character (!fb:cvg.computer_videogame.characters (var x)))) fb:en.sonic_rush)) (value (list (name fb:en.tails "Miles \"Tails\" Prower") (name fb:en.knuckles_the_echidna "Knuckles the Echidna") (name fb:en.amy_rose "Amy Rose") (name fb:en.dr_robotnik "Dr. Eggman") (name fb:m.0ck8nbm "Egg Pawn") (name fb:en.sonic_the_hedgehog "Sonic the Hedgehog") (name fb:en.eggman_nega "Eggman Nega") (name fb:en.super_sonic "Super Sonic") (name fb:en.cream_the_rabbit "Cream the Rabbit") (name fb:m.0cj6nxm "Burning Blaze"))) (type fb:cvg.game_character))
[score=75.392, prob=4.73e-06, comp=1] (derivation (formula ((lambda x (!fb:cvg.game_performance.character (!fb:cvg.computer_videogame.characters (var x)))) fb:en.sonic_rush)) (value (list (name fb:en.tails "Miles \"Tails\" Prower") (name fb:en.knuckles_the_echidna "Knuckles the Echidna") (name fb:en.amy_rose "Amy Rose") (name fb:en.dr_robotnik "Dr. Eggman") (name fb:m.0ck8nbm "Egg Pawn") (name fb:en.sonic_the_hedgehog "Sonic the Hedgehog") (name fb:en.eggman_nega "Eggman Nega") (name fb:en.super_sonic "Super Sonic") (name fb:en.cream_the_rabbit "Cream the Rabbit") (name fb:m.0cj6nxm "Burning Blaze"))) (type fb:cvg.game_character))
```

These derivations do not guarantee that the original text is preserved anywhere in the derivation, nor that the returned entities are actually feasible.
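To make the mismatch easier to inspect, here is a minimal sketch of reading one of these derivations as an S-expression in Python. It is not part of Sempre: the helper names (`tokenize`, `parse`, `find`, `atoms`) are made up for illustration, and the sample is the first derivation above trimmed to two entries in the value list.

```python
def tokenize(text):
    """Split an S-expression into '(', ')', atoms, and ('str', value) tuples."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        elif ch == '"':
            # Quoted string; honour backslash escapes such as \"Tails\".
            i += 1
            buf = []
            while i < n and text[i] != '"':
                if text[i] == "\\" and i + 1 < n:
                    i += 1
                buf.append(text[i])
                i += 1
            i += 1  # skip the closing quote
            tokens.append(("str", "".join(buf)))
        else:
            j = i
            while j < n and not text[j].isspace() and text[j] not in '()"':
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


def parse(tokens):
    """Build nested Python lists from the token stream."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def find(node, head):
    """Depth-first search for the first list whose first element is `head`."""
    if isinstance(node, list):
        if node and node[0] == head:
            return node
        for child in node:
            hit = find(child, head)
            if hit is not None:
                return hit
    return None


def atoms(node):
    """Yield every plain atom (not quoted strings) in the tree."""
    if isinstance(node, list):
        for child in node:
            yield from atoms(child)
    elif isinstance(node, str):
        yield node


# First derivation from the output above, value list trimmed to two entries.
sample = r'''(derivation (formula ((lambda x (!fb:cvg.game_performance.character (!fb:cvg.computer_videogame.characters (var x)))) fb:en.sonic_rush)) (value (list (name fb:en.tails "Miles \"Tails\" Prower") (name fb:en.amy_rose "Amy Rose"))) (type fb:cvg.game_character))'''
expected = "(!fb:cvg.game_performance.character ((lambda x (fb:cvg.game_performance.game (var x))) fb:en.sonic_rush))"

tree = parse(tokenize(sample))[0]
formula = find(tree, "formula")[1]
expected_formula = parse(tokenize(expected))[0]
names = [(n[1], n[2][1]) for n in find(tree, "list")[1:]]
entities = [a for a in atoms(formula) if a.startswith("fb:en.")]

print(formula == expected_formula)  # the two formulas differ structurally
print(entities)                     # only the entity id survives, not the phrase
print(names)
```

The only trace of the utterance in the derived formula is the entity id `fb:en.sonic_rush`; the phrase "sonic rush" itself is not recoverable from this output.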

GindaChen commented 3 years ago

Looking back at the original text:

> To map English sentences to query sketches, we have implemented our own semantic parser on top of the Sempre ... For the linguistic processor, we leverage the pre-trained models of the Stanford CoreNLP [Manning et al. 2014] library for part-of-speech tagging and named entity recognition.

It really sounds like SQLizer wasn't directly using Sempre to do semantic parsing, but that isn't really possible because of how Sempre is structured. Either they inject their grammar into Sempre, or they simply use Stanford CoreNLP without any of the logical features. Also:

> Given an utterance u, our parser generates all possible query sketches S_i and assigns each S_i a score that indicates the likelihood that S_i is the intended interpretation of u. This score is calculated based on a set of pre-defined features. More precisely, given an utterance u and weight vector w, the parser maps each query sketch S_i to a d-dimensional feature vector ϕ(u, S_i) ∈ R^d and computes the likelihood score for S_i as the weighted sum of its features. Sqlizer uses approximately 40 features that it inherits from the Sempre framework. Examples of features include the number of grammar rules used in the derivation, the length of the matched input, whether a particular rule was used in the derivation, the number of skipped words in a part-of-speech tag etc.

Sempre never associates the score with features other than (1) the number of occurrences (or some weighted frequency) in the training set and (2) the calculated likelihood. I'm now wondering what the 40 features they describe are actually supposed to represent.
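For reference, the weighted-sum score from the quoted passage can be sketched in a few lines of Python. The feature names and weights below are hypothetical placeholders, not the actual 40 features. As a side check, the `score` / `prob` pairs in the two derivations printed earlier are consistent with a softmax over the scores, assuming those two derivations are the only candidates.

```python
import math


def score(weights, features):
    """Weighted sum w . phi(u, S_i), with both given as sparse dicts."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def softmax(scores):
    """Normalise scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical feature names and weights, for illustration only.
weights = {"num_rules": -0.5, "matched_length": 1.2, "skipped_words": -2.0}
sketch_features = {"num_rules": 3, "matched_length": 5, "skipped_words": 1}
print(score(weights, sketch_features))

# Scores copied from the two derivations in the Sempre output above.
probs = softmax([87.654, 75.392])
print(f"{probs[0]:.3f} {probs[1]:.2e}")  # 1.000 4.73e-06
```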

GindaChen commented 3 years ago

@chenhao-ye @ShawnZhong I just sent an email to one of the authors (Wang). Let's see if we can get a response after Thanksgiving. Hopefully I don't have to wrap my head around this for too long.

Some other references on how other people train this:

GindaChen commented 3 years ago

After a lot of trial and error, we might have to give up the Sempre path for the semantic parser. There are two reasons in our case:

- Sempre does not output a derivation tree. Sempre can only output the final derived query, not the intermediate steps to get there (as far as I can see, though this issue may shed some light on it). As a result, we are not able to map the original entity back to the NL phrases and generate "hints".
  - Part of this relies on the LexiconFn, which takes an NL phrase and generates its category / entity representation inside Freebase (a well-known knowledge graph). I tried to hack it so that it returns the pair (original phrase, mapped entity) to us, but sadly could not get that to work.
  - I have not succeeded in generating the grammar when we bypass the LexiconFn and try to define our own entity relations. That is just not a very promising path.
- Sempre's grammar language is not a complete Lisp. It mimics Lisp (in fact, the grammar rules are all written in Lisp syntax), but you can't even construct a closure or a list in it. That is exactly what we want: to let a leaf node return a list / closure so that we can later fetch the information at the root level.
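To make concrete what we were hoping for, here is a minimal Python sketch of a derivation tree whose leaves carry (original phrase, mapped entity) pairs that the root can collect as hints. This is not Sempre's API: the `Node` class, the rule names, and the phrase-to-entity mapping are all hypothetical, with the entity ids borrowed from the sonic rush example earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    rule: str                                   # hypothetical grammar category
    children: List["Node"] = field(default_factory=list)
    phrase: str = ""                            # original NL span (leaves only)
    entity: str = ""                            # mapped entity (leaves only)

    def hints(self) -> List[Tuple[str, str]]:
        """Collect (phrase, entity) pairs from all leaves, left to right."""
        if not self.children:
            return [(self.phrase, self.entity)] if self.entity else []
        collected = []
        for child in self.children:
            collected.extend(child.hints())
        return collected


# Hypothetical derivation for "Which characters are playable in sonic rush".
root = Node("$ROOT", [
    Node("$Relation", phrase="characters",
         entity="!fb:cvg.game_performance.character"),
    Node("$Entity", phrase="sonic rush", entity="fb:en.sonic_rush"),
])

for phrase, entity in root.hints():
    print(f"{phrase!r} -> {entity}")
```

The point of the design is that each leaf keeps its source phrase, so the mapping survives composition; this is the information Sempre's final query output drops.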

These two reasons mean that we can't extend Sempre beyond a simple, rigid, hand-crafted grammar. I really wonder how much code the SQLizer authors had to modify to make Sempre work as expected.

GindaChen commented 3 years ago

Problem solved. We are now using our own grammar file for the tests.

GindaChen / cs703-sqlizer

Sempre Model Training #17