Attempto / ACE-in-GF

Attempto Controlled English (http://attempto.ifi.uzh.ch/) in Grammatical Framework (https://www.grammaticalframework.org/)

Update to work with latest RGL + add PGF #14

Closed inariksit closed 3 years ago

inariksit commented 3 years ago

This compiles with the RGL commit 4f821ca621a418bba1a306b00063617307fba415 and the latest release of GF 3.11.

danshaub commented 3 years ago

Thank you again for working on this so quickly.

I've been able to reproduce your compilation and have been working on a Docker image, so if someone else needs the same setup, it'll keep working as long as Docker Hub exists.

My end goal is to get the script run-precision-test.bash to run so I can use it to generate a large number of Attempto sentences. Working backwards through that script's dependencies, I discovered that the files generated by running make-pgf.bash are necessary. That script in turn relies on the file words/clex/ClexAce.gf, which is generated by running an ACE lexicon written in Prolog through the transpiler script words/clex/clex_to_gf.pl (wrapped by words/clex/build.sh).

I was able to download the appropriate lexicon from https://github.com/Attempto/Clex/blob/master/clex_lexicon.pl and run it through the script, but there was an error with the gf command in the build.sh script.

gf +RTS -K100M -RTS --preproc=mkPresent --make --optimize-pgf --mk-index --path $path Clex*.gf
unrecognized option `--mk-index'

You may want to try --help.

I wasn't sure how to modify that command correctly, so I moved on to the make-pgf.bash script. Sadly, the compiler spat back a few errors from grammars/ace/SymbolsACE.gf. Here's the output:

Making output directories (if needed)
Building PGF from:
words/clex/ClexAce.gf
grammars/ace/SymbolsACE.gf:8:
  Happened in the renaming of times_Term
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:5:
  Happened in the renaming of plus_Term
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:9:
  Happened in the renaming of neg_Term
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:17:
  Happened in the renaming of ne_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:6:
  Happened in the renaming of minus_Term
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:13:
  Happened in the renaming of lt_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:15:
  Happened in the renaming of le_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:10:
  Happened in the renaming of int_Term
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:14:
  Happened in the renaming of gt_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:16:
  Happened in the renaming of ge_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:12:
  Happened in the renaming of eq_Formula
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:7:
  Happened in the renaming of div_Term
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
   constant not found: Term
   given Symbols, SymbolsACE
done.

I have very little experience with GF, so I wasn't able to fix these errors. If you get a chance, could you take a look at this? I'll make a pull request to your update-RGL branch with what I was able to do.

Thank you :)

inariksit commented 3 years ago

Ok, that's helpful to know what exactly you're trying to do! :grin: I'll take a look at the SymbolsACE module next.

I am also running into problems with unrecognised flags when trying to run the old scripts. My first suggestion would be just to try the command again without the flag—if it was removed, it probably wasn't anything crucial. For instance, --mk-index seems to have been an old optimisation flag for producing a smaller PGF; gf will produce an equally functional PGF even without it.

inariksit commented 3 years ago

@danshaub Now the file ACE_0_0_2.pgf compiles, and I even get a bit more interesting text :grin:

be able to have yourself ! be him / her , not everything !
who doesn't it have ? have not everything , he and they !

Thanks for committing the Clex files. However, the concrete syntax ClexAce has no linearisations, so there's still no lexicon. It also had a syntax error, which I fixed to make it compile, but since there are no content words, all I can generate is sentences with function words and pronouns.

I didn't try to do anything more, just to get make-pgf.bash working. Again, if you run into further problems, just report here and I'll see what I can do! :slightly_smiling_face:

danshaub commented 3 years ago

@inariksit I was also able to get some sentences to generate! I think I'll be able to add linearizations for all the words in the Clex files just by emulating the linearizations in LexACE.gf and similar files. I'll be working on that today and submit a pull request when I'm able to get it to work.

Thanks again for all the help :)

danshaub commented 3 years ago

@inariksit I still haven't quite been able to get things rolling with the Clex files. The only progress I've been able to make is modifying words/clex/build.sh so that it includes the acewiki_aceowl grammar in the path of the gf command:

clex='./clex_lexicon.pl'

ace="../../lib/src/ace/"
api="../../lib/src/api/"
grammar="../../grammars/ace/:../../grammars/acewiki_aceowl/"

path="present:${grammar}:${ace}:${api}"

# swipl -f none -g "main('$clex')" -t halt -s clex_to_gf.pl

gf +RTS -K100M -RTS --preproc=mkPresent --make --optimize-pgf --path $path Clex*.gf

Running the compilation resulted in only partial success, the only errors being the lack of linearizations for Clex.gf and a series of strange errors saying there's a conflict between CatEng.gf and ACEAce.gf or AttemptoAce.gf:

ClexAce.gf:7:
  Happened in the renaming of aceV3
   Warning: atomic term V3
            conflict CatEng.V3, ACEAce.V3
            given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
                  CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:6:
  Happened in the renaming of aceV2
   Warning: atomic term V2
            conflict CatEng.V2, AttemptoAce.V2
            given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
                  CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:5:
  Happened in the renaming of aceV
   Warning: atomic term V
            conflict CatEng.V, AttemptoAce.V
            given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
                  CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:9:
  Happened in the renaming of aceA2
   Warning: atomic term A2
            conflict CatEng.A2, AttemptoAce.A2
            given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
                  CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:8:
  Happened in the renaming of aceA
   Warning: atomic term A
            conflict CatEng.A, AttemptoAce.A
            given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
                  CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce

Could you take a look at these errors?

Also, for making linearizations of the contents in Clex.gf, would simply adding something along the lines of possible_A = mkA "possible" ; to ClexAce.gf work (given a corresponding line possible_A : A ; within Clex.gf)?

inariksit commented 3 years ago

Could you take a look at these errors?

Those are warnings, not errors. It's saying that there are two cats called A in scope, one from the CatEng module and the other from the AttemptoAce module. If it produced a working PGF file, then they can be ignored. However, if you want to remove the warnings, you can change ClexAce to be like this:

aceV : (_,_:Str) -> AttemptoAce.V = \go,goes -> mkV go goes "~" "~" "~";
aceV2 : (_,_,_:Str) -> AttemptoAce.V2 = \go,goes,gone -> mkV2 (mkV go goes "~" gone "~");
aceV3 : (_,_,_,_:Str) -> ACEAce.V3 = \go,goes,gone,prep -> mkV3 (mkV go goes "~" gone "~") (mkPrep prep);
aceA : (_,_,_:Str) -> AttemptoAce.A = \good,better,best -> mkA good better best "~";
aceA2 : (_,_,_:Str) -> AttemptoAce.A2 = \good,better,best -> mkA2 (aceA good better best) "";
}

i.e. explicitly naming that you want to use the category A from AttemptoAce.

Also, for making linearizations of the contents in Clex.gf, would simply adding something along the lines of possible_A = mkA "possible" ; to ClexAce.gf work (given a corresponding line possible_A : A ; within Clex.gf)?

Yes, you will need a corresponding lin in the concrete for every fun in the abstract. Except that I believe it should be using the aceA, aceV etc. constructors instead of mkA, mkV etc. @Kaljurand maybe has more insight?
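
Since every abstract fun needs a matching concrete lin, generating the concrete mechanically is an option. Here's a minimal Python sketch of the idea; the regex, the one-argument paradigm mapping, and the underscore-to-space convention are my assumptions about the Clex naming scheme, not the actual clex_to_gf.pl logic:

```python
import re

# Hypothetical mapping from abstract categories to one-argument paradigms.
PARADIGMS = {"A": "mkA", "N": "mkN", "V": "mkV", "V2": "mkV2"}

def lin_for(fun_line):
    """Turn an abstract fun like 'possible_A : A ;' into a naive lin."""
    m = re.match(r"\s*(\w+)_(\w+)\s*:\s*(\w+)\s*;", fun_line)
    if not m:
        return None
    stem, suffix, cat = m.groups()
    paradigm = PARADIGMS.get(cat)
    if paradigm is None:
        return None  # category we don't know how to linearise
    words = stem.replace("_", " ")  # assume multiwords use underscores
    return f'{stem}_{suffix} = {paradigm} "{words}" ;'

print(lin_for("possible_A : A ;"))  # possible_A = mkA "possible" ;
```

This only produces a first approximation: one-argument paradigms guess regular inflection, so irregular and multiword entries still need attention by hand.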

danshaub commented 3 years ago

I spent some time trying to get the linearizations to work, with no luck. Some word types didn't have ace versions of their constructors (and others didn't have any constructors at all). Those that did, namely aceV, aceV2, aceV3, aceA, and aceA2, still gave errors while compiling. They were all of the form:

ClexAce.gf:
   ClexAce.gf:55354:
     Happened in linearization of trim_off_V2
       type of aceV2 ("trim" ++ "off" ++ [])
      expected: {s : ResEng.VForm => Str; c2 : Str;
                 isRefl : Prelude.Bool; lock_V2 : {}; p : Str}
      inferred: Str -> Str -> {s : ResEng.VForm => Str; c2 : Str;
                               isRefl : Prelude.Bool; lock_V2 : {}; p : Str}

   ** Maybe you gave too few arguments to aceV2

ClexAce.gf:
   ClexAce.gf:11:
     Happened in linearization of Fahrenheit_A
       type of aceA "Fahrenheit"
      expected: {s : ResEng.AForm => Str; isMost : Prelude.Bool;
                 isPre : Prelude.Bool; lock_A : {}}
      inferred: Str -> Str -> {s : ResEng.AForm => Str;
                               isMost : Prelude.Bool; isPre : Prelude.Bool; lock_A : {}}

   ** Maybe you gave too few arguments to aceA

etc.

I'm not sure what the other arguments should be if that error message is to be believed or how to parse the expected type. Any insights you can give on this?

The good news is that there are no more warnings from the overload of mkV, etc. that used to be there! So thank you for that fix :)

inariksit commented 3 years ago

Hi! I was on holiday last week, and only read your reply now.

The error, "maybe you gave too few arguments to aceV2", is just what it says: the aceA etc. constructors expect two or more arguments (e.g. aceV "sleep" "sleeps"), but you gave only one (aceV "sleep"). Sorry if my suggestion was misleading; I was just guessing based on those constructors existing. But now that I look at the definitions, the `ace*` constructors are just calling the standard `mk*` constructors, so it will definitely compile if you use them directly.

Furthermore, all those standard constructors have a one-argument version, so you can use them on just the dictionary form of the word—it will compile, and the result will be right in most cases, but wrong for irregular words (e.g. child-children, sing-sang-sung). In addition, multiword entries need more attention: e.g. to get "trim off" to work properly, you need this:

mkV2 (partV (mkV "trim") "off") ;

That will give the proper inflection table,

s = table {
      VInf => "trim" ;
      VPres => "trims" ;
      VPPart => "trimmed" ;
      VPresPart => "trimming" ;
      VPast => "trimmed" } ;
p = "off"

Whereas mkV2 "trim off" will give you a table like "trim offs, trim offed, …".
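
For intuition, here is a toy Python sketch of what a one-argument smart paradigm does: it guesses the remaining forms from the dictionary form using a few spelling rules. The rules here are a drastic simplification of what the RGL's English paradigms actually do:

```python
def guess_verb_forms(inf):
    """Toy guess of an English verb's forms from its infinitive."""
    vowels = "aeiou"
    # 3rd person singular present
    if inf.endswith(("s", "sh", "ch", "x", "z")):
        pres = inf + "es"
    elif inf.endswith("y") and inf[-2] not in vowels:
        pres = inf[:-1] + "ies"
    else:
        pres = inf + "s"
    # Stem for -ed/-ing: double a final consonant after a single vowel
    # (trim -> trimm-), drop a final -e (bake -> bak-).
    if (len(inf) >= 3 and inf[-1] not in vowels + "wxy"
            and inf[-2] in vowels and inf[-3] not in vowels):
        stem = inf + inf[-1]
    elif inf.endswith("e"):
        stem = inf[:-1]
    else:
        stem = inf
    return {"VInf": inf, "VPres": pres, "VPast": stem + "ed",
            "VPPart": stem + "ed", "VPresPart": stem + "ing"}

print(guess_verb_forms("trim")["VPast"])  # trimmed
print(guess_verb_forms("sing")["VPast"])  # singed -- wrong, as expected
```

This is exactly why irregular verbs like sing-sang-sung come out wrong from a one-argument constructor and need their forms spelled out.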

Given these additional complications, I wonder what the original script to produce the lexicon contained. Did it simply produce a lexicon that was knowingly wrong but "good enough", or were these special cases taken into account?

inariksit commented 3 years ago

Anyway, the short answer is:

You will need to look at the documentation to see what is and isn't in the smart paradigms for English: http://www.grammaticalframework.org/lib/doc/synopsis/index.html#toc94

danshaub commented 3 years ago

I didn't think much of it at the time, but the script used to generate Clex.gf and ClexAce.gf from the original Prolog lexicon was broken. Your mentioning the script made me want to take a second look at it. I spent some time today getting it working again and, lo and behold, everything compiled! It turns out something weird was going on with a built-in Prolog predicate (append/2): it was halting when attempting to concatenate some strings. I swapped that for concat/3, and that happened to work.

There were a few terms that didn't get linearizations, but that isn't a huge deal. The only issue now is that, since the lexicon is so massive, the gr and gt commands seem to stall (so I guess there's always something!). In any case, the new commits I made are in the same pull request I have open against your fork.

Thank you so much for the help!

Kaljurand commented 3 years ago

Regarding the size of Clex, you could use the smaller version that is distributed together with APE, e.g. the build script could just download it from GitHub:

echo "Downloading Clex..."
#curl -L https://raw.github.com/Attempto/Clex/master/clex_lexicon.pl > clex.pl
# Smaller version of Clex, which is distributed together with APE
curl -L https://raw.github.com/Attempto/APE/master/prolog/lexicon/clex_lexicon.pl > clex.pl

echo "Converting Clex to GF..."
swipl -f none -g "main('clex')" -t halt -s clex_to_gf.pl

Even better would be to download it as part of the clex_to_gf.pl script, i.e. in Prolog (for an example of how to do that, see ensure_clex in https://github.com/Attempto/APE/blob/master/tests/downloader.pl).

(In the final merge request I'd like to exclude large files like the Clex.)

danshaub commented 3 years ago

I'll look into downloading the lexicon within clex_to_gf.pl

In what I've done so far, Grammatical Framework doesn't seem to like the degree symbol included in mn_pl('°C', '°C') and mn_sg('°C', '°C'), so I'll have to make sure those lines are excluded when the script is run. I should have time to do that tomorrow.

I've been working with the smaller lexicon since I got your response, and even with it, building trees from baseText (sText ?) seems to get stuck very frequently. Does either of you know what causes the gr function to spin like that? I've tried setting the max depth as low as 4 and it still doesn't work consistently.

I'll push what I have now to my fork. I'm using the make-pgf.sh script to build with the Clex files.

inariksit commented 3 years ago

the gr and gt commands seem to stall

Yes, the gt command will definitely be unhappy about such a big lexicon and grammar :sweat_smile: Consider a tree of this depth: baseText (impVP (a2VP mad_about_A2 (everyNP (cn_as_VarCN (adjCN (comparAP young_A) zip_code_N))))) ("be mad-about- every younger zip-code"); there are 3301068879561 trees up to that size.

To take a slightly larger tree, consText (sText (s (if_thenS (falseS (thereNP (someCollNP girl_N))) (thereNP (someCollNP girl_N))))) (baseText (impVP (vVP accept_V))) ("if it is false that there are some girls then there are some girls . do accept !"), there are 106725330795495054412 possible trees up to that size. So gt for the whole grammar is out of the question :grin:

But I'm surprised that gr is also slow. I tested it myself, and sometimes it generates a sentence immediately, sometimes it takes a long time, but pressing Ctrl+C kills it immediately. I don't know why this happens.

(Offtopic: how did I get those big numbers? By running gftest with the --count-trees flag—it's using this magical enumeration stuff so it can just count the trees without actually generating them.)

So is this the next issue you're facing? Would you like to have a systematic generation of ACE examples that is exhaustive by some other metric than "produce literally infinite amount of trees"? I may be able to help you with that, but we need to step out of the GF shell to do that.

In what I've done so far, Grammatical Framework doesn't seem to like the degree symbol included in mn_pl('°C', '°C') and mn_sg('°C', '°C'), so I'll have to make sure those lines are excluded when the script is run.

Yes, if you try to make a GF identifier that includes Unicode, you need to wrap the whole identifier name in '', like 'å_N' : N ;.
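
If the GF source is generated by a script, the quoting can be applied automatically. A small Python sketch (assuming, conservatively, that any identifier containing a non-ASCII character gets wrapped in single quotes):

```python
def gf_ident(name):
    """Wrap a GF identifier in quotes if it contains non-ASCII characters."""
    if all(ord(c) < 128 for c in name):
        return name
    return "'" + name + "'"

print(gf_ident("man_N"))  # man_N
print(gf_ident("å_N"))    # 'å_N'
```

A real generator would likely also need to quote identifiers containing other characters GF rejects, such as the degree symbol.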

danshaub commented 3 years ago

So is this the next issue you're facing? Would you like to have a systematic generation of ACE examples that is exhaustive by some other metric than "produce literally infinite amount of trees"? I may be able to help you with that, but we need to step out of the GF shell to do that.

@inariksit Well, I certainly don't need an infinite number of sentences haha! I also don't really need the example sentences to be exhaustive in any way, except for maybe each word in the lexicon making an appearance. My end goal with this is to use the sentences as data for a neural network, so about 1,000,000 sentences, all with similar lengths to those in tests/ace/sentences.txt, would be ideal. I'm curious what tools are available outside the GF shell.

Also, I noticed that many times, even with relatively low max depths, sentences just become jumbles of nothing that are upwards of 20 words long. Is there a way to limit the sentence length as well as the max depth?


I probably should have described my project in more detail earlier, but better late than never! I'm working on a system that automatically "translates" natural English into ACE, similar to what the researchers did in this paper: Rewriting simplified text into a controlled natural language. Whereas that team decided to use a rule-based system, I want to build a system that is trained like an autoencoder but where the output necessarily conforms to the grammar of ACE. I haven't committed to any system architecture yet, but I'll most likely use a seq2seq model with attention.

The example sentences I want to generate will be used as both the input and target output of this model during training. This is my workaround for the lack of a dataset of pairs of Natural English and ACE where the Natural English couldn't be directly parsed by APE. Since every ACE sentence is also a valid English sentence, this solution seems like the best option that is also feasible.

Lastly, I'm aware that the gr function in the GF shell allows for probability weights; do you think I would be able to leverage that in the decoder for my system?

inariksit commented 3 years ago

I'm curious what tools are available outside the GF shell

There's no single pre-existing tool that does what you want; I was just thinking of writing some custom code using the PGF library.

For instance, if you would be happy to take the 2700 sentences in sentences.txt as templates and plug different words into them, so that in the end you use the whole lexicon, that would be quite feasible to do: I believe generateFrom takes an expression with a metavariable (a sort of hole in the tree) and generates stuff in the hole. So if the hole is of, say, a noun type, then it would put every noun in the hole.
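
I don't have the exact generateFrom API at hand, so here is just a string-level Python sketch of the template idea: each template tree has a hole marked ?Cat, and every lexicon entry of that category gets plugged in. The hole syntax and tree strings are illustrative only:

```python
from itertools import product

def fill_templates(templates, lexicon):
    """Plug every lexicon entry into the matching ?Cat hole of each template."""
    trees = []
    for tmpl in templates:
        holes = [cat for cat in lexicon if "?" + cat in tmpl]
        if not holes:
            trees.append(tmpl)
            continue
        # Cartesian product over the word lists of all holes present.
        for words in product(*(lexicon[cat] for cat in holes)):
            t = tmpl
            for cat, w in zip(holes, words):
                t = t.replace("?" + cat, w, 1)
            trees.append(t)
    return trees

templates = ["baseText (impVP (vVP ?V))"]
lexicon = {"V": ["accept_V", "wait_V"]}
print(fill_templates(templates, lexicon))  # one tree per verb
```

With a couple of thousand sentence templates and the Clex words as the lexicon, this kind of expansion reaches large sentence counts without unbounded random generation.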

Also, I noticed that many times, even with relatively low max depths, sentences just become jumbles of nothing that are upwards of 20 words long. Is there a way to limit the sentence length as well as the max depth?

There's no direct mechanism for that. The PGF library (as well as the GF shell) work on tree depth, and the gftest tool works on the number of constructors, but neither of them has any access to the number of words in the linearisation. Someone had a similar problem, and Aarne wrote a grammar that does exactly 3-word sentences here. Of course, the 3-word grammar itself is not usable to you, but the discussion can still be interesting to read, to understand better the technical limitations.
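
Since neither tree depth nor constructor count maps directly to word count, one workaround is to over-generate and then post-filter the linearisations, e.g. with a few lines of Python:

```python
def filter_by_length(sentences, max_words):
    """Keep only linearisations with at most max_words whitespace tokens."""
    return [s for s in sentences if len(s.split()) <= max_words]

lins = ["do accept !",
        "be able to have yourself ! be him / her , not everything !"]
print(filter_by_length(lins, 5))  # ['do accept !']
```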

Now that I know your scope, I would actually suggest that you write about your project on the GF mailing list and ask for suggestions! I know that other people have used GF in data augmentation, so they would probably have some insights that I don't have.

inariksit commented 3 years ago

@danshaub Are we btw done with the changes in the grammars? If so, I will clean up (e.g. remove the useless PGF from 80e55da and lexicon files). Just want to confirm with you so I won't mess up your stuff by force pushing.

@Kaljurand you said you want to exclude the generated lexicon files, that's fine–we'll just push the updated script(s)! How about the PGF file generated by make-pgf.bash?

Kaljurand commented 3 years ago

@inariksit I'd prefer to create a tag, make a corresponding release, and attach all larger generated files (e.g. PGFs, GF source for the lexica, etc.) to this release, but otherwise keep them out of the repository.

danshaub commented 3 years ago

@inariksit We should be done changing the grammars now :) feel free to remove any useless PGF files, etc. The pull request I made should have everything, so make sure to merge it before deleting anything.

There's no single pre-existing tool that does what you want; I was just thinking of writing some custom code using the PGF library.

For instance, if you would be happy to take the 2700 sentences in sentences.txt as templates and plug different words into them, so that in the end you use the whole lexicon, that would be quite feasible to do: I believe generateFrom takes an expression with a metavariable (a sort of hole in the tree) and generates stuff in the hole. So if the hole is of, say, a noun type, then it would put every noun in the hole.

That sounds like a great idea! Would you be willing to meet live to discuss this? It's clear we're on different time zones, but I wouldn't mind meeting in the evening in my time zone. I am in Pacific Standard Time and would be fine to meet any time before about 11:00 PM in that time zone.

@Kaljurand That pull request has some scripts that directly download and build with the lexicons in the Attempto repositories. I also included a script that downloads a fork of the Clex repo with all accented words removed, so they don't potentially mess with Grammatical Framework. I was able to reuse your Prolog download code, but I ran into a strange error: any lexicon downloaded from within Prolog would replace the degree symbol '°' with an unknown-character placeholder, so I included a curl command within the bash scripts. My most recent commit should have the clex folder all cleaned up :)

inariksit commented 3 years ago

@danshaub sure! I'll get back to you when I have a suggestion for a time.

inariksit commented 3 years ago

I've cleaned up the commits now, I hope I didn't accidentally remove something crucial. @danshaub would you like to test that everything still works?

@Kaljurand I think someone other than me can do the release–I am really not sure what to include, I just like fixing GF issues. :-P Would @danshaub be interested in doing that?

Kaljurand commented 3 years ago

Thanks @inariksit and @danshaub ! I've been busy with vacationing so far, but will start looking into this pull request in the coming days, and will try to merge it. Let me know if you still want to push something before the merge.

Kaljurand commented 3 years ago

I think this pull request can be merged. It achieves its main goals.

Thanks a lot! :)

Just for the record, here is a list of things that I noticed during my testing, and that could be fixed in future checkins (e.g. by me):

The general task is to review how close (the new) ACE-in-GF actually is to ACE. It has never been 100% compatible, of course; that's why there are the various experiments with subsets like OWL and Codeco. But maybe the recent changes have introduced incompatibilities that could be easily avoided. E.g. "gr | l" generates:

recreate yourself , nothing Z !

while correct ACE would be:

something Z , recreate yourself !

This should also be kept in mind when randomly generating ACE sentences with the ACE-in-GF grammar; e.g. it would make sense to control the generation in various ways:

  1. limit the content word lexicon to ~10 meta words like "SingularNoun", "PluralVerb", ...
  2. change the grammar to remove some top level constructs that one might not be interested in, e.g. commands (i.e. the things that end with "!")
  3. generate with the resulting grammar, but keep only the sentences that are accepted by the official ACE parser (https://github.com/Attempto/APE)
  4. use the chosen sentences as templates, where the content word place holders are filled with the actual words (e.g. from Clex) to generate the final training data
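
Steps 2 and 3 above could be sketched as a post-filter over the generated sentences; the accepts argument below stands in for a call to the real APE parser, which is not wired up here:

```python
def select_templates(sentences, accepts):
    """Drop commands (step 2), keep only parser-accepted sentences (step 3)."""
    kept = []
    for s in sentences:
        if s.rstrip().endswith("!"):  # step 2: skip commands
            continue
        if accepts(s):                # step 3: defer to the official parser
            kept.append(s)
    return kept

# Toy stand-in for APE: accept anything ending in a full stop.
demo_accepts = lambda s: s.rstrip().endswith(".")
sents = ["do accept !", "a SingularNoun waits .", "who has it ?"]
print(select_templates(sents, demo_accepts))  # ['a SingularNoun waits .']
```
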
inariksit commented 3 years ago

document how to get "ghc --make -o Parser Parser.hs" etc. to succeed (currently complains that: Could not find module ‘PGF’)

I would recommend using Stack: add package.yaml and stack.yaml files to the repo, and link to https://docs.haskellstack.org/en/stable/install_and_upgrade/ or just paste the instructions for installing Stack.

danshaub commented 3 years ago

@Kaljurand I'd be happy to help out with any of those future updates now that I'm a bit more integrated into GF and Attempto. Just reach out at any time :)

Both of you, thank you so much for the help in getting this all working! I'll be sure to keep you updated on the progress of my project as it continues.