inariksit closed this 3 years ago
Thank you again for working on this so quickly.
I've been able to repeat your compilation and have been working on a Docker image, so that anyone else who needs this same setup can use it for as long as Docker Hub exists.
My end goal is to get the script run-precision-test.bash to run so I can use it to generate a large number of Attempto sentences. Working back through what that script depends on, I discovered that the files generated by running make-pgf.bash are necessary. This script in turn relies on the file words/clex/ClexAce.gf, which is generated by running an ACE lexicon written in Prolog through the transpiler script words/clex/clex_to_gf.pl (wrapped by words/clex/build.sh).
I was able to download the appropriate lexicon from https://github.com/Attempto/Clex/blob/master/clex_lexicon.pl and run it through the script, but there was an error with the gf command in the build.sh script:
gf +RTS -K100M -RTS --preproc=mkPresent --make --optimize-pgf --mk-index --path $path Clex*.gf
unrecognized option `--mk-index'
You may want to try --help.
I wasn't sure how to modify that command correctly, so I went on to the make-pgf.bash script. Sadly, the compiler reported a few errors within grammars/ace/SymbolsACE.gf. Here's the output:
Making output directories (if needed)
Building PGF from:
words/clex/ClexAce.gf
grammars/ace/SymbolsACE.gf:8:
Happened in the renaming of times_Term
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:5:
Happened in the renaming of plus_Term
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:9:
Happened in the renaming of neg_Term
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:17:
Happened in the renaming of ne_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:6:
Happened in the renaming of minus_Term
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:13:
Happened in the renaming of lt_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:15:
Happened in the renaming of le_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:10:
Happened in the renaming of int_Term
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:14:
Happened in the renaming of gt_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:16:
Happened in the renaming of ge_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:12:
Happened in the renaming of eq_Formula
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
grammars/ace/SymbolsACE.gf:7:
Happened in the renaming of div_Term
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
constant not found: Term
given Symbols, SymbolsACE
done.
I have very little experience with GF, so I wasn't able to fix these errors. If you get a chance, could you take a look at this? I'll make a pull request to your update-RGL branch with what I was able to do.
Thank you :)
Ok, that's helpful to know what exactly you're trying to do! :grin: I'll take a look at the SymbolsACE module next.
I am also running into problems with unrecognised flags when trying to run the old scripts. My first suggestion would be just to try that command again without the flag; if it was removed, it probably wasn't anything crucial. For instance, the --mk-index flag seems to have been an old optimisation flag, to produce a smaller PGF. But the compiler will produce an equally functional PGF even without the flag.
@danshaub Now the file ACE_0_0_2.pgf compiles, and I get even a bit more interesting text :grin:
be able to have yourself ! be him / her , not everything !
who doesn't it have ? have not everything , he and they !
Thanks for committing the Clex files. However, the concrete syntax ClexAce has no linearisations, so there's still no lexicon. It also had a syntax error, which I fixed to make it compile, but since there are no content words, all I can generate is sentences with function words and pronouns.
I didn't try to do anything more, just to get make-pgf.bash working. Again, if you run into further problems, just report here and I'll see what I can do! :slightly_smiling_face:
@inariksit I was also able to get some sentences to generate! I think I'll be able to add linearizations for all the words in the Clex files just by emulating the linearizations in LexACE.gf and similar files. I'll be working on that today and will submit a pull request when I'm able to get it to work.
Thanks again for all the help :)
@inariksit I still haven't quite been able to get things rolling with the Clex files. The only progress I've been able to make is modifying words/clex/build.sh so that it includes the acewiki_aceowl grammar in the path of the gf command:
clex='./clex_lexicon.pl'
ace="../../lib/src/ace/"
api="../../lib/src/api/"
grammar="../../grammars/ace/:../../grammars/acewiki_aceowl/"
path="present:${grammar}:${ace}:${api}"
# swipl -f none -g "main('$clex')" -t halt -s clex_to_gf.pl
gf +RTS -K100M -RTS --preproc=mkPresent --make --optimize-pgf --path $path Clex*.gf
Running this gave only a partially successful compilation. The only errors were the missing linearizations for Clex.gf and a series of strange messages saying there's a conflict between CatEng.gf and ACEAce.gf or AttemptoAce.gf:
ClexAce.gf:7:
Happened in the renaming of aceV3
Warning: atomic term V3
conflict CatEng.V3, ACEAce.V3
given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:6:
Happened in the renaming of aceV2
Warning: atomic term V2
conflict CatEng.V2, AttemptoAce.V2
given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:5:
Happened in the renaming of aceV
Warning: atomic term V
conflict CatEng.V, AttemptoAce.V
given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:9:
Happened in the renaming of aceA2
Warning: atomic term A2
conflict CatEng.A2, AttemptoAce.A2
given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
ClexAce.gf:8:
Happened in the renaming of aceA
Warning: atomic term A
conflict CatEng.A, AttemptoAce.A
given ParadigmsAce, SyntaxAce, SymbolsC, SymbolsACEC, CommonX,
CatEng, NumeralEng, NumeralAce, AttemptoAce, ACEAce, ClexAce
Could you take a look at these errors?
Also, for making linearizations of the contents in Clex.gf, would simply adding something along the lines of possible_A = mkA "possible" ; to ClexAce.gf work (given a corresponding line possible_A : A ; within Clex.gf)?
Could you take a look at these errors?
Those are warnings, not errors. The compiler is saying that there are two cats called A in scope, one from the CatEng module and the other from the AttemptoAce module. If it produced a working PGF file, then the warning can be ignored. However, if you want to remove the warning, you can change ClexAce to be like this:
aceV : (_,_:Str) -> AttemptoAce.V = \go,goes -> mkV go goes "~" "~" "~";
aceV2 : (_,_,_:Str) -> AttemptoAce.V2 = \go,goes,gone -> mkV2 (mkV go goes "~" gone "~");
aceV3 : (_,_,_,_:Str) -> ACEAce.V3 = \go,goes,gone,prep -> mkV3 (mkV go goes "~" gone "~") (mkPrep prep);
aceA : (_,_,_:Str) -> AttemptoAce.A = \good,better,best -> mkA good better best "~";
aceA2 : (_,_,_:Str) -> AttemptoAce.A2 = \good,better,best -> mkA2 (aceA good better best) "";
}
i.e. explicitly naming that you want to use the category A from AttemptoAce.
Also, for making linearizations of the contents in Clex.gf, would simply adding something along the lines of possible_A = mkA "possible" ; to ClexAce.gf work (given a corresponding line possible_A : A ; within Clex.gf)?
Yes, you will need a corresponding lin in the concrete for every fun in the abstract. Except that I believe it should be using the aceA, aceV etc. constructors instead of mkA, mkV etc. @Kaljurand maybe has more insight?
I spent some time trying to get the linearizations to work with no luck. Some word types didn't have ace versions of their constructors (and others still didn't have any constructors at all). Those that did, namely aceV, aceV2, aceV3, aceA, and aceA2 still gave errors while compiling. They were all in the form:
ClexAce.gf:
ClexAce.gf:55354:
Happened in linearization of trim_off_V2
type of aceV2 ("trim" ++ "off" ++ [])
expected: {s : ResEng.VForm => Str; c2 : Str;
isRefl : Prelude.Bool; lock_V2 : {}; p : Str}
inferred: Str -> Str -> {s : ResEng.VForm => Str; c2 : Str;
isRefl : Prelude.Bool; lock_V2 : {}; p : Str}
** Maybe you gave too few arguments to aceV2
ClexAce.gf:
ClexAce.gf:11:
Happened in linearization of Fahrenheit_A
type of aceA "Fahrenheit"
expected: {s : ResEng.AForm => Str; isMost : Prelude.Bool;
isPre : Prelude.Bool; lock_A : {}}
inferred: Str -> Str -> {s : ResEng.AForm => Str;
isMost : Prelude.Bool; isPre : Prelude.Bool; lock_A : {}}
** Maybe you gave too few arguments to aceA
etc.
I'm not sure what the other arguments should be if that error message is to be believed or how to parse the expected type. Any insights you can give on this?
The good news is that there are no more warnings from the overload of mkV, etc. that used to be there! So thank you for that fix :)
Hi! I was on holiday last week, and only read your reply now.
The error, "maybe you gave too few arguments to ace*", is just what it says: the aceA etc. constructors expect two or more arguments (e.g. aceV "sleep" "sleeps"), but you gave only one (aceV "sleep"). Sorry if my suggestion was misleading; I was just guessing based on those constructors existing. But now that I look at the definitions, the ace* constructors are just calling the standard mk* constructors, so it will definitely compile if you use them directly.
Furthermore, all those standard constructors have a one-argument version, so you can use them on just the dictionary form of the word. It will compile, and the result will be right in most cases, but wrong for any irregular words (e.g. child-children, sing-sang-sung). In addition, multiwords will need more attention: e.g. to get "trim off" to work properly, you need this:
mkV2 (partV (mkV "trim") "off") ;
That will give the proper inflection table,
s = table {
VInf => "trim"
VPres => "trims"
VPPart => "trimmed"
VPresPart => "trimming"
VPast => "trimmed" }
p = "off"
Whereas mkV2 "trim off" will give you a table like "trim offs, trim offed, …".
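A lexicon generator could apply the same split mechanically. The following is a minimal Python sketch, not the logic of the original clex_to_gf.pl: the helper name v2_entry is hypothetical, and it assumes that everything after the first word of a lemma is a particle, which will not hold for every multiword entry.

```python
def v2_entry(fun_name: str, lemma: str) -> str:
    """Build a GF linearisation line for a two-place verb.

    Assumption: the first word of the lemma is the verb and any
    remaining words form a particle (e.g. "trim off").
    """
    verb, *rest = lemma.split()
    if rest:
        particle = " ".join(rest)
        # Inflect only the verb; keep the particle separate via partV.
        body = f'mkV2 (partV (mkV "{verb}") "{particle}")'
    else:
        # Single-word lemma: the one-argument smart paradigm is enough.
        body = f'mkV2 (mkV "{verb}")'
    return f"{fun_name} = {body} ;"


print(v2_entry("trim_off_V2", "trim off"))
```

Irregular verbs would still need their forms listed by hand, as noted above.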
Given these additional complications, I wonder what the original script to produce the lexicon contained. Did it simply produce a lexicon that was knowingly wrong but "good enough", or were these special cases taken into account?
Anyway, the short answer is:
If you just want it to compile, use the 1-argument smart paradigms from the standard GF RGL: just put mkN/mkA/mkV/mkV2/… where needed. This will not make it ACE (e.g. no hyphens), and it may be inflected wrong.
If you need to produce a word in a category that doesn't have a 1-argument smart paradigm, you can make your custom wrapper oper. For instance, A2 doesn't have a mkA2 : Str -> A2 in the RGL, but you can add this in your file:
oper myA2 : Str -> A2 = \str -> mkA2 (mkA str) noPrep ;
and then use your wrapper whenever creating a lexicon entry of type A2.
You will need to look at the documentation to see what is and isn't in the smart paradigms for English: http://www.grammaticalframework.org/lib/doc/synopsis/index.html#toc94
I didn't think much of it at the time, but the script used to generate Clex.gf and ClexAce.gf from the original prolog lexicon was broken. You mentioning the script made me want to take a second look at it. I spent some time today to get it working again and lo and behold, everything compiled! It turns out, something weird was going on with a built-in prolog function (append/2) and it was halting when attempting to concatenate some strings. I swapped that with concat/3 and that happened to work.
There were a few terms that didn't get linearizations but that isn't a huge deal. The only issue now is that since the lexicon is so massive, the gr and gt commands seem to stall (so I guess there's always something!) In any case, the new commits I made are in the same pull request I have open to your fork.
Thank you so much for the help!
Regarding the size of Clex, you could use the smaller version that is distributed together with APE, e.g. the build script could just download it from GitHub:
echo "Downloading Clex..."
#curl -L https://raw.github.com/Attempto/Clex/master/clex_lexicon.pl > clex.pl
# Smaller version of Clex, which is distributed together with APE
curl -L https://raw.github.com/Attempto/APE/master/prolog/lexicon/clex_lexicon.pl > clex.pl
echo "Converting Clex to GF..."
swipl -f none -g "main('clex')" -t halt -s clex_to_gf.pl
Even better would be to download it as part of the clex_to_gf.pl script, i.e. in Prolog (for an example of how to do that, see ensure_clex in https://github.com/Attempto/APE/blob/master/tests/downloader.pl).
(In the final merge request I'd like to exclude large files like the Clex.)
I'll look into downloading the lexicon within clex_to_gf.pl.
In what I've done so far, Grammatical Framework doesn't seem to like the degree symbol included in mn_pl('°C', '°C') and mn_sg('°C', '°C'), so I'll have to make sure those lines are excluded when the script is run. I should have time to do that tomorrow.
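One way to exclude such lines before conversion is a plain pre-filter over the lexicon file. This is only a sketch in Python rather than in the Prolog script itself; the function name is hypothetical, and the third sample entry is illustrative rather than copied from Clex.

```python
def ascii_only_entries(lines):
    """Keep only lexicon lines that consist purely of ASCII characters."""
    return [line for line in lines if line.isascii()]


sample = [
    "mn_sg('°C', '°C').",
    "mn_pl('°C', '°C').",
    "noun_sg(girl, girl, human).",  # illustrative entry, not taken from Clex
]

for line in ascii_only_entries(sample):
    print(line)
```

This drops every entry containing a non-ASCII character, including accented words, so it is blunter than handling the degree symbol alone.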
I've been working with the smaller lexicon since I got your response, and even with the smaller lexicon, building trees from baseText (sText ?) seems to get stuck very frequently. Does either of you know what causes the gr function to spin like that? I've tried setting the max depth as low as 4 and it still doesn't work consistently.
I'll push what I have now to my fork. I'm using the make-pgf.sh script to build with the Clex files.
the gr and gt commands seem to stall
Yes, the gt command will definitely be unhappy about such a big lexicon and grammar :sweat_smile: Consider a tree the size of baseText (impVP (a2VP mad_about_A2 (everyNP (cn_as_VarCN (adjCN (comparAP young_A) zip_code_N))))) ("be mad-about- every younger zip-code"): there are 3301068879561 trees up to that size.
To take a slightly larger tree, consText (sText (s (if_thenS (falseS (thereNP (someCollNP girl_N))) (thereNP (someCollNP girl_N))))) (baseText (impVP (vVP accept_V))) ("if it is false that there are some girls then there are some girls . do accept !"): there are 106725330795495054412 possible trees up to that size. So gt for the whole grammar is out of the question :grin:
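To see why the counts grow so fast, here is a small Python sketch that counts trees in a made-up toy grammar without enumerating them. It is not how gftest works internally (the sketch bounds depth, and the grammar, category names and function name are all hypothetical); it only illustrates that each level multiplies the number of choices.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy grammar: each category maps to its functions,
# and each function is represented by the tuple of its argument categories.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [(), (), ("AP", "NP")],  # two nouns, plus adjective modification
    "VP": [(), ("NP",)],           # one intransitive and one transitive verb
    "AP": [(), ()],                # two adjectives
}


@lru_cache(maxsize=None)
def count_trees(cat: str, depth: int) -> int:
    """Number of distinct trees of category `cat` with depth at most `depth`."""
    if depth == 0:
        return 0
    # For each function, multiply the counts of its arguments one level down;
    # a function without arguments contributes exactly one tree.
    return sum(
        prod(count_trees(arg, depth - 1) for arg in args)
        for args in GRAMMAR[cat]
    )


for d in range(2, 6):
    print(d, count_trees("S", d))
```

Even with four categories and eight functions, the count for S goes from 2 at depth 2 to 450 at depth 5.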
But I'm surprised that gr is also slow. I tested it myself, and sometimes it generates a sentence immediately, sometimes it takes a long time, but pressing Ctrl+C kills it immediately. I don't know why this happens.
(Offtopic: how did I get those big numbers? By running gftest with the --count-trees flag: it's using this magical enumeration stuff so it can just count the trees without actually generating them.)
So is this the next issue you're facing? Would you like to have a systematic generation of ACE examples that is exhaustive by some other metric than "produce literally infinite amount of trees"? I may be able to help you with that, but we need to step out of the GF shell to do that.
In what I've done so far, Grammatical Framework doesn't seem to like the degree symbol included in mn_pl('°C', '°C') and mn_sg('°C', '°C'), so I'll have to make sure those lines are excluded when the script is run.
Yes, if you try to make a GF identifier that includes Unicode, you need to wrap the whole identifier name in '' (single quotes), like 'å_N' : N ;
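A generator script could apply that quoting automatically. The sketch below is in Python and makes two assumptions that are not from this thread: that a plain identifier is an ASCII letter followed by ASCII letters, digits, underscores or apostrophes, and that names containing a quote or backslash are simply rejected rather than escaped. The function name gf_ident is hypothetical.

```python
import re

# Assumed shape of an identifier that needs no quoting.
_PLAIN_IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_']*\Z")


def gf_ident(name: str) -> str:
    """Return `name` as is if it looks plain, otherwise wrapped in single quotes."""
    if _PLAIN_IDENT.match(name):
        return name
    if "'" in name or "\\" in name:
        # Escaping rules are not covered here, so refuse instead of guessing.
        raise ValueError(f"cannot safely quote identifier: {name!r}")
    return f"'{name}'"


print(gf_ident("possible_A"))
print(gf_ident("å_N"))
```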
So is this the next issue you're facing? Would you like to have a systematic generation of ACE examples that is exhaustive by some other metric than "produce literally infinite amount of trees"? I may be able to help you with that, but we need to step out of the GF shell to do that.
@inariksit Well, I certainly don't need an infinite number of sentences haha! I also don't really need the example sentences to be exhaustive in any way, except maybe for each word in the lexicon making an appearance. My end goal with this is to use the sentences as data for a neural network, so about 1,000,000 sentences, all with similar lengths to those in tests/ace/sentences.txt, would be ideal. I'm curious what tools are available outside the GF shell.
Also, I noticed that many times, even with relatively low max depths, sentences just become jumbles of nothing that are upwards of 20 words long. Is there a way to limit the sentence length as well as the max depth?
I probably should have described my project in more detail earlier, but better late than never! I'm working on a system that automatically "translates" natural English into ACE, similar to what the researchers did in this paper: Rewriting simplified text into a controlled natural language. Whereas that team decided to use a rule-based system, I want to build a system that is trained like an autoencoder but where the output necessarily conforms to the grammar of ACE. I haven't committed to any system architecture yet, but I'll most likely use a seq2seq model with attention.
The example sentences I want to generate will be used as both the input and target output of this model during training. This is my workaround for the lack of a dataset of pairs of Natural English and ACE where the Natural English couldn't be directly parsed by APE. Since every ACE sentence is also a valid English sentence, this solution seems like the best option that is also feasible.
Lastly, I'm aware that the gr function in the GF shell allows for probability weights. Do you think I would be able to leverage that in the decoder for my system?
I'm curious what tools are available outside the GF shell
Not a single, pre-existing tool that does what you want, I was just thinking of writing some custom code using the PGF library.
For instance, if you would be happy to take the 2700 sentences in sentences.txt as templates, and plug different words in them, so that in the end you use the whole lexicon, that would be quite feasible to do: I believe generateFrom takes an expression with metavariable (sort of a hole in the tree), and generates stuff in the hole. So if the hole is of a, say, noun, then it would put every noun in the hole.
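The idea can be illustrated without the PGF library at all. The Python sketch below models trees as nested tuples and fills a single hole with each candidate word; it is not the PGF API, and the helper names and the choice of "?" as the hole marker are assumptions made for illustration.

```python
HOLE = "?"


def substitute(tree, filler):
    """Return a copy of `tree` with every hole replaced by `filler`."""
    if tree == HOLE:
        return filler
    if isinstance(tree, tuple):
        return tuple(substitute(child, filler) for child in tree)
    return tree


def expand_template(tree, fillers):
    """One tree per filler, e.g. one per noun in the lexicon."""
    return [substitute(tree, filler) for filler in fillers]


def show(tree):
    """Render a nested tuple in the usual parenthesised tree notation."""
    if isinstance(tree, tuple):
        head, *args = tree
        return "(" + " ".join([head] + [show(arg) for arg in args]) + ")"
    return tree


template = ("thereNP", ("someCollNP", HOLE))
for tree in expand_template(template, ["girl_N", "zip_code_N"]):
    print(show(tree))
```

With the real library, the fillers would come from the grammar by category instead of from a hand-written list.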
Also, I noticed that many times, even with relatively low max depths, sentences just become jumbles of nothing that are upwards of 20 words long. Is there a way to limit the sentence length as well as the max depth?
There's no direct mechanism for that. The PGF library (as well as the GF shell) works on tree depth, and the gftest tool works on the number of constructors, but neither of them has any access to the number of words in the linearisation. Someone had a similar problem, and Aarne wrote a grammar that does exactly 3-word sentences here. Of course, the 3-word grammar itself is not usable to you, but the discussion can still be interesting to read, to understand better the technical limitations.
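A workaround outside the grammar is to generate more than needed and discard linearisations that are too long. The following Python sketch assumes the sentences are already available as strings (for example, saved output of gr | l); the function name is hypothetical, and it counts whitespace-separated tokens, so punctuation marks count as words.

```python
def within_length(sentences, max_tokens):
    """Keep sentences with at most `max_tokens` whitespace-separated tokens."""
    return [s for s in sentences if len(s.split()) <= max_tokens]


generated = [
    "recreate yourself , nothing Z !",
    "if it is false that there are some girls then there are some girls . do accept !",
]

for sentence in within_length(generated, 10):
    print(sentence)
```

This wastes the generation effort spent on rejected sentences, but it needs no change to the grammar.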
Now that I know your scope, I would actually suggest that you write about your project on the GF mailing list and ask for suggestions! I know that other people have used GF in data augmentation, so they would probably have some insights that I don't have.
@danshaub Are we btw done with the changes in the grammars? If so, I will clean up (e.g. remove the useless PGF from 80e55da and lexicon files). Just want to confirm with you so I won't mess up your stuff by force pushing.
@Kaljurand you said you want to exclude the generated lexicon files; that's fine, we'll just push the updated script(s)! How about the PGF file generated by make-pgf.bash?
@inariksit I'd prefer to create a tag, make a corresponding release, and attach all larger generated files (e.g. PGFs, GF source for the lexica, etc.) to this release, but otherwise keep them out of the repository.
@inariksit We should be done changing the grammars now :) Feel free to remove any useless PGF files, etc. The pull request I made should have everything, so make sure to merge that before deleting anything.
Not a single, pre-existing tool that does what you want, I was just thinking of writing some custom code using the PGF library.
For instance, if you would be happy to take the 2700 sentences in sentences.txt as templates, and plug different words in them, so that in the end you use the whole lexicon, that would be quite feasible to do: I believe generateFrom takes an expression with metavariable (sort of a hole in the tree), and generates stuff in the hole. So if the hole is of a, say, noun, then it would put every noun in the hole.
That sounds like a great idea! Would you be willing to meet live to discuss this? It's clear we're on different time zones, but I wouldn't mind meeting in the evening in my time zone. I am in Pacific Standard Time and would be fine to meet any time before about 11:00 PM in that time zone.
@Kaljurand That pull request has some scripts that directly download and build with the lexicons in the Attempto repositories. I also included a script that downloads a fork of the Clex repo that just has all the words with accents removed so they don't potentially mess with Grammatical Framework. I was able to reuse your Prolog download code, but I ran into a strange error: any lexicon downloaded from within Prolog had the degree symbol '°' replaced with an unknown-character placeholder, so I included a curl command within the bash scripts. My most recent commit should have the clex folder all cleaned up :)
@danshaub sure! I'll get back to you when I have a suggestion for a time.
I've cleaned up the commits now, I hope I didn't accidentally remove something crucial. @danshaub would you like to test that everything still works?
@Kaljurand I think someone other than me can do the release; I am really not sure what to include, I just like fixing GF issues. :-P Would @danshaub be interested in doing that?
Thanks @inariksit and @danshaub ! I've been busy with vacationing so far, but will start looking into this pull request in the coming days, and will try to merge it. Let me know if you still want to push something before the merge.
I think this pull request can be merged. It achieves its main goals:
Thanks a lot! :)
Just for the record, here is a list of things that I noticed during my testing, and that could be fixed in future checkins (e.g. by me):
The general task is to review how close (the new) ACE-in-GF actually is to ACE. It has never been 100% compatible, of course (that's why there were the various experiments with subsets like OWL and Codeco), but maybe the recent changes have introduced incompatibilities that could be easily avoided. E.g. "gr | l" generates:
recreate yourself , nothing Z !
while correct ACE would be:
something Z , recreate yourself !
This should also be kept in mind when randomly generating ACE sentences with the ACE-in-GF grammar, e.g. it would make sense to control the generation in various ways:
document how to get "ghc --make -o Parser Parser.hs" etc. to succeed (currently complains that: Could not find module ‘PGF’)
I would recommend using Stack: add package.yaml and stack.yaml files to the repo, and link to https://docs.haskellstack.org/en/stable/install_and_upgrade/ or just paste the instructions on how to install Stack.
@Kaljurand I'd be happy to help out with any of those future updates now that I'm a bit more integrated into GF and Attempto. Just reach out at any time :)
Both of you, thank you so much for the help in getting this all working! I'll be sure to keep you updated on the progress of my project as it continues.
This compiles with the RGL commit 4f821ca621a418bba1a306b00063617307fba415 and the latest release of GF 3.11.