GrammaticalFramework / GF

Archive of monolithic GF repository until 2018-07-25
http://www.grammaticalframework.org/

WIP: Portuguese resource grammar #28

Closed odanoburu closed 6 years ago

odanoburu commented 6 years ago

this is a work in progress, but already has some useful stuff.

the main repo I'm working on is this one, which has the project's actual history.

arademaker commented 6 years ago

@aarneranta could you help us here? We are trying hard to debug the code to understand what is blocking the compilation of the Portuguese grammar. Even with verbosity set to 3, we don't have any clue about the cause of the problem. Any tips?

$ gf -v=3 --batch portuguese/LangPor.gf
...
  generating PMCFG
+ AdvS 1 (1,1)
+ AdvSlash 8 (8,1)
+ EmbedQS 1 (1,1)
+ EmbedS 1 (1,1)
+ EmbedVP 16 (2,2)
+ ExtAdvS 1 (1,1)
+ ImpVP 16 (4,4)
+ PredSCVP 16 (8,8)
+ PredVP 1536 (6264,6264)
+ RelS 4 (1,1)
+ SSubjS 2 (2,2)
+ SlashPrep 8 (4,1)
+ SlashVP 12288

inariksit commented 6 years ago

@odanoburu @arademaker It just seems that SentenceRomance is slow, see all the comments here: https://github.com/GrammaticalFramework/GF/blob/master/lib/src/romance/SentenceRomance.gf#L18-L140

(The number SlashVP 12288 means that the single category SlashVP is expanded into 12288 concrete categories, one for each combination of parameters it can possibly get: the number, gender and definiteness of its arguments, whether an argument is a pronoun or not, object case, and so on. Anything that is a parameter in some category and is even remotely used by SlashVP counts.)
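As a rough illustration of how such counts arise, here is a minimal Python sketch that enumerates every combination of a few parameter values. The parameter names and sizes are hypothetical, chosen only for illustration; they are not the actual parameters of the Romance functor, and this is not how the GF compiler is implemented.

```python
from itertools import product

# Hypothetical parameter domains, for illustration only;
# these are NOT the actual parameters of the Romance functor.
param_domains = {
    "number": ["Sg", "Pl"],
    "gender": ["Masc", "Fem"],
    "case": ["Nom", "Acc", "Dat"],
    "is_pronoun": [True, False],
}

# One concrete variant per combination of parameter values.
combinations = list(product(*param_domains.values()))
print(len(combinations))  # 2 * 2 * 3 * 2 = 24

# Every additional two-valued parameter doubles the count.
param_domains["definite"] = [True, False]
doubled = list(product(*param_domains.values()))
print(len(doubled))  # 48
```

The count is the product of the domain sizes, so a handful of extra parameters is enough to reach numbers in the thousands.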

I suggest you just comment out SlashVP and SlashVS from SentenceRomance and compile your Portuguese grammar, so that you can continue the development. For instance, the grammar used here http://cloud.grammaticalframework.org/wc.html does not even use SlashVP and SlashVS; see all the excluded functions here: https://github.com/GrammaticalFramework/GF/blob/master/examples/app/App.gf#L4-L24

When I commented out SlashVP and SlashVS from SentenceRomance, I get this error message:

- parsing IdiomPor.gf
  renaming IdiomPor.gf:30-39:
  Happened in the renaming of ProgrVP
   constant not found: estar_2
   given P, ParamX, Prelude, BeschPor, ParadigmsPor, MorphoPor,
         CommonX, CatPor, IdiomPor
IdiomPor.gf:21-23:
  Happened in the renaming of ExistNP
   constant not found: hay_3
   given P, ParamX, Prelude, BeschPor, ParadigmsPor, MorphoPor,
         CommonX, CatPor, IdiomPor
IdiomPor.gf:24-28:
  Happened in the renaming of ExistIP
   constant not found: hay_3
   given P, ParamX, Prelude, BeschPor, ParadigmsPor, MorphoPor,
         CommonX, CatPor, IdiomPor

Thanks for the work so far, and sorry for reacting slowly to the pull requests!

inariksit commented 6 years ago

Two more things:

1) Compile without commenting out Slash* from SentenceRomance

I don't actually know what the reason is, but if I try to compile any of the RGL languages that use the Romance functor straight from the lib/src/ directory, it doesn't complete. But when I do cabal install from the root directory, it compiles all the languages in this list and puts the .gfo files into $GF_LIB_PATH. The resulting LangSpa/Fre/Ita have the Slash* functions and they work fine. I don't know what magic causes this.

Here's what I've done:

At this point, I'm getting a whole lot of errors from IrregPor, such as following:

lib/src/portuguese/IrregPor.gf:
   lib/src/portuguese/IrregPor.gf:63372-63442:
     Happened in linearization of sentar_V
      wrong number of values in table table
                                        VFB
                                        ["sentar"; "sentando"; "sentado"; "siento"; "sientas";
                                         "sienta"; "sentamos"; "sentáis"; "sientan"; "siente";
                                         "sientes"; "siente"; "sentemos"; "sentéis"; "sienten";
                                         "sentaba"; "sentabas"; "sentaba"; "sentábamos";
                                         "sentabais"; "sentaban"; "sentara"; "sentaras"; "sentara";
                                         "sentáramos"; "sentarais"; "sentaran"; "sentase";
                                         "sentases"; "sentase"; "sentásemos"; "sentaseis";
                                         "sentasen"; "senté"; "sentaste"; "sentó"; "sentamos";
                                         "sentasteis"; "sentaron"; "sentaré"; "sentarás"; "sentará";
                                         "sentaremos"; "sentaréis"; "sentarán"; "sentare";
                                         "sentares"; "sentare"; "sentáremos"; "sentareis";
                                         "sentaren"; "sentaría"; "sentarías"; "sentaría";
                                         "sentaríamos"; "sentaríais"; "sentarían"; variants {};
                                         "sienta"; "siente"; "sentemos"; "sentad"; "sienten";
                                         "sentado"; "sentada"; "sentados"; "sentadas"]

If I use the old files in portuguese, it compiles, but all the sentences it generates seem to be just Spanish.

So if you add Portuguese to both lists (languages and incomplete languages) in Setup.hs, you should be able to compile it yourselves!

2) Files in wrong place

The files CombinatorsPor.gf, ConstructorsPor.gf, SymbolicPor.gf, SyntaxPor.gf and TryPor.gf should be in the directory api, not portuguese. If you put them in the right place, then you don't need to have Portuguese in the list of incomplete languages.

odanoburu commented 6 years ago

When I commented out SlashVP and SlashVS from SentenceRomance, I get this error message:

At this point, I'm getting a whole lot of errors from IrregPor, such as following:

these were corrected on my repo, I'm now including them on my fork of this repo!

2) Files in wrong place

I'm correcting this, thanks!

1) Compile without commenting out Slash* from SentenceRomance

it now works!! :smile: thank you very much @inariksit

I'll update the PR now.

odanoburu commented 6 years ago

hello @inariksit , are you able to import all Portuguese tenses? I can compile GF, and I can import and use the Portuguese present tense, but not all tenses...

inariksit commented 6 years ago

@odanoburu It's really slow to link the Portuguese grammar--I just stopped it after 5 minutes. I can try overnight, or some other time when I don't need the computer for something else. But here's another hack, if you only want to test the linearisations, not parsing.

1) Import the grammar with the flag -retain

> i -retain LangPor.gfo
157 msec

2) Test any tree you like with cc (compute_concrete). You can see all options for cc if you type help cc into the GF shell.

> cc -table -unqual PredVP (UsePron i_Pron) (ComplSlash (SlashV2a drink_V2) (MassNP (UseN beer_N)))
s . DDir => RPres => Simul => RPos => Indic => eu bebo cerveja
s . DDir => RPres => Simul => RPos => Conjunct => eu beba cerveja
s . DDir => RPres => Simul => RNeg False => Indic => eu no bebo cerveja
…
s . DDir => RPast => Simul => RNeg True => Indic => eu no bebia cerveja
s . DDir => RPast => Simul => RNeg True => Conjunct => eu no bebesse cerveja
s . DDir => RPast => Anter => RPos => Indic => eu havia bebido cerveja
s . DDir => RPast => Anter => RPos => Conjunct => eu houvesse bebido cerveja
…
s . DInv => RCond => Anter => RNeg True => Indic => no haveria bebido cerveja eu
s . DInv => RCond => Anter => RNeg True => Conjunct => no haveria bebido cerveja eu

I see all tenses are formed in the output.

Some of the parameters are redundant, like the Boolean in RNeg, but I see that comes from the Romance functor. I can import the Spanish grammar in 2 minutes or so; you could check whether they have excluded something from the Romance functor to make it faster. Or have you added any parameters to the Portuguese grammar that could explain why it's slower?

arademaker commented 6 years ago

Hi @inariksit , I left it running overnight. It consumed 48G of RAM and it didn't finish. Something is wrong ...

inariksit commented 6 years ago

@odanoburu Yeah same for me, it didn't finish overnight. I'm not all that surprised that someone has managed to write a GF grammar that doesn't finish compiling or linking (or whatever it is that makes it parse in addition to linearisation) on my computer :-P but the fact that Spanish does finish, makes it strange.

Does it work when you comment out SlashVP and SlashVS in the Romance functor?

odanoburu commented 6 years ago

@inariksit as mentioned on IRC, commenting out SlashVP and SlashVS creates several errors that would require more commenting out...

inariksit commented 6 years ago

@odanoburu I commented out Slash* and everything else that needed commenting out, here are the PGFs: http://old-darcs.grammaticalframework.org/~inari/portuguese/ Both of them are the same grammar, I just compiled the second with the flag --optimize-pgf; I'm including the first one just for curiosity. (The grammar testing tool runs the smaller one much faster too!)

odanoburu commented 6 years ago

hey @inariksit , thanks!! do these work for all tenses then??

I had no idea the optimized PGF could be this much smaller, I guess I must read the PGF paper..

can you push the commented romance to a branch on your fork, please?

inariksit commented 6 years ago

@odanoburu Sorry about the late answer (I've turned off notifications on pretty much everything; if you ever want a quick answer from me, come to IRC! :-D) The commented out grammar works for all tenses. But even better, we've got a proper solution to your problem now! \o/ Aarne and I had a look at the grammar, and turns out it was only about the variants in BeschPor.gf. That's a good cautionary tale to not use variants in the resource grammar :-P (And also a reminder that we should really do something about the handling of the variants, so it doesn't blow up.)

Aarne shared another hack for getting the same behaviour as with variants: use the pre construction with an empty string and a wildcard "otherwise" branch. It always linearises the first form, but also parses the second one. It's all in this commit, which I pushed to the master repo. If you want to change the order of the variants, I suggest you just flip x and y in the function vars.

odanoburu commented 6 years ago

@inariksit oh, I'm so glad you've found a solution! I didn't know that the variants were not well supported... but they do work, just not on big grammars...

thank you very much!

inariksit commented 6 years ago

@odanoburu Yeah, variants work, but they cause an explosion of possibilities in the tables, which in this case leads to a total freeze. The hack with pre ensures that the variants stay inside the tables.
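A toy Python model of that explosion, as a sketch only: it assumes that every cell holding alternatives multiplies the number of variant-free tables, which is a simplification and not the GF compiler's actual algorithm. The function names and the cell contents are hypothetical.

```python
from itertools import product

def expand_variants(table):
    """Return every variant-free table obtainable by picking one
    alternative per cell. `table` is a list of cells; each cell is a
    list of alternative strings."""
    return [list(choice) for choice in product(*table)]

def count_expansions(cells, alternatives=2):
    """Number of variant-free tables when every one of `cells` cells
    offers `alternatives` choices."""
    return alternatives ** cells

# Hypothetical three-cell table with two alternatives in every cell.
small = [["x1", "y1"], ["x2", "y2"], ["x3", "y3"]]
print(len(expand_variants(small)))  # 2 ** 3 = 8

# The count grows exponentially with the number of cells.
print(count_expansions(10))  # 1024
```

In this model, a conjugation table with dozens of cells that each carry two alternatives expands to an astronomically large number of tables, whereas keeping the choice inside each cell leaves a single table.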

Just as a curiosity, this is how pre works even for English: it parses even the wrong forms (an car, a animal), but only linearises the correct forms.

Languages: LangEng
Lang> p "an car" | l
a car

Lang> p "a animal" | l
an animal

odanoburu commented 6 years ago

@inariksit I see! are there other restrictions on other constructions like variants? (nonExist, for instance?)

it parses even the wrong forms (an car, a animal), but only linearises the correct forms.

that's nice for this use-case, but in the case of BeschPor we'd like to parse Brazilian Portuguese and European Portuguese forms, and also linearize them (although there wouldn't be a clear way of selecting them, so I guess it's not that big of a loss!)

inariksit commented 6 years ago

@odanoburu In that case, it definitely makes sense to have two different functions for each form, or two different files. You could have two folders, brazilian and european, in the GF/lib/src/portuguese folder, and put a BeschPor.gf in each of them; the two files would be identical except for the verb forms. Then in your LangPor.gf, put either brazilian or european in the path, e.g. --# -path=.:../romance:../abstract:../common:../api:brazilian; changing brazilian to european switches the standard.

odanoburu commented 6 years ago

@inariksit that's a nice idea for the verbs and the lexicon! (which is what I intended to implement anyway, because I don't know the other differences well enough)