emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Proposal for a simple webapp included in Pie #6

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 5 years ago

The goal here is to eventually have a distributed series of APIs, each possibly running a different version of Pie in its own container, unified behind a public API such as https://github.com/hipster-philology/deucalion which would then talk to the different micro-services.

From app.py: this module can be run after installing the app requirements (pip install -r requirements-app.txt)

How to run for development:

PIE_MODEL=/home/thibault/dev/pie/model-lemma-2018_10_23-14_05_19.tar FLASK_ENV=development flask run

where PIE_MODEL is the path to your model
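For orientation, a minimal sketch of what a Flask entry point in this spirit could look like; the route, response shape, and handler name here are assumptions for illustration, not the actual app.py code:

```python
# Hypothetical sketch of a PIE-style Flask app; only the PIE_MODEL
# environment variable is taken from the instructions above.
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

# Path to the trained model, as passed via PIE_MODEL in the run commands
MODEL_PATH = os.environ.get("PIE_MODEL")


@app.route("/", methods=["GET", "POST"])
def tag():
    # Accept "data" either as a query parameter (GET) or form field (POST)
    data = request.values.get("data", "")
    # The real app would load the model from MODEL_PATH and tag `data` here
    return jsonify({"input": data})
```

Run with `flask run` for development or under gunicorn for production, exactly as in the commands above.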

How to run in production:

gunicorn --workers 2 app:app --env PIE_MODEL=/home/thibault/dev/pie/model-lemma-2018_10_23-14_05_19.tar

You will probably also want to add a --bind option to this (e.g. --bind 0.0.0.0:8000) to control the listening address.

Example URL:

http://localhost:5000/?data=Ci+gist+saint+Martins+el+sains+de+tours.%20Il%20fut%20bon%20rois.

Example curl:

curl --data "data=Ci gist saint Martins el sains de tours. Il fut bon rois." http://localhost:5000

Example output :

    token   lemma   morph   pos
    ci  ci  DEGRE=- ADVgen
    gist    jesir   MODE=ind|TEMPS=pst|PERS.=3|NOMB.=s  VERcjg
    saint   saint   NOMB.=s|GENRE=f|CAS=r   ADJqua
    martins martin  NOMB.=s|GENRE=m|CAS=r   NOMcom
    el  en1+le  NOMB.=s|GENRE=m|CAS=r   PRE.DETdef
    sains   sain    NOMB.=p|GENRE=m|CAS=r   NOMcom
    de  de  MORPH=empty PRE
    tours   tor2    NOMB.=p|GENRE=f|CAS=r   NOMcom
    .   .   _   PONfrt
    il  il  PERS.=3|NOMB.=s|GENRE=m|CAS=n   PROper
    fut estre1  MODE=ind|TEMPS=psp|PERS.=3|NOMB.=s  VERcjg
    bon bon NOMB.=s|GENRE=m|CAS=n|DEGRE=p   ADJqua
    rois    roi2    NOMB.=s|GENRE=m|CAS=n   NOMcom
    .   .   _   PONfrt
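Since the reply is a whitespace-separated table like the one above, a client can parse it into records with a few lines of Python; this is a sketch against the sample output, not part of the app itself:

```python
# Parse the tagger's tabular reply (header row + one row per token)
# into a list of dicts keyed by the header columns.
def parse_output(text):
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = lines[0].split()
    # Fields contain no internal spaces, so a plain split() is enough
    return [dict(zip(header, l.split())) for l in lines[1:]]

sample = (
    "token\tlemma\tmorph\tpos\n"
    "ci\tci\tDEGRE=-\tADVgen\n"
    "gist\tjesir\tMODE=ind|TEMPS=pst|PERS.=3|NOMB.=s\tVERcjg\n"
)
rows = parse_output(sample)
```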
PonteIneptique commented 5 years ago

FYI, we are about to release a demo API based on this https://github.com/PonteIneptique/deucalion-model-af

emanjavacas commented 5 years ago

Hi Thibault,

thanks for the PR, and sorry that it took so long to react. Up to now the code has been changing quite regularly due to ongoing optimizations and experimentation, but right now PIE is starting to reach some level of stability. In particular, the last merge I did changed quite a lot of the basic functionality, so it might be good to check whether your PR still works before merging. I've also moved model_spec into utils so that you can use it for the webapp.

PIE now does multitask learning in a more common and better-performing way. Instead of computing a loss for all tasks on each batch, only a single task is considered per batch, sampling from the tasks uniformly (this could also be changed to skew the sampling towards particular classes). Additionally, for multilayer sentence encoders (if you set num_layers greater than 1), each task can be predicted at a different layer; it has been shown in the literature that different tasks are better learned at different layers.

Another addition is different formalisms for lemmatization: it can be done char-level, but you can also predict binary edit trees (token-level) and variations thereof just by changing the parameters in the config file.
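The per-batch task sampling described above can be sketched in a few lines; the function name and weighting interface are illustrative assumptions, not PIE's actual API:

```python
import random


def sample_task(tasks, weights=None, rng=random):
    """Pick the single task to train on for the next batch.

    With no weights the sampling is uniform over tasks; passing weights
    skews it towards particular tasks, as mentioned above.
    """
    if weights is None:
        return rng.choice(tasks)  # uniform sampling
    return rng.choices(tasks, weights=weights, k=1)[0]  # skewed sampling
```

Each training step would then compute the loss only for `sample_task(...)` instead of summing losses over all tasks.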

There is also a script to pretrain the encoder as a language model. As you see, there is plenty of room for experimentation, and perhaps we could consider some collaboration to systematically explore some of these possible configurations.

As a side note, I've seen you guys are collapsing all morphology tasks into a single one, which considerably increases the label space. I am pretty sure you would get better performance if you split them and treat them as different tasks, like this:

token   lemma   degree  mode    nomb    pers    temps   pos
ci  ci  -   _   _   _   _   ADVgen
gist    jesir   _   ind pst 3   s   VERcjg
...
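Going from the fused morph column to the split layout above is mechanical; here is a sketch (the feature list is taken from the example, the function itself is illustrative):

```python
# Split a fused morph string such as "MODE=ind|TEMPS=pst|PERS.=3|NOMB.=s"
# into one value per feature, filling "_" for features that are absent.
def split_morph(morph, features):
    if morph in ("_", "MORPH=empty"):
        feats = {}
    else:
        feats = dict(f.split("=", 1) for f in morph.split("|"))
    return {f: feats.get(f, "_") for f in features}


features = ["MODE", "TEMPS", "PERS.", "NOMB."]
```

Each resulting feature column can then be trained as its own task, keeping every label space small.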
PonteIneptique commented 5 years ago

Hey @emanjavacas , thanks for the heads up!

A few different replies or questions:

  1. In the context of prelemmatizing (which is what this model is focused on atm), actually having valid full morphology is more interesting than having, say, 90% valid morphological features with potential impossibilities (like temps=s|case=nom or something like that). For textual analysis, though, training on single tasks might be more interesting.
  2. For the webapp, I actually need to move from a full reply to streaming, so it's good you gave me a chance not to merge the PR right away: with texts a bit larger than my first examples, things crash. So I'll stream the reply using the batched replies you are probably yielding at some point.
  3. As for the changes, the example script is actually missing an example for POS right now. Could you add something about it?
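Streaming the reply instead of buffering it (point 2 above) boils down to yielding rows batch by batch; this is a sketch under assumed names, not the actual webapp code:

```python
# Yield the tagged reply incrementally, one TSV line at a time, so the
# whole document never has to be held in memory at once.
def stream_tsv(batches):
    yield "token\tlemma\tmorph\tpos\n"
    for batch in batches:  # `batches` stands in for whatever the tagger yields
        for row in batch:
            yield "\t".join(row) + "\n"

# With Flask, such a generator can be wrapped in
# Response(stream_tsv(batches), mimetype="text/tab-separated-values")
# so chunks are sent to the client as they are produced.
```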

Cheers !

emanjavacas commented 5 years ago

Hi!

  1. Not sure what you mean by prelemmatization, but I really cannot think of a situation where fusing all morphology tasks into one can help... I'd be interested to see how it actually performs. When you train that way you should still evaluate with respect to the single tasks, so the numbers you are getting on fused morphology aren't very meaningful...

  2. Yes, now everything is pretty much a generator.

  3. I agree everything is pretty much underdocumented. It will take some time until I get around to improving the documentation. You can try reading through default_settings.json for inspiration, but I don't think you'd need more than adding {"name": "pos"} (plus "target": true if you want to optimize for POS tagging) to get POS tagging running.
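Based on that hint, a minimal task configuration might look like the following; the surrounding "tasks" key and exact schema are assumptions here, so check default_settings.json in the repo for the authoritative layout:

```json
{
  "tasks": [
    {"name": "lemma"},
    {"name": "pos", "target": true}
  ]
}
```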

let me know if you encounter other issues

PonteIneptique commented 5 years ago

Here you go :) (I would definitely recommend squashing this, nobody cares about my mistakes :P )

PonteIneptique commented 5 years ago

Thanks a lot ;)