bjascob / pyInflect

A python module for word inflections designed for use with spaCy.
MIT License
91 stars 5 forks source link

Some suggestions #1

Closed chrisjbryant closed 5 years ago

chrisjbryant commented 5 years ago

Heya,

Like the new version! :)

Some quick suggestions:

  1. To make it easier to use, I'd list all the valid POS tags and their meanings in the readme. Some people might not be familiar with PTB tags or what the package can generate.

  2. If you have time, I also think it'd be nice if getAllInflections returned a dictionary rather than a list: E.g. {"VB": ("eat"), "VBZ": ("eats"), ...}. That way, we could also see all the forms that are available. It's true you can generally infer this from the position of the tuple in the list, but I think it'd be good to make it explicit.

With this in mind, I've attached the code I wrote previously to convert the infl.txt into JSON format with POS tags: new_agid.zip I think working from this file should also bypsas most of the other issues you have with overrides.

  1. I also thought you were going to make the pos_type argument optional in getAllInflections. So you can specify V, A or N, but also leave it blank if you want to see all options; E.g.
    > getAllInflections("watch")
    {"NN": ("watch"), "VB": ("watch"), ...}
bjascob commented 5 years ago

It looks like you've used the key to your dictionary as the lemma. I've been using the key as a tuple (lemma, pos_type). In the raw AGID there are a lot of lemmas with multiple entries for different pos_types (ie.. N, V, A) and those entries have very different inflections. Looks like you're only keeping the first one you find. I may be wrong but I think we need to always specify which pos_type of the word we're looking for to get the correct inflections. This is why getAllInflections requires an N, V or A. It does not require the Penn Tree tag.

There is another function getInflection that takes a Penn Tree tag. I infer the N,V or A from the tree tag to get the correct dictionary entry. These two functions are intended to provide the two options, 1 to get all inflections for a word and, 2 to get only the inflection for a specific tag.

I can certainly put some info on tags in the README. Changing the return value from getAllInflections to be a dictionary that keys off of the treebank tag is a little bit more complicated. There's the "be" corner case and the fact that multiple tags often refer to the same inflection, however it's probably a reasonable change. I'll look into it.

chrisjbryant commented 5 years ago

On the contrary, I take the fact that a lemma can have multiple POS into account. E.g. {"watch": {"NN": ["watch"], "NNS": ["watches"], "VB": ["watch"], "VBD": ["watched"], "VBG": ["watching"], "VBP": ["watch"], "VBZ": ["watches"], "VBN": ["watched"]}} The form dict contains both the noun and verb forms. If the lemma is the same for multiple POS, it just adds them to an existing dictionary.

getAllInflections("watch") should return that whole dict. getAllInflections("watch", "N") should return {"NN": ["watch"], "NNS": ["watches"]} getAllInflections("watch", "V") should return {"VB": ["watch"], "VBD": ["watched"], "VBG": ["watching"], "VBP": ["watch"], "VBZ": ["watches"], "VBN": ["watched"]} getAllInflections("watch", "A") should return nothing.

My JSON also takes the fact that some POS map to the same form into account; e.g. VB and VBP go to "watch". I designed the format to have all the options available so think it should be easier to work with.

bjascob commented 5 years ago

OK. I see how you've organized this in your code. Probably not too difficult for me to make the pos_type option and just concatenate any of the 3 pos_types that exist.

bjascob commented 5 years ago

I updated the code to be more dictionary based. Data from getInflections will now return a dict as you've specified above. There is only one standalone method now. You can specify the pos_type or tag. Both will limit the returned inflections.

I wasn't in favor of returning dictionaries originally because for some inflections you will be returning invalid treebank tag forms. For example, you see this for pos_type='A' where we'll now be returning inflections tagged with RBR and JJR when often it is one or the other, not both. This may cause a little confusion but after considering, I don't think it's a big deal.

Just FYI... I also choose to use a csv file for data rather than json because csv is more compact and faster to parse. I've found json a little slow for large files. Doing a quick test shows the csv is about 1/3 the size and 1/2 the time to load (250mS vs 500ms). None of these are big numbers so it probably doesn't make much difference. Load time differences on an old mechanical HDD might be noticeable but probably not a big concern.

chrisjbryant commented 5 years ago

I updated the code to be more dictionary based. Data from getInflections will now return a dict as you've specified above. There is only one standalone method now. You can specify the pos_type or tag. Both will limit the returned inflections.

Cool. I preferred the 2 method option with getInflection and getAllInflections though. It's kind of redundant for the current getInflections("watch", "VBD") to return a dict where the only key is identical to the input; i.e. {"VBD": ("watched",)}. Instead, I think the following makes more sense: Function 1: getInflection requires a PTB tag and returns a tuple:

> getInflection("watch", "VBD")
("watched",)

Function 2: getAllInflections takes an optional pos_type (A,N,V) and returns a dict:

> getAllInflections("watch")
{'NN': ('watch',), 'NNS': ('watches',), 'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}
> getAllInflections("watch", "N")
{'NN': ('watch',), 'NNS': ('watches',)}

In this latter case, you can just return all the entries that start with N for Noun, all the entries that start with V for Verbs (and M for Modals?), otherwise everything else for Adj/Adv (JJ/RB).

I wasn't in favor of returning dictionaries originally because for some inflections you will be returning invalid treebank tag forms. For example, you see this for pos_type='A' where we'll now be returning inflections tagged with RBR and JJR when often it is one or the other, not both. This may cause a little confusion but after considering, I don't think it's a big deal.

I don't think we can avoid this anyway since the AGID simply doesn't make the distinction. I agree this could cause a bit of confusion, but also ultimately don't think it's a big deal.

Just FYI... I also choose to use a csv file for data rather than json because csv is more compact and faster to parse. I've found json a little slow for large files. Doing a quick test shows the csv is about 1/3 the size and 1/2 the time to load (250mS vs 500ms). None of these are big numbers so it probably doesn't make much difference. Load time differences on an old mechanical HDD might be noticeable but probably not a big concern.

Yep, no worries!

On a different topic, something is not working properly with the current version.

>>> from pyinflect import getInflections
>>> getInflections("watch", "V")
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}
>>> getInflections("watch", "N")
{}
>>> getInflections("watch", "V")
{}
>>> getInflections("watch")
{}

It returns it correctly the first time, but then stops working if you call it again. Not sure what's causing this.

Another tiny thing: I get an error when I try to use a version of spacy older than 2.0 (I often use 1.9 for compatibility reasons). In particular, older versions didn't have the ability to have custom Token attributes. One solution would be to check whether hasattr(Token, "set_extension"), or else just make sure spacy version >=2;

>>> import spacy
>>> spacy.__version__
'1.9.0'
bjascob commented 5 years ago

I think the errors you're seeing is because I'm getting a reference to the lemma in the main dictionary and then removing keys that aren't needed for the pos_type. I need to change this to do a copy so I'm not messing up the source data.

I'll change back to using two 'get' functions. I agree it's probably better to have one return a dict and the other a tuple.

Having Spacy installed isn't a requirement. I've got the import in a try/catch block. I'll add a version check in there and put a note in the Readme about needing version 2.0 or later.

bjascob commented 5 years ago

Version 0.4.0 is posted to PyPi now and should resolve these issues.

chrisjbryant commented 5 years ago

Perfect!

A little bit off-topic, but what is the KnownIssues.txt file exactly? E.g. beat/VBD -> {'beaten', 'beat'} Does this mean you found both "beat" and "beaten" as a "VBD" in the test corpus?

If so, that's not an issue with this module, but rather that POS taggers make mistakes (which is expected). Since the main goal of the module is just to return a given inflected form however, I wouldn't try to fix POS tagger errors since they're going to be unpredictable and depend on which POS tagger you use.

bjascob commented 5 years ago

The list comes from running the tagger/pyinflect on the corpus and looking for words that inflect inconsistently. Sometimes this will be because of the tagger giving the wrong tag. Other times it may be because there are multiple spellings for an inflected form or there's simply a misspelling (ie. pay/VBN -> {'paid', 'payed'}). Another possibility is when the lemma/tag is ambiguous to the form (ie.. people/NNS -> {'people', 'peoples'})

I wanted to create a list of words/tags that were potentially problematic so if anyone posted issues related to incorrect inflections I could see if multiple forms appear in general text, meaning it may (but not necessarily) require extra info to disambiguate them (ie.. there may not be a simple fix).