blagae / whitakers_words

Other
23 stars 5 forks source link

Many basic words are not being parsed properly #3

Closed Drewlark closed 2 years ago

Drewlark commented 2 years ago

I happen to be a latinist, and was messing around with the python implementation here as I'm more familiar with python and have been building a few neat things with this, but noticing some pretty huge problems.

There are a lot of words that appear to not be getting parsed in the same way as the original application, worst of all being the pronouns and other incredibly common words like "deus", which gives no forms. ie. ipsum (as in dolorem ipsum), which should be the ACC of ipse (and some other forms), shows up as basically not existing. ea gives the forms NOM PL NEU, ACC PL NEU, but does not give ABL S F, NOM S F, etc. These are all words that have very high frequency in the latin language and as such, with them not being properly inflected, the library is pretty much unuseable.

If anybody has any ideas on what might be wrong and where, I'd be happy to try and understand the code and submit a PR if I can fix it, but would need a little direction. I found the part of the application where it loads INFLECTS (because this file has all the correct data), but its a bit arcane. May do some deep digging, but thought I would raise this.

Drewlark commented 2 years ago

Sum forte latinistus, etiam ludens circa pythonis implementationem hic sic familior est pytho mihi et construebam pauce res delicias cum hoc, sed invenio nonnullos magnos errores.

Haec multa verba quae aperiunt ut non dividere recte velut applicatio prima, pessima omnis sunt pronomina et alia verba vulgarissima ut "deus", quod dat nullas formas atque "ipsum" (sicut 'delorem ipsum' Ciceronis) quod esse accusativum verbi "ipse" oportet, et videtur ut quidem non existimatur. "ea" dat formas NOM PL NEU, ACC PL NEU, sed non dat ABL S F, NOM S F et cetera verba. Haec omnes sunt verba quae videntur frequentissime in Lingua Latina et quidem, cum his inflectibus errore, librarium non est utibile.

Si aliquod habet aliquas sententias de qua sit non recte et quo, laetissimus sim petere et cognoscere codicem et submittam PR si figere possim, sed necesse est mihi aliquot directionem. Inveni partem applicationis quo ea legit INFLECTS (quia hoc filum habet omnes recta data), sed pauce arcanum est. Fodiam inferno, sed cognitavi indicare hoc.

Gratias vobis. Vale!

blagae commented 2 years ago

Hi @Drewlark thanks for your feedback. I am aware that the library is not usable for many common words, especially pronouns. The library is a work in progress, though unfortunately progress has been halted for a while due to time constraints.

If you want to take a crack at it yourself, I suggest that you first write some tests and debug them in a code editor, to see why they fail. There is some documentation in the README.md file and in project_structure.md that (I hope) explain how to set up your environment, and also how the logic works. The main reason for your issue with the pronouns is probably that I'm not loading or using info from the INFLECTS file. For deus and ipsum, I truly can't say without debugging.

You can check the file whitakers_words/matcher.py for the implementations, which are relatively straightforward. You will see that _pronoun_checker is just a barebones implementation, because I haven't gotten round to supporting it fully. You can also see in tests/integration/Test_Pronouns.py which types of pronouns are already (somewhat) supported.

If you can't find a solution, then you are also welcome to make a pull request with any number of failing integration tests, because that makes it easier for me to debug. Once there is a working implementation, those tests will also prevent regressions ! I must warn you that I'm unsure when I'll be able to look into fixing them, but any contribution is very welcome.

Drewlark commented 2 years ago

@blagae Appreciate the swift response. Hopefully I didn't come off as too acerbic, fwiw this is the least broken python port of WW, and im finding it somewhat OK to debug.

I've actually already narrowed it down using a debugger -- looks like it finds the proper inflects for a lot of pronouns but the POS on the inflect is marked as "N" so it immediately gets tossed as _dummy_negative. Just for kicks, went in and added some additional logic to accept N inflects for PRON stems, which looks like a promising first step, and now am getting through _pronoun_checker itself, which also then fails both inflects that it finds for "ipse" (as an example). Both of which could be fine, but because whatever is in the "n" key doesn't match, it fails. I'm still trying to figure out what exactly the n key represents, but it looks like its trying to match where the inflect should be on the string? Will try to figure it out, but let me know if anything here pops out at you.

What I'm working on is kind of something which doesn't exist -- that is, a word-processor style algorithm for grammatical/spelling error in latin. The funny part is I have all of that stuff done, but have to obviously either fix this implementation of WW or somehow wrap the other one. Since my stuff works both ways, it can also tell someone the "most likely" way to interpret the words in a clause (will see if it can avoid hanging adjectives/pronouns if possible, match verb person w/ nom. number or multiple nom if possible.) But the main idea, the grammar/spell checker, is pretty much solid. Will let you know if I come up w anything.

blagae commented 2 years ago

I didn't take any offense with what you're saying, as I'm aware that there are still many shortcomings to my implementation.

I took a quick look at the pronoun stuff, so that you (or I, in the future) can understand it better. The entry for is, ea, id in DICTLINE.GEN, line 22555, is:

i e PRON 4 1 PERS X X X A X he/she/it/they (by GENDER/NUMBER); DEMONST: that, he/she/it, they/them;

And in INFLECTS.LAT starting line 2868 you can find:

-- Ex: is ea id => i e i PRON 4 0 NOM S F 2 1 a X A

So in the dictionary, PRON 4 1 is the part that should fully match with the inflections entry, with 0 in the inflections being a "wildcard". Then in the second pair of numbers in the inflections, "2 1", 2 means the second stem, so "e" in this case, and 1 is the number of letters of the inflection (which is irrelevant for my implementation). In "X A", X is for standard Classical Latin, and A is for frequency, for which I have some support.

When I run whitaker parse ea on the command line, it seems to only find options for an incorrect entry, namely the lexeme "idem"

i e PRON 4 2 DEMONS X X X A X (w/-dem ONLY, idem, eadem, idem) same, the same, the very same, also;

This makes me think that (1) I'm not checking for the wildcard option in pronouns, and (2) that I'm not enforcing the requirement of the suffix -dem. You (or future me) can probably find hints in the other checker functions for (1), and I'd have to check how to resolve (2).

Drewlark commented 2 years ago

Actually this is extremely helpful -- I should be able to implement the wildcard suffixes without too much trouble. Really appreciate this. I also noticed that it seemed to be looking at "idem" (which for reference is the pronoun is/id/ea + suffix "dem" "the same".. so its close, I guess)

I didn't see any code in the pronoun checker, or even generally in the flow of the Matcher in general (when parsing pronouns) that would account for wildcards. Would explain a lot. Quite frankly it could fix all the issues I explicated in my OP. This will probably become my pet project for the next few weeks at least, hopefully I can get this fixed without introducing "false" words into the system, and then I think I could make pretty quick work of some of the special functions in the original WW which this lacks (although I hardly think that is an indictment of this library, a lot of those addons are of fairly low value and are usually wrong or if they are right, require a relatively knowledgeable Latinist to discern anyway).

Will be working on this (or at least looking into it) tonight. FWIW will keep you updated or bug you on here if I have clarifying questions. The answer you just provided me saved me hours.

kandji-andrew commented 2 years ago

Okay so I got this mostly fixed(simply made _pronoun_checker as follows:

def _pronoun_checker(stem: Stem, infl: Inflect) -> bool:
    return _check_right_stem(stem, infl) and _basic_matcher(stem, infl)

-- however, im left with literally just one issue -- ths also fixed "ea", including getting the ABL S F. This fixed "ipse" "ipsum" ipsis" etc, but "ipsa" gives everything but the ABL S F (for ipsa) -- there are 3 "viable inflections" it comes up with that are ABL S F, but I cant find any logical way that they would fit together... here are the 3 viable inflects and the stem list:

stem list

stem_list = {list: 2} [{'orth': 'ips', 'pos': 'PRON', 'form': ['X'], 'n': [6, 2], 'wid': 24735, 'props': ['X', 'X', 'X', 'A', 'O'], 'stem_number': 0}, {'orth': 'ips', 'pos': 'PRON', 'form': ['X'], 'n': [6, 2], 'wid': 24735, 'props': ['X', 'X', 'X', 'A', 'O'], 'stem_number': 1}]

viable_inflections (selected for ABL S F and PRON)

 47: {'ending': 'a', 'n': [1, 3], 'note': '', 'pos': 'PRON', 'form': ['ABL', 'S', 'F'], 'props': ['X', 'A'], 'iid': 2796, 'stem': 0}
 48: {'ending': 'a', 'n': [1, 4], 'note': '', 'pos': 'PRON', 'form': ['ABL', 'S', 'F'], 'props': ['X', 'A'], 'iid': 2800, 'stem': 0}
 52: {'ending': 'a', 'n': [4, 0], 'note': '', 'pos': 'PRON', 'form': ['ABL', 'S', 'F'], 'props': ['X', 'A'], 'iid': 2887, 'stem': 1}

Will keep thinking about this -- brain is kinda dead from work and this, but let me know if you see something here.

If you'd like me to make a PR or something on the small pronoun fix, I can do so, but would be awesome to get this one last thing worked out first.

blagae commented 2 years ago

@kandji-andrew I've tried your suggestion and while you report promising results for ea, it fails for one of the existing integration tests:

    def test_quos(self):
        result = self.par.parse("quos")
        self.assertEqual(len(result.forms), 1)
>       self.assertEqual(len(result.forms[0].analyses), 5)
E       AssertionError: 31 != 5

As you may guess, 31 options for a relatively straightforward pronoun is not likely to be correct. The reason is that you get rid of the condition and infl["n"] == stem["n"].

The _basic_checker function will default to True if the stem object doesn't have an 'n' (= classifier) property; this default is necessary for indeclinable words such as interjections and conjunctions. If you just add your condition to the existing conditions, all existing tests succeed. I haven't checked whether or not it fixes the issue you're trying to solve. If I have some time tomorrow (and if you haven't beat me to it) I'll see if I can write some tests for words like ea, ipsum, etc, to see if they come out right.

@Drewlark as the other commenter posted, there is indeed code that catches the wildcard option :-)

In general, for anyone interacting with the code in any way, I recommend that you read/use/debug the integration tests. I believe that they're fairly readable (in contrast to the unit tests, which are very low-level and somewhat arcane), and they exist to document the progress that is being made and to prevent reoccurring bugs. If you can find the time to write tests, even if they fail with the current implementation, that's actually a great help.

blagae commented 2 years ago

@kandji-andrew I'm gonna have to eat my words a bit. You were right about the implementation you suggested, and it turns out that there are indeed 31 correct hits for quos in the dictionary, because it lists so many cases. I have written some test cases that confirm that you were correct, but if you can definitely submit a PR with the fix itself and any tests you might have.

The reason that you're not finding any hits for the ABL S F of ipsa is not due to any problem with your code or mine: in the INFLECTS.LAT file, the ending is commented (along with a few others).

PRON 6 0 NOM S F 1 1 a X A --PRON 6 0 GEN S F 1 3 ius X A --PRON 6 0 DAT S F 1 1 i X A PRON 6 0 ACC S F 1 2 am X A --PRON 6 0 ABL S F 1 1 a X A

I don't know how it happened (I guess it was just in the source file ?), but removing the -- and regenerating the dictionaries (=removing the entire generated folder and reinstalling the library) fixes the problem.

kandji-andrew commented 2 years ago

I will say that it is odd "quos" is returning so many results, because they are all the same case, gender, etc. We probably will want a filter for this, that can compress any answers that are exactly the same into a single answer -- they even have the same ID.

blagae commented 2 years ago

It does make sense in a way, if you look at the dictionary: William Whitaker tried to give separate entries for each type of pronoun (interrogative, relative, demonstrative, ...) and quis/qui can kind of be all of them. He assigned many declensions to the PRON 1 class, because there are a lot of ways the word can appear. I do agree with you that it can be improved a lot, but my first goal is to get on par with the original Whitaker's Words, so this is not my priority right now.

blagae commented 2 years ago

@Drewlark I have found what's wrong with deus, and it's quite complex: there is an entry in the UNIQUES.LAT file, for the Vocative, which isn't de or di (~ meus) but just regular deus.

deus
N 2 1 VOC S M p X E X C X god; God!: Oh God;

I was under the impression that words which have a unique form do not need to be analyzed further, but apparently I was wrong. The faulty code is in whitakers_words.parser.Word.analyse(). The fix isn't fully trivial because there's some cleanup going on after the analysis.

Another problem is that the regular word for deus is capitalized in the dictionary, and I'm not catching that either:

De De N 2 1 M N X E X A O God (Christian text); god; divine essence/being, supreme being; statue of god;

Curiously, dea is not capitalized ;-)

de de N 1 1 F P X E X C O goddess;

So you've actually reported multiple very relevant issues, I'll have to take a bit of a deep dive to fix them.

blagae commented 2 years ago

Deus (vocative and case sensitivity), ea, ipsum, and ipsa are now fixed and tested for. Thanks for your feedback and cooperation !