Manu bugs - Githubissues

MarkJeronimus commented 5 years ago

I'm currently going over and adapting the JMdict parser for my own purposes, and in doing so I keep finding bugs, many small ones but occasionally a bigger one. The overall code quality seems low (about how I used to program 5 years ago so don't feel too offended).

So now I found a bug where re_restr is interpreted wrong. The DTD states "In its absence, all readings apply to all kanji elements." but your code doesn't link any readings to kanji in its absence (e.g. entry 々). I also found places where you throw hard exceptions when just 'yet unencountered items' appear in the XML data, which is always a bad idea for production code (for example JMParser#readCDATA).

Do you want to keep updated about every little thing I find?

LaurensWeyn commented 5 years ago

Sure. The JMDict parser was mainly a quick job to get rid of the horrible mess that was the regex based edict2 parser. It was such a huge improvement I haven't really touched it since.

Feel free to make a pull request or sugggest how to fix the issues. Though note the current active branch, major-refactor, isn't really compatible with the main branch.

Personally haven't experienced any issues with it in my usage, which is odd.

MarkJeronimus commented 5 years ago

In general, it seems like the parser can be rewritten using XMLStreamReader using it's basic form (i.e. without filters). I don't have experience in it but with filters the code may become even smaller.

Don't use checkTags but make the checks permanent. Without them, weird things can happen. Don't use exceptions like I said before.

There's also a reckless exception in commonScoreOf() (what if new dictionaries get added to JMdict?)

A collection containing enums should be done with an EnumSet, not a HashSet. Store in a field of just type Set so when reading it out it doesn't care what's in it.

Prevent null-pointers for collections wherever possible, use empty collections instead (EnumSet.ofNone(), Collections.emptyList(), etc). Java handles constructing/GC-ing objects super efficiently since jre7 or so, and when I measured it there's was no noticeable penalty when reading the entire MJdict-en.

Data-classes like Spelling and Sense should be immutable and all data given in a constructor . Alternatively pass partial data in a n overloaded constructor and use methods that add data and return a new object. Yet alternatively, use the builder pattern (not recommended for objects as simple as these).

Don't use 'magic numbers' like the codes in DefTag. Use another enum instead. The list in the comment isn't even complete. And as a general note, don't rely on comments (non-javadoc) to explain something relevant to the API.

Never, ever, rely on Enum.name(), use your own field for the name. The way you replace '-' with '_' is fragile and just a perfect example of why not to use name(). Enum entries are static finals so they should be in upper case.

Not in the parser, but I noticed the way you implement Japanese.isKana(), Japanese.isKanji(), etc is fragile. Better use Character.UnicodeBlock.of() for this. Character also has simpler methods like isIdeographic(). The class Japanese is also named inaptly, should be more like JapaneseUtilities.

toFullwidth() and toHalfwidth() are just wrong. Only alphanumeric and space are processed, and to top it off, tab is replaced by space. Punctuation etc are not handled while they just as easily could. (just shift the entire printable ASCII region from \u0021 to \uFF01) and handle space separately.

MarkJeronimus commented 5 years ago

I'd love to help you out and improve it, but I currently don't have more time otherwise I would just start from scratch myself. It just took 1 day to adapt your version to my needs and it became quite incompatible real quick.

LaurensWeyn / Spark-Reader

Manu bugs #30