PSeitz / wana_kana_rust

Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
MIT License
70 stars 14 forks source link

Behave more logically consistent with empty inputs #6

Closed tmfink closed 1 year ago

tmfink commented 4 years ago

The mathematical convention is that for any element property, empty sets satisfy the property (because all elements satisfy the property). This is called "vacuous truth".

This is consistent with the behavior of Iterator::all() and Iterator::any().

For example, we will treat:

is_hiragana("") -> true
is_japanese("") -> true

Without this change, we would have the behavior that adding characters to a string could unintuitively change a property from false -> true -> false:

is_japanese("") -> false
is_japanese("あ") -> true  // Add あ
is_japanese("あA") -> false  // Add A

Links:

tmfink commented 4 years ago

Also, this is a breaking change that will require a minor version bump

PSeitz commented 4 years ago

Yes, I also thought about that when porting that library, but the original implementation behaves like this: https://github.com/WaniKani/WanaKana/blob/master/src/isKanji.js#L21

I think the reasoning behind this is, that you want to use them to know if you should start some conversion, which is not the case for empty strings, or maybe you have a list of strings, where you would want to know if they contain Japanese, e.g.

let words = ["hello", "", "some", "text"];
let japanese_word_detected = words.iter().some(|word| is_japanese(word));
tmfink commented 4 years ago

I recommend we break with the original API. At least in the isKanji() example you linked to, there is no documentation about the behavior for empty input. Users would need to read the code know the expected behavior.

We can bring this issue up with the original implementation as well. The behavior is opposite of what is expected.

In either case, we should clearly document the behavior/convention used in the crate.

PSeitz commented 4 years ago

I think it's often the case where you want to check to the existence of japanese/kanji characters to switch to a special behaviour, like in the example I provided. Do you have a counter-example, where you would want the empty=japanese behaviour?

You set analogy makes sense, but as an counterexample, since this is not stricly a set, but also a natural language field: If someone doesn't speak, does he speak japanese? ;)

I agree the behaviour should be well documented.

tmfink commented 4 years ago

I think it's often the case where you want to check to the existence of japanese/kanji characters to switch to a special behaviour, like in the example I provided. Do you have a counter-example, where you would want the empty=japanese behaviour?

Sorry for the late reply. Here are some examples:

1) Given a Japanese string, should all substrings be considered Japanese (including ""). The expected behavior is yes. 2) If you append the empty string to a Japanese string, the result will be a Japanese string. It would be strange if empty string was not considered Japanese, since that would mean appending a non-Japanese would preserve the Japanese property.

PSeitz commented 3 years ago

Sorry for the late reply. Here are some examples:

1. Given a Japanese string, should all substrings be considered Japanese (including `""`). The expected behavior is yes.

2. If you append the empty string to a Japanese string, the result will be a Japanese string. It would be strange if empty string was not considered Japanese, since that would mean appending a non-Japanese would preserve the Japanese property.

Sorry for the late reply. The examples assume that you want strict set behaviour. I meant a use case where you want strict set behaviour.

tmfink commented 3 years ago

The examples assume that you want strict set behaviour. I meant a use case where you want strict set behaviour.

I would expect strict set behavior for any function that sounds like a character property applied to any/all characters of a string:

For example, is_hiragana() and is_hiragana() applied to collections of characters (&str in this case). I would expect implementation to be something like:

fn is_hiragana_str(s: &str) -> bool {
    s.chars().all(is_hiragana_char)  // Iterator::all() follows set semantics
}

In contrast, a function like is_mixed() does not fit into this case. There is no "mixed" property that applies to an individual character. Instead, it depends on several character properties and whether more than one "class" of characters is represented.

PSeitz commented 3 years ago

I would expect strict set behavior for any function that sounds like a character property applied to any/all characters of a string:

* "this property holds for ALL characters"

* "this property holds for ANY character"

But this is the case, the issue is only the empty input, which is not a character.