Closed nh36 closed 9 years ago
Yeah, I realized this while manually editing the 0-cases in "check_segments". There are actually more problems: I also found instances of " ~ " being used as a separator between words.
The general problem is that it is close to impossible to find every character that people use to separate two words in their entries. Sometimes it's "/", sometimes it's "~", sometimes it's just a space, sometimes they put stuff in brackets, etc.
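As a rough illustration of the splitting step, here is a minimal sketch that breaks an entry into variant forms. The separator set (";", "/", "~") and the function name `split_variants` are assumptions for illustration, not the code the app actually uses. A plain space is deliberately not treated as a separator, since it also occurs inside a single form, and bracketed material is not handled at all.

```python
import re

# Hypothetical separator set: ";", "/" or "~", with optional surrounding whitespace.
# A bare space is NOT a separator here, because forms like "xa⁵⁵ mʐua³¹" contain one.
SEPARATORS = re.compile(r"\s*[;/~]\s*")


def split_variants(entry):
    """Split a raw entry into its variant forms, dropping empty pieces."""
    parts = SEPARATORS.split(entry)
    return [part.strip() for part in parts if part.strip()]


print(split_variants("xʐua⁵⁵; xa⁵⁵ mʐua³¹"))
```

Any separator that is missing from the pattern silently leaves the variants glued together, which is exactly the failure described in this issue, so the pattern would have to grow as new cases turn up.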
Actually, I don't know why I missed splitting the segments that contain a semi-colon in the first place. It was a rather long time ago that I first prepared the data, even before the app was running. I just checked the entries: only 39 of them contain a semi-colon, which is probably why it was missed.
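A check like the one described above can be sketched as a simple count of entries per candidate separator. The candidate list and the name `count_separators` are hypothetical, and the entries are assumed to be available as a plain list of strings.

```python
from collections import Counter

# Hypothetical list of characters worth auditing; extend as new cases turn up.
CANDIDATES = (";", "/", "~")


def count_separators(entries):
    """Count how many entries contain each candidate separator at least once."""
    counts = Counter()
    for entry in entries:
        for sep in CANDIDATES:
            if sep in entry:
                counts[sep] += 1
    return counts
```

A rare separator shows up with a small count in such an audit, which makes it easier to decide whether to handle it in code or to fix the affected entries by hand.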
I have now corrected all of these entries manually, choosing one of the two possible variant words in each case. You can find the cases I edited by looking for "1 @ lingulist" in "check_segments".
By doing this manual check, we may lose a few interesting word forms initially, but since it is only 39 cases, it should be no problem to manually re-insert them from the "original_entry" column if that turns out to be necessary later.
Right now, it is probably more important to get the variation out of the data. We can re-introduce it once we can handle it.
Please re-open this issue if you don't agree with my decision.
xʐua⁵⁵; xa⁵⁵ mʐua³¹ | 474.41
I take this to mean that xʐua⁵⁵ and xa⁵⁵ mʐua³¹ are two possible forms of the word, but the system merges them into the single word xʐua⁵⁵xa⁵⁵ mʐua³¹. This is 2269 Achang Longchuan.