digling / burmish

LingPy plugin for handling a specific dataset
GNU General Public License v2.0

semi colons in underlying data #16

Closed nh36 closed 9 years ago

nh36 commented 9 years ago

xʐua⁵⁵; xa⁵⁵ mʐua³¹  | 474.41

I take this to mean that xʐua⁵⁵ and xa⁵⁵ mʐua³¹ are both possible forms of the word, but the system merges them into a single word, xʐua⁵⁵xa⁵⁵ mʐua³¹. This is 2269 Achang Longchuan.

LinguList commented 9 years ago

Yeah, I realized this when manually editing the 0-cases of the "check_segments". There are actually more problems: I also found instances of " ~ " being used as a separator between words.

The general problem is that it is close to impossible to find all the characters that people use to separate two words in their entries. Sometimes it's "/", sometimes it's "~", sometimes it's just a space, sometimes they put stuff in brackets, etc.
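To illustrate the separator problem, here is a minimal sketch in plain Python, not the plugin's actual code. The `split_variants` helper and the `SEPARATORS` pattern are hypothetical names, and the sketch assumes that only ";", "/" and "~" are safe to split on. A plain space is deliberately left alone, since a form like xa⁵⁵ mʐua³¹ contains one, and bracketed material is not handled at all.

```python
import re

# Hypothetical separator set: only the unambiguous characters named in this thread.
# Spaces are NOT treated as separators, because a single form may contain them.
SEPARATORS = re.compile(r"\s*[;/~]\s*")


def split_variants(entry):
    """Return the variant forms found in a raw entry, in their original order."""
    parts = SEPARATORS.split(entry.strip())
    # Drop empty strings left behind by leading or trailing separators.
    return [part for part in parts if part]


variants = split_variants("xʐua⁵⁵; xa⁵⁵ mʐua³¹")
```

With this pattern, the entry from the first comment comes back as two forms instead of being fused into one, and the space inside the second form is preserved.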

Actually, I don't know why I missed splitting the segments that contain a semi-colon in the first place. It was a rather long time ago that I made the first preparation of the data, even before the app was running. I just checked the entries: only 39 of them contain a semi-colon, which is probably why it was missed.

I corrected all these entries manually, choosing one of the two possible variant words in each case. You can find the cases I edited by looking for "1 @ lingulist" in the "check_segments".

By doing this manual check, we may lose a few interesting word forms initially, but since there are only 39 cases, it should be no problem to re-insert them manually from the "original_entry" column if that turns out to be necessary later.
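A rough sketch of how the dropped variants could be recovered later, assuming the raw string is still available under "original_entry". The `choose_variant` helper and the example row are hypothetical, and keeping the first form is only an assumption made for illustration; the manual edits did not necessarily follow that rule.

```python
def choose_variant(original_entry, separator=";"):
    """Split a raw entry on the separator and return (kept_form, dropped_forms)."""
    forms = [form.strip() for form in original_entry.split(separator)]
    forms = [form for form in forms if form]
    if not forms:
        return "", []
    # Assumption for this sketch: keep the first form, remember the rest.
    return forms[0], forms[1:]


# Hypothetical row; only the "original_entry" column name comes from this thread.
row = {"original_entry": "xʐua⁵⁵; xa⁵⁵ mʐua³¹"}
kept, dropped = choose_variant(row["original_entry"])
```

Because the raw entry is never overwritten, the dropped forms can be regenerated at any point without consulting the source again.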

Right now, it is probably more important to get the variation out of the data. We can re-introduce it once we can handle it.

Please re-open this issue if you don't agree with my decision.