nataliacp opened this issue 8 years ago
Could you create a folder on this repository and deposit the Excel file in question there? (Unless there is a good reason for not doing so.) I'd like to have a look at it as well.
ahem, I don't know how to do this. I will send it to you by email and I will try to figure this out later tonight.
ah, no problem -- yeah, I spent a little time yesterday learning how to do this; there are a number of different approaches, and I chose GitHub Desktop + a cloned repository. Since you sent me the file, I could put it up on GitHub myself...
The good thing about GitHub is that you can upload TSV files, that is, single-sheet files in plain text with columns separated by tabs. You can export them from Google Drive. If you do so and upload them to GitHub, you can even view and search them nicely. I recommend doing this: once I have files for all the separate languages, I can generate a lot of information with my tools, such as frequencies of stuff. What is also good is that all changes can be tracked, and we know who changed what...
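For instance, once a wordlist is a flat TSV, quick checks become trivial. A minimal sketch, assuming a hypothetical export `maihiki.tsv` with a `CONCEPT` column:

```python
import csv
from collections import Counter

# Read a tab-separated wordlist exported from Google Drive.
# File name and column name are hypothetical.
with open("maihiki.tsv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# With plain text, frequency questions become one-liners.
concept_freq = Counter(row["CONCEPT"] for row in rows)
print(concept_freq.most_common(10))
```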
I saw your answer too late, Natalia, so here's what I recommend: use the GUI client for Mac or Windows, which works fairly well, and just look at some documentation on how to "commit" and "push". Ideally, use the flat tab-separated format, but I can also adapt it and then show you how I imagine a GitHub-based workflow, and how one can nicely link files in the issues.
Another issue with the Maihiki matching procedure is that there were still delimiter errors. There is a list of those in the same email I resent to Lev. Just make sure that these are fixed before we generate the spreadsheet for Maihiki.
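One way to catch such delimiter errors mechanically is to verify that every row has as many tab-separated fields as the header. A minimal sketch, assuming a hypothetical TSV export of the 740 list:

```python
import csv

# Flag rows whose field count differs from the header's -- the usual
# symptom of stray or missing delimiters. File name is hypothetical.
with open("maihiki-740.tsv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"line {lineno}: {len(row)} fields, expected {len(header)}")
```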
OK, great, we'll deal with this.
Just to clarify, the delimiter errors should be fixed in the 740 file.
@nataliacp: I checked the Mai data source file which I most recently sent you, and I'm not sure what's going on with the 220 missing items. They are on the complete lexicon list (or at least the first 25 were when I searched for them individually). However, when I copied and pasted the quasi-phonemic forms out of the 740-item list to search for them in my lexicon, the search often did not work, presumably because of the same kind of encoding issues we had with the Kubeo data. I don't know why that would have happened, because I entered all of the data using literally the same IPA keyboard and the same machine. Maybe it's a problem with Google Docs.
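For what it's worth, this is exactly how Unicode normalization mismatches behave: a precomposed vowel-plus-tone character and its base-letter-plus-combining-diacritic spelling render identically but compare unequal, so copy-and-paste search fails. A small demonstration, using the form mósá from this thread:

```python
import unicodedata

precomposed = "m\u00f3s\u00e1"   # "mósá" with precomposed ó and á (NFC)
decomposed = "mo\u0301sa\u0301"  # same word as base letters + combining acutes (NFD)

print(precomposed == decomposed)   # False -- this is why the search fails
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True once normalized
```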
Delimiters: I am fixing them now in the 740-item sheet.
@nataliacp: Delimiters now done in 740.
Thanks, Amalia! We are going to try again after we regularize the characters in both files and see what happens.
Unfortunately, the news is not that great for Maihiki. After regularizing both files to decompose all segments and diacritics, the number of unmatched items fell by 42. So encoding was part of the problem, but not all of it: there are still 177 items in the 740 list that cannot be found in the full source. We manually looked up some of them, and here is a summary of the errors we found:
To visualize the doubled tones clearly, use the Charis SIL or Doulos SIL fonts; they render the two marks stacked one over the other, so the doubling is easier to spot.
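The doubled tone marks can also be caught mechanically rather than by eye: after NFD decomposition, a doubled tone shows up as the same combining mark twice in a row. A sketch:

```python
import unicodedata

def doubled_marks(form):
    """Return indices where the same combining mark occurs twice in a row."""
    nfd = unicodedata.normalize("NFD", form)
    return [i for i in range(1, len(nfd))
            if unicodedata.combining(nfd[i]) and nfd[i] == nfd[i - 1]]

# A hypothetical form with a doubled acute on the first vowel:
print(doubled_marks("mo\u0301\u0301sa\u0301"))   # [3] -- one doubled mark found
```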
I am sending you by email a new Maihiki file, which is essentially the full source as you sent it to us, plus the items we cannot identify highlighted in blue. The file is sorted alphabetically, so it is easy to spot the entries that should match. Please make the appropriate modifications in the 740 list and the full template, not in the file I am sending you.
We also stumbled upon another problem that needs to be dealt with. There are homophonous words with completely different translations in the full source, and only one of them should be matched with the 740 list. Today we found mósá (with three matches). Seb is working on a list of those, but they are not included in the file I am sending you.
Finally, a question: in the full source file you have filled in an ORT field which looks identical (is it?) to the // field in the 740 file. Where should this information go in the end? If the source representation is phonemic, I see no reason not to put it in the PHM field instead of the ORT field (which is for orthographic representation).
Thanks very much for this. Sorry that the Mai data is being difficult. We will try to deal with this promptly.
I have just sent by email to Lev and Amalia the newest Maihiki file that Seb made, which includes all the multiple matches in yellow as well as the non-matches in blue. Here is what you need to do with it. (And please ignore my previous instructions; Seb had better ideas about how to deal with this!)
Blue lines: the data are taken from the 740 file and cannot be matched to the full source. Correct either the 740 file or the attached file (depending on which representation needs fixing) and erase the blue line.
Yellow lines: again, the data here are taken from the 740 file, and the matching is ambiguous for either of two reasons (or a combination of both):
- there are two or more entries in the 740 file with identical FUNs (most often because the word is polysemous, sometimes because of homophony) but only one entry in the original source.
- there are two or more entries in the full source with identical FUNs (normally because of homophony, since we haven't split polysemous words yet in the full source) and only one entry in the 740 file.
In case 1, proceed as you normally would for polysemous words, i.e. split the original data into one row per meaning, etc. Note that you need to make these adjustments in the white rows of the spreadsheet, taking the corresponding identifiers from the yellow rows.
In case 2, you need to choose which entry of the full source corresponds to the entry from the 740 file. Then copy the TUE of the yellow row into the TUE of the matching white row.
If both cases above are combined (i.e. you have multiple entries with identical FUNs in both the full source and the 740 file), you need to sort out the situation using both techniques as needed (see the sketch after these instructions).
In the end, you can erase the yellow lines.
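For reference, the case-1/case-2 split above can be computed rather than eyeballed. A sketch, using the FUN field from this thread; the row structure is hypothetical:

```python
from collections import defaultdict

def group_by_fun(rows):
    """Group spreadsheet rows (dicts with a FUN key) by their FUN value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["FUN"]].append(row)
    return groups

def classify_ambiguous(rows_740, rows_full):
    """Yield (fun, case) for each ambiguous FUN, mirroring cases 1 and 2."""
    g740, gfull = group_by_fun(rows_740), group_by_fun(rows_full)
    for fun in set(g740) & set(gfull):
        if len(g740[fun]) > 1 and len(gfull[fun]) > 1:
            yield fun, "combined"   # apply both techniques
        elif len(g740[fun]) > 1:
            yield fun, "case 1"     # split the full-source row, one row per meaning
        elif len(gfull[fun]) > 1:
            yield fun, "case 2"     # pick the matching source row and copy its TUE
```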
After all this is done, you will have a new file corresponding to the full source of Maihiki, which you can send to us to regenerate the importation template with all the matching (hopefully) done.
Let me know if you have any questions
Hi Natalia,
I looked at this file and made the corrections you requested to the first 1050 rows (roughly 1/3 of the file). However, I'm not totally sure that I understood what you wanted done with the polysemous rows, so I'm going to send you my partially-edited file and a text file that explains what I did and where, in a separate message.
Hello, from a quick look, things are not totally OK. No words should have identical id numbers. I think it is much easier to talk about it than to write another long email with instructions. I am available right now and for at least a couple of hours if you are, or else on Monday. Let me know what works for you.
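As a quick sanity check for the id problem, duplicated id numbers can be listed in a couple of lines; a sketch, with a hypothetical ID column name:

```python
from collections import Counter

def duplicate_ids(rows, key="ID"):
    """Return id values that occur more than once across the rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)
```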
Can we talk before the group Skype on Monday?
@nataliacp and @amaliaskilton, if it's practical, I'd like to participate in the meeting, if it happens this weekend. I'm free any time except after 12:30pm on Sunday.
I noticed a weird thing in the Maihiki data. There is one item dɨ́à translated as chupo and boil. I am not sure what is going on with this. In earlier versions it seems that this item was morphologically complex?
It's the noun 'boil,' like 'blister.' The earlier versions of the sheet had a morphologically complex item for this concept, but I removed the morphology because it is not constant across the paradigm (i.e. not the same in sg and pl).
Wow, I had no idea that this word even existed! I just looked it up in the dictionary; you mean this, right? "noun, Pathology. 1. a painful, circumscribed inflammation of the skin or a hair follicle, having a dead, suppurating inner core: usually caused by a staphylococcal infection."
It's not really my business, but with an eye to the future, I would add something more explanatory in cases like this for future users of the database. Coming across this in a dataset, I would interpret it as a mistake. Also, FYI, you could add both singular and plural forms if they are irregular, either in the same entry or in the comments.
"boil" occurs in this form actually in the concepticon:
Our latest attempt to match the full and the partial list for Maihiki failed to locate 220 items from the 740 list in the full source. You can find those 220 items in the Maihiki tab of the latest Excel spreadsheet I sent out. As a reminder, the matching is based on the FUN field in the full source file and the $$ field in the wordlist, which are expected to be identical.
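A minimal sketch of that matching step, assuming both files are flat tab-separated exports (file names hypothetical) and that the columns are headed FUN and $$ as described:

```python
import csv

def load_column(path, column):
    """Read one column of a tab-separated file into a set."""
    with open(path, encoding="utf-8", newline="") as f:
        return {row[column] for row in csv.DictReader(f, delimiter="\t")}

# The $$ field in the wordlist and the FUN field in the full source are
# expected to be identical, so the missing items are a set difference.
missing = load_column("maihiki-740.tsv", "$$") - load_column("maihiki-full.tsv", "FUN")
print(len(missing), "items from the 740 list not found in the full source")
```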