Restructuring tables a bit

MiinaNo commented 3 years ago

@JakeJing , @xrotwang There are actually two issues but I thought I could write in the same place as they are about restructuring some things in the tables:

1) Sources - When going through the sources, I started wondering whether it would be better to have all the sources used for answering the questions in one and the same cell. At the moment they are scattered across the columns titled Sources, Examples, and Comments. Maybe I would still keep the source after the example (?), but I could move the sources used in the Comments cell to the Sources cell in the csv files. Would this be reasonable?
2) Hidden examples in comments (I copied Robert's e-mail below) - I will go through the examples and either move them to the example row or delete them. I would need to delete them in cases where the comment is about an answer with the value 0. I think it would only confuse people to have an example if the value is 0. This is why we used the comments section, but I think it was a bad idea.

I think all this is doable as there are 30 tables (not 3000), but I thought I would ask for a second opinion before I start :)

Robert's e-mail from 20 October: I just realized that quite a few examples seem to be hidden in comments, too. E.g. https://uralic.clld.org/languages/25

I think it would be useful to extract these into proper examples,

to unclutter the comment
to make the example available in a structured format

xrotwang commented 3 years ago

ad 1) I think consolidating the source info in the Sources column would be really good. If a source relates specifically to the example given, this could be indicated as source1;source2[example].

ad 2) I wouldn't remove the examples for the 0 (i.e. negative) answers. If anything, we could hide them - or clearly label them - in the web app. But they sure provide additional context for answers.
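
To illustrate the source1;source2[example] convention proposed above, here is a minimal sketch of how such a Sources cell could be read back. The function name parse_sources and the choice to return (key, tag) pairs are assumptions made for illustration; this is not the code used in the repository.

```python
import re

# Hypothetical pattern: a source key, optionally followed by one [tag].
SOURCE_PATTERN = re.compile(r"^(?P<key>[^\[\]]+?)(?:\[(?P<tag>[^\[\]]*)\])?$")


def parse_sources(cell):
    """Split a Sources cell on ";" and return (source_key, tag) pairs.

    tag is None when the source carries no [..] marker.
    """
    result = []
    for chunk in cell.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty chunks such as a trailing ";"
        match = SOURCE_PATTERN.match(chunk)
        if match is None:
            raise ValueError(f"Cannot parse source reference: {chunk!r}")
        result.append((match.group("key").strip(), match.group("tag")))
    return result


if __name__ == "__main__":
    print(parse_sources("source1;source2[example]"))
```

A source tagged with [example] can then be attached to the example rather than to the answer as a whole.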

MiinaNo commented 3 years ago

1) Great, this is very good advice. 2) Actually you are right, I started looking at the examples, and it feels a bit sad just to remove them.

Maybe I could take one table, make the changes and show the outcome to you before continuing with the next table.

xrotwang commented 3 years ago

Yes, that sounds like a good plan.

MiinaNo commented 3 years ago

@xrotwang

I have now been working with three tables to get a better picture (Lule Saami, Komi Zyrian and Pite Saami). I am writing my questions/comments here. Should I maybe do a pull request for the Lule Saami table so that you can compare what I have done? I also copied relevant examples below.

1) Sources. I have now done this as suggested. I also added p.c. after the name, as is done in the GB tables. Where there were two examples from different sources, I used source[example1];source[example2]. See e.g. the Kazym Khanty table.
2) Hidden examples. I feel that in some cases it is not possible to separate the example from the comment, see e.g. UT140 in the Lule Saami table. "There is a phonetic contrast between [æ] and [e], but there is no phonological opposition as they are in complementary distribution, e.g. [e] in germaj snake.NOM.SG vs. [æ] in gärmmaha snake.GEN.SG can be analysed as allophones of the same phoneme." I have to say that the hidden examples are the hardest case... I have no good solution.
3) I have been trying to remove all the ";"-s that were used in any meaning other than a line break. But was it fine to use "|" to show that another example follows? There are two ";"-s in a row if something is missing, e.g. UT064 låhkke;;reader; in the Lule Saami table. Was that the right thing to do? In some cases I will try to add the missing information, but in some cases adding a gloss does not seem very reasonable, e.g. UT113 goasske;;mother’s older sister;
4) Does "~" cause problems, e.g. UT053 bårå-dak ~ bårå-k;eat-ABE;without eating;
5) What shall we do with ":"? If possible, I would keep it, as it was used for certain series, e.g. tjábbe : tjáppe-p : tjáppe-mus;beautiful : beautiful-COMP : beautiful-SPRL;beautiful : more beautiful : the most beautiful; The idea was that they would finally end up like this:

tjábbe : tjáppe-p : tjáppe-mus beautiful : beautiful-COMP : beautiful-SPRL beautiful : more beautiful : the most beautiful

If the separator ";" is there, could we somehow specify that the colons stay within the same row?

Do "[]" cause problems? They are used for IPA in the phonology part but also in the case of some languages written in Cyrillic. I was adding ";"-s to Komi examples to ensure that this would end up in a separate row, e.g. новлы-тöм паськöм;[novlɨ-tɘm pɑɕkɘm];wear-PTCP.NEG clothes;clothes which have not been worn; This should be the outcome:

новлы-тöм паськöм [novlɨ-tɘm pɑɕkɘm] wear-PTCP.NEG clothes clothes which have not been worn

In the Komi table "[]" are also used in glosses, e.g. key[NOM].
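
The layout described in these questions (";" between rows, "|" between examples, an optional bracketed transcription row, and ";;" for a missing row) can be sketched as a small parser. The names parse_example and parse_cell and the returned dictionary keys are hypothetical, and the sketch assumes that a bracketed second row is always a transcription.

```python
def parse_example(raw):
    """Parse one example of the form primary;[ipa];gloss;translation;

    The [ipa] row is optional; empty rows (";;") become None.
    """
    fields = [field.strip() for field in raw.strip().split(";")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field produced by the trailing ";"
    ipa = None
    if len(fields) == 4 and fields[1].startswith("[") and fields[1].endswith("]"):
        ipa = fields.pop(1)  # assumption: a bracketed second row is a transcription
    if len(fields) != 3:
        raise ValueError(f"Expected primary;gloss;translation, got: {raw!r}")
    primary, gloss, translation = fields
    return {
        "primary": primary,
        "ipa": ipa,
        "gloss": gloss or None,
        "translation": translation or None,
    }


def parse_cell(cell):
    """Split an Examples cell on "|" and parse every example in it."""
    return [parse_example(part) for part in cell.split("|") if part.strip()]


if __name__ == "__main__":
    print(parse_cell("låhkke;;reader;"))
```

Because only ";" and "|" act as separators here, ":" and "~" inside a row pass through untouched, so a series such as tjábbe : tjáppe-p : tjáppe-mus stays in one row.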

Examples in the phonology part. This is probably the most problematic part. Sometimes there is a gloss, sometimes a translation, sometimes both. Maybe I would say that starting from UT116 everything could stay as it is. Besides, one whole example is often as long as an example sentence in the syntax part, e.g.

muorra [muorra] tree.NOM.SG vs. muora [muora] tree.GEN.SG

In principle it also could be featured as follows in the tables, which would mean adding ";"-s everywhere:

muorra [muorra] tree.NOM.SG

vs.

muora [muora] tree.GEN.SG

I remember that "vs" was problematic, but it feels important as some questions are about contrast.

xrotwang commented 3 years ago

A PR with the three files you worked on would be good. That would allow me to look at all the examples relating to any of your points above. I think having some numbers on how many problematic cases can be solved easily and how many remain would be good.

xrotwang commented 3 years ago

@MiinaNo Ok, I'll try to solve (automatically) as many of the remaining issues in these four sheets as possible, and then get back with comments regarding your questions.

xrotwang commented 3 years ago

@MiinaNo I think I got quite far with parsing the examples for the phonological features. One problem that I couldn't solve is the following:

I try to split multiple examples by comma or "vs.".
Of course, I should ignore commas in translations like 'hand, arm'.
So I only split on commas outside of "bracketed" or quoted text.
But there are quotes also within text content, tripping up my "quoted text" detection, e.g. păˈsan-ən table-LOC, wetˈraj-ət-ɑ bucket-PL-LAT.

What's the meaning of these single quotes in the transcription, e.g. păˈsan-ən? Can we remove/replace these?
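
The splitting strategy described in this comment can be sketched roughly as follows. This is not the actual parser from the PR; the name split_examples and the set of bracket and quote pairs are assumptions. The sketch also reproduces the reported failure: an ASCII single quote inside a word is taken as the start of quoted text, so a following comma is no longer seen as a separator.

```python
# Sketch only: which characters open a "protected" span is an assumption.
OPEN_TO_CLOSE = {"[": "]", "(": ")", "\u2018": "\u2019", "'": "'"}


def split_examples(text):
    """Split on "," or "vs." unless inside bracketed or quoted text."""
    parts, current = [], []
    closer = None  # closing character we are waiting for, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if closer is not None:
            current.append(ch)
            if ch == closer:
                closer = None
            i += 1
        elif ch in OPEN_TO_CLOSE:
            closer = OPEN_TO_CLOSE[ch]
            current.append(ch)
            i += 1
        elif ch == ",":
            parts.append("".join(current))
            current = []
            i += 1
        elif text.startswith("vs.", i) and (i == 0 or text[i - 1].isspace()):
            parts.append("".join(current))
            current = []
            i += 3
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(split_examples("muorra [muorra] tree.NOM.SG vs. muora [muora] tree.GEN.SG"))
```

With a proper stress mark (U+02C8) the comma-separated items split as intended, whereas a stray ASCII quote swallows everything up to the next quote or the end of the cell.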

MiinaNo commented 2 years ago

@xrotwang I checked the respective examples. Actually it should not be a single quote but a vertical line (the character code is 02C8). The vertical line is used to mark stress in questions UT116 to UT166. There are more such vertical lines in the tables of other languages. Unfortunately they cannot be removed... I started thinking that it is likely that someone has used the single quote to mark stress (which is incorrect) but when I checked the Kazym Khanty file, there seems to be a difference:

piri [ˈpi.riˑ] ’a kind of wild duck’

Or is the difference lost when the files are uploaded?

xrotwang commented 2 years ago

Ah, thanks. That's very helpful. So after a bit more checking:

The 02C8 characters are not lost on upload. The Kazym Khanty file contains 4 of these.

The formatting of stress is not consistent, though. So, e.g. here

păˈsan-ən table-LOC, wetˈraj-ət-ɑ bucket-PL-LAT, aˈnas-ət-a caravan.of.sledges-PL-LAT, păt'ʌam-a dark-LAT

the first three stress marks are correctly formatted, but the last one (in păt'ʌam-a) is the ASCII single quote.
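
A check of this kind can be sketched in a few lines. The function name audit_stress_marks is hypothetical, and the heuristic (an ASCII apostrophe with a word character on both sides) is an assumption: it also flags apostrophes that belong to the orthography, so the hits are candidates for manual review and not automatic fixes.

```python
import re

STRESS_MARK = "\u02c8"  # MODIFIER LETTER VERTICAL LINE
# ASCII apostrophe with a word character on both sides (heuristic).
SUSPECT = re.compile(r"\w'\w")


def audit_stress_marks(text):
    """Count proper stress marks and list tokens with a word-internal "'"."""
    return {
        "stress_marks": text.count(STRESS_MARK),
        "suspect_tokens": [tok for tok in text.split() if SUSPECT.search(tok)],
    }


if __name__ == "__main__":
    line = "p\u0103\u02c8san-\u0259n table-LOC, p\u0103t'\u028cam-a dark-LAT"
    print(audit_stress_marks(line))
```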

xrotwang commented 2 years ago

So for features UT116-UT166 I have the list of problematic examples down to 21. I think I can fix most of these by hand. Will add these changes to this PR.

MiinaNo commented 2 years ago

Good to hear, thanks!

MiinaNo commented 2 years ago

And as regards the stress mark, I will try to make sure the right one is used when going through the next tables.

xrotwang commented 2 years ago

Here's something that trips up my parser, but seems to be different from the stress mark issue:

cum’má [ˈt͡sumːæː] ‘kiss’ vs. cummá [ˈt͡suːmːæː] kiss.GENACC.SG

Here, "’" is used in the primary text transcription - but my parser confuses it with the start of a translation. Can we replace this with something else?

xrotwang commented 2 years ago

btw., here are my fixes: https://github.com/cldf-datasets/uratyp/pull/10/commits/ebbd394c1d3e9d650c96a0f8c49613a7e74a43bf

MiinaNo commented 2 years ago

Do you mean the one in the middle of the word cum’má? That is a hard one. I have to check what it is. It confuses even me.

xrotwang commented 2 years ago

Yes, that's what I mean. I sort of understand that some contrast is needed here. But maybe the contrast (with the additional length marker) in the IPA transcription is sufficient?

MiinaNo commented 2 years ago

I checked the table; it might have to do with the orthography. I found other examples, also in the syntax part, e.g.

Kás'sa lea beavddi vuolde

Would it be possible to replace it with sth that would not confuse your parser (sth that would look similar)?

xrotwang commented 2 years ago

Ok, will do some unicode shopping :)

MiinaNo commented 2 years ago

Sounds like a good plan :)

xrotwang commented 2 years ago

What about https://www.compart.com/en/unicode/U+201B ? It's sometimes used as an alternative for the English apostrophe. Changes are here: https://github.com/cldf-datasets/uratyp/pull/10/commits/1f465a15a12c37860b77c3839f04aa8f7dae8322
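
A replacement along these lines could look roughly like the sketch below. It is not taken from the linked commit; the function name and the rule (only quotes with a word character on both sides are replaced) are assumptions. It further assumes that ASCII quotes wrongly used as stress marks have already been corrected, since they would otherwise be replaced as well.

```python
import re

REPLACEMENT = "\u201b"  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
# ASCII apostrophe or U+2019 with a word character on both sides.
WORD_INTERNAL_QUOTE = re.compile(r"(?<=\w)['\u2019](?=\w)")


def protect_orthographic_apostrophes(text):
    """Replace word-internal apostrophes so they cannot be read as quotes."""
    return WORD_INTERNAL_QUOTE.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(protect_orthographic_apostrophes("cum\u2019m\u00e1 \u2018kiss\u2019"))
```

Quotes around a translation are left alone because they sit at a word edge, so a translation such as ‘kiss’ keeps its delimiters.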

xrotwang commented 2 years ago

And then there's

åadtjedh /ɔɐʨet/ [ɔɐʧeth] ‘to get, be allowed’

What does /.../ mean as opposed to [...]?

MiinaNo commented 2 years ago

Yes, let's check the one you are suggesting, i.e. the one that is used as an alternative to the apostrophe.

Yeah, both /.../ and [...] are used. Fortunately, mostly there is IPA. The two are not exactly the same. If the language expert was not able to provide IPA, we went for the phonemic one (it is a bit easier to produce). The difference is explained here: https://australianlinguistics.com/speech-sounds/phonemic-vs-phonetic/
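
Since both notations occur in the examples, a parser could keep them apart when extracting transcriptions. The sketch below is one hypothetical way to do that; the function name and the result keys are assumptions. Note that bracketed glosses such as key[NOM] would also be picked up as "phonetic", so a real parser would have to restrict this to the transcription part of an example.

```python
import re

# /.../ is treated as phonemic, [...] as phonetic (IPA) transcription.
TRANSCRIPTION = re.compile(r"/(?P<phonemic>[^/]+)/|\[(?P<phonetic>[^\]]+)\]")


def extract_transcriptions(example):
    """Collect phonemic and phonetic transcriptions found in an example."""
    found = {"phonemic": [], "phonetic": []}
    for match in TRANSCRIPTION.finditer(example):
        kind = match.lastgroup  # name of the alternative that matched
        found[kind].append(match.group(kind))
    return found


if __name__ == "__main__":
    print(extract_transcriptions("åadtjedh /ɔɐʨet/ [ɔɐʧeth] ‘to get, be allowed’"))
```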

MiinaNo commented 2 years ago

check > pick

xrotwang commented 2 years ago

@MiinaNo regarding question 4: Is the ~ supposed to mark reduplication (as in the Leipzig Glossing Rules)? If so, stripping the white space between ~ and surrounding morphemes would solve the problem.

MiinaNo commented 2 years ago

@xrotwang I did not actually know that ~ is also used to express reduplication. We actually used it to present two equally good options (e.g. there may be two equally productive action nominalizers). But now I understand that it was not a good idea. Maybe I could simply use a comma then? (Fortunately there are not many such examples.)

xrotwang commented 2 years ago

@MiinaNo I think we can still stick with ~ - I've encountered it meaning alternative options in other datasets, too - so I guess that's common practice. Do you think it makes sense to turn these cases into multiple examples - duplicating gloss and translation, if available? Or should the x ~ y be the primary text (or IPA) of just one example?

MiinaNo commented 2 years ago

Ok, good, duplicating gloss and translation sounds actually reasonable. Definitely gloss should be duplicated because often there are two different forms, whereas the translation is the same (but maybe it is not bad to show this).
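
The agreed approach (one example per alternative, with gloss and translation duplicated) could be sketched as follows. The function name expand_alternatives is hypothetical. The sketch assumes that only a ~ surrounded by white space separates alternatives, so that a ~ without spaces can still mark reduplication as in the Leipzig Glossing Rules.

```python
import re

# Only " ~ " (with white space) separates alternatives; "a~b" is left intact.
ALTERNATIVE_SEPARATOR = re.compile(r"\s+~\s+")


def expand_alternatives(primary, gloss, translation):
    """Turn "x ~ y" into one example per form, duplicating gloss/translation."""
    forms = ALTERNATIVE_SEPARATOR.split(primary.strip())
    glosses = ALTERNATIVE_SEPARATOR.split(gloss.strip()) if gloss else [gloss]
    if len(glosses) == 1:
        glosses = glosses * len(forms)  # same gloss for every alternative
    if len(glosses) != len(forms):
        raise ValueError("Number of glosses does not match number of forms")
    return [
        {"primary": form, "gloss": form_gloss, "translation": translation}
        for form, form_gloss in zip(forms, glosses)
    ]


if __name__ == "__main__":
    for example in expand_alternatives("bårå-dak ~ bårå-k", "eat-ABE", "without eating"):
        print(example)
```

If the gloss itself contains alternatives separated by " ~ ", each form gets its own gloss; otherwise the single gloss is copied to every form.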

xrotwang commented 2 years ago

I fixed a couple more examples in #10 . So I'd propose we merge #10 and then iron out the remaining issues?

MiinaNo commented 2 years ago

Ok, great, you can push the button :)

cldf-datasets / uratyp

Restructuring tables a bit #8