Open MiinaNo opened 3 years ago
ad 1) I think consolidating the source info in the Sources column would be really good. If a source relates specifically to the example given, this could be indicated as source1;source2[example].
ad 2) I wouldn't remove the examples for the 0 (i.e. negative) answers. If anything, we could hide them - or clearly label them - in the web app. But they sure provide additional context for answers.
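A minimal sketch of how a consolidated Sources cell in the source1;source2[example] notation could be read back. The function name `parse_sources` and the returned tuple layout are hypothetical, chosen for illustration only; they are not taken from the dataset's actual code.

```python
import re

# One source entry: a key, optionally followed by a bracketed scope
# such as "[example]". Assumes keys never contain square brackets.
SOURCE_PATTERN = re.compile(r"^(?P<key>[^\[\]]+?)\s*(?:\[(?P<scope>[^\[\]]*)\])?$")


def parse_sources(cell):
    """Split a Sources cell on ';' and return (key, scope) pairs.

    scope is None when the source applies to the answer as a whole.
    """
    refs = []
    for chunk in cell.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty entries, e.g. a trailing ';'
        match = SOURCE_PATTERN.match(chunk)
        if match is None:
            raise ValueError(f"cannot parse source entry: {chunk!r}")
        refs.append((match.group("key").strip(), match.group("scope")))
    return refs
```

Keeping the scope as a separate value would let the web app attach a source to one example without losing the sources that support the answer in general.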
1) Great, this is very good advice. 2) Actually you are right, I started looking at the examples, and it feels a bit sad just to remove them.
Maybe I could take one table, make the changes and show the outcome to you before continuing with the next table.
Yes, that sounds like a good plan.
@xrotwang
I have now been working with three tables to get a better picture (Lule Saami, Komi Zyrian and Pite Saami). I am writing my questions/comments here. Should I maybe do a pull request for the Lule Saami table so that you can compare what I have done? I have also copied relevant examples below.
1) Sources. I have now done this as asked. I also added p.c. after the name, as is done in the GB tables. Where there were two examples from different sources, I used source[example1];source[example2]. See e.g. the Kazym Khanty table.
2) Hidden examples. I somehow feel that in some cases it is not possible to separate the example from the comment, see e.g. UT140 in the Lule Saami table. "There is a phonetic contrast between [æ] and [e], but there is no phonological opposition as they are in complementary distribution, e.g. [e] in germaj snake.NOM.SG vs. [æ] in gärmmaha snake.GEN.SG can be analysed as allophones of the same phoneme." I have to say that the hidden examples are the hardest case... I have no good solution.
3) I have been trying to remove all the ";"-s that were used in any meaning other than a line break. But was it fine to use "|" to show that another example follows? There are two ";"-s in a row if something is missing, e.g. UT064 låhkke;;reader; in the Lule Saami table. Was that the right thing to do? In some cases I will try to add the missing information, but in others adding a gloss does not seem very reasonable, e.g. UT113 goasske;;mother’s older sister;
4) Does "~" cause problems? E.g. UT053 bårå-dak ~ bårå-k;eat-ABE;without eating;
5) What shall we do with ":"? If possible I would keep it, as it was used for certain series, e.g. tjábbe : tjáppe-p : tjáppe-mus;beautiful : beautiful-COMP : beautiful-SPRL;beautiful : more beautiful : the most beautiful; The idea was that they would finally end up like this:
tjábbe : tjáppe-p : tjáppe-mus
beautiful : beautiful-COMP : beautiful-SPRL
beautiful : more beautiful : the most beautiful
If the separator ";" is there, could we somehow specify that the colon-separated items stay in the same row?
новлы-тöм паськöм [novlɨ-tɘm pɑɕkɘm] wear-PTCP.NEG clothes clothes which have not been worn
In the Komi table, "[]" is also used in glosses, e.g. key[NOM].
muorra [muorra] tree.NOM.SG vs. muora [muora] tree.GEN.SG
In principle it could also be presented as follows in the tables, which would mean adding ";"-s everywhere:
muorra [muorra] tree.NOM.SG
vs.
muora [muora] tree.GEN.SG
I remember that "vs" was problematic, but it feels important as some questions are about contrast.
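To make the separator questions above concrete, here is a rough sketch of how an Examples cell could be split, assuming the layout described in this comment: ";" separates primary text, gloss and translation (with a trailing ";"), an empty field means the information is missing, and "|" starts another example. The names `Example` and `parse_examples` are hypothetical and do not come from the actual parser.

```python
from typing import NamedTuple


class Example(NamedTuple):
    primary: str      # object-language text
    gloss: str        # may be empty when no gloss was given
    translation: str


def parse_examples(cell):
    """Parse 'text;gloss;translation;' chunks, separated by '|'."""
    examples = []
    for raw in cell.split("|"):
        raw = raw.strip()
        if not raw:
            continue
        fields = [field.strip() for field in raw.split(";")]
        if fields[-1] == "":
            fields.pop()  # drop the empty string after the trailing ';'
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {raw!r}")
        examples.append(Example(*fields))
    return examples
```

With this reading, låhkke;;reader; yields an example with an empty gloss, and characters such as "~" and ":" pass through untouched inside a field, so series like tjábbe : tjáppe-p : tjáppe-mus stay together in one row.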
A PR with the three files you worked on would be good. That would allow me to look at all the examples relating to any of your points above. I think having some numbers on how many problematic cases can be solved easily and how many remain would be good.
@MiinaNo Ok, I'll try to solve (automatically) as many of the remaining issues in these four sheets as possible, and then get back with comments regarding your questions.
@MiinaNo I think I got quite far with parsing the examples for the phonological features. One problem that I couldn't solve is the following:
vs.
'hand, arm'
păˈsan-ən table-LOC, wetˈraj-ət-ɑ bucket-PL-LAT
What's the meaning of these single quotes in the transcription, e.g. păˈsan-ən? Can we remove/replace these?
@xrotwang I checked the respective examples. Actually it should not be a single quote but a vertical line (the character code is 02C8). The vertical line is used to mark stress in questions UT116 to UT166. There are more such vertical lines in the tables of other languages. Unfortunately they cannot be removed... I started thinking that it is likely that someone has used the single quote to mark stress (which is incorrect) but when I checked the Kazym Khanty file, there seems to be a difference:
piri [ˈpi.riˑ] ’a kind of wild duck’
Or is the difference lost when the files are uploaded?
Ah, thanks. That's very helpful. So after a bit more checking:
păˈsan-ən table-LOC, wetˈraj-ət-ɑ bucket-PL-LAT, aˈnas-ət-a caravan.of.sledges-PL-LAT, păt'ʌam-a dark-LAT
the first three stress marks are correctly formatted, the last one is the ASCII single quote, though:
p 0070
ă 0103
t 0074
' 0027
ʌ 028c
a 0061
m 006d
- 002d
a 0061
So for features UT116-UT166 I have the list of problematic examples down to 21. I think I can fix most of these by hand. Will add these changes to this PR.
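For reference, a small sketch of the kind of check that produces a listing like the one above, plus a possible automatic repair. The helper names `codepoints` and `fix_stress_mark` are made up for this illustration, and the repair assumes that a word-internal ASCII apostrophe in features UT116-UT166 is always meant as the stress mark U+02C8; elsewhere an apostrophe may be part of the orthography, so it should not be applied blindly.

```python
import re

STRESS_MARK = "\u02c8"  # MODIFIER LETTER VERTICAL LINE, the correct stress mark


def codepoints(text):
    """Return (character, 4-digit hex code point) pairs for each character."""
    return [(ch, f"{ord(ch):04x}") for ch in text]


def fix_stress_mark(word):
    """Replace an ASCII apostrophe between two word characters by U+02C8.

    Apostrophes at the start or end of the string (e.g. quotes around a
    translation) are left alone.
    """
    return re.sub(r"(?<=\w)'(?=\w)", STRESS_MARK, word)


if __name__ == "__main__":
    for ch, code in codepoints("p\u0103t'\u028cam-a"):
        print(ch, code)
```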
Good to hear, thanks!
As regards the stress mark, I will try to make sure that the right one is used when going through the next tables.
Here's something that trips up my parser, but seems to be different from the stress mark issue:
cum’má [ˈt͡sumːæː] ‘kiss’ vs. cummá [ˈt͡suːmːæː] kiss.GENACC.SG
Here, "’" is used in the primary text transcription - but my parser confuses it with the start of a translation. Can we replace this with something else?
btw., here are my fixes: https://github.com/cldf-datasets/uratyp/pull/10/commits/ebbd394c1d3e9d650c96a0f8c49613a7e74a43bf
Do you mean the one in the middle of the word cum’má? That is a hard one. I have to check what it is. It confuses even me.
Yes, that's what I mean. I sort of understand that some contrast is needed here. But maybe the contrast (with the additional length marker) in the IPA transcription is sufficient?
I checked the table, it might be to do with the orthography. I found other examples, also in the syntax part, e.g.
Kás'sa lea beavddi vuolde
Would it be possible to replace it with sth that would not confuse your parser (sth that would look similar)?
Ok, will do some unicode shopping :)
Sounds like a good plan :)
What about https://www.compart.com/en/unicode/U+201B ? It's sometimes used as an alternative for the English apostrophe. Changes are here: https://github.com/cldf-datasets/uratyp/pull/10/commits/1f465a15a12c37860b77c3839f04aa8f7dae8322
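A minimal sketch of such a replacement, under the assumption that only a "’" (U+2019) standing between two word characters is orthographic, while a "’" at the edge of a quoted string closes a translation. The function name `protect_word_internal_quote` is hypothetical and this is not the code from the linked commit.

```python
import re

RIGHT_QUOTE = "\u2019"     # ’ as found in cum’má and at the end of ‘kiss’
REVERSED_QUOTE = "\u201b"  # ‛ the proposed stand-in inside words

# A right quote with a word character on both sides.
WORD_INTERNAL = re.compile(r"(?<=\w)" + RIGHT_QUOTE + r"(?=\w)")


def protect_word_internal_quote(primary_text):
    """Swap word-internal U+2019 for U+201B so it cannot open a translation."""
    return WORD_INTERNAL.sub(REVERSED_QUOTE, primary_text)
```

This should only be run on the primary text: in an English translation such as mother’s older sister the same pattern would also match the possessive apostrophe.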
And then there's
åadtjedh /ɔɐʨet/ [ɔɐʧeth] ‘to get, be allowed’
What does /.../ mean as opposed to [...]?
Yes, let's check the one you are suggesting, i.e. the one that is used as an alternative to the apostrophe.
Yeah, both /.../ and [...] are used. Fortunately, mostly there is IPA. The two are not exactly the same. If the language expert was not able to provide IPA, we went for the phonemic one (it is a bit easier to produce). The difference is explained here: https://australianlinguistics.com/speech-sounds/phonemic-vs-phonetic/
check > pick
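Since both notations occur, a parser could keep them apart rather than treating them as the same thing. The following is only a sketch: the function name `transcriptions` is invented here, and it assumes that a transcription is always set off by whitespace, which is what distinguishes [muorra] from a bracket inside a gloss such as key[NOM].

```python
import re

# Delimited by whitespace (or string edges) on both sides, so that
# gloss-internal brackets like key[NOM] are not picked up.
PHONEMIC = re.compile(r"(?<!\S)/([^/]+)/(?!\S)")
PHONETIC = re.compile(r"(?<!\S)\[([^\]]+)\](?!\S)")


def transcriptions(example):
    """Return (phonemic, phonetic) transcriptions; None where absent."""
    phonemic = PHONEMIC.search(example)
    phonetic = PHONETIC.search(example)
    return (
        phonemic.group(1) if phonemic else None,
        phonetic.group(1) if phonetic else None,
    )
```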
@MiinaNo regarding question 4: Is the ~ supposed to mark reduplication (as in the Leipzig Glossing Rules)? If so, stripping the white space between ~ and surrounding morphemes would solve the problem.
@xrotwang
I did not actually know that ~ is also used to express reduplication. We actually used it to present two equally good options (e.g. there may be two equally productive action nominalizers). But now I understand that it was not a good idea. Maybe I could simply use a comma then? (Fortunately there are not many such examples.)
@MiinaNo I think we can still stick with ~ - I've encountered it meaning alternative options in other datasets, too - so I guess that's common practice. Do you think it makes sense to turn these cases into multiple examples - duplicating gloss and translation, if available? Or should the x ~ y be the primary text (or IPA) of just one example?
Ok, good, duplicating gloss and translation sounds actually reasonable. Definitely gloss should be duplicated because often there are two different forms, whereas the translation is the same (but maybe it is not bad to show this).
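A possible shape for that expansion, as a sketch only. The function name `expand_alternatives` is hypothetical, and the code assumes that alternatives are written with whitespace around ~ (as in bårå-dak ~ bårå-k), so that a ~ without surrounding spaces can still be read as reduplication in the Leipzig sense.

```python
import re

# "~" with whitespace on both sides separates alternative forms;
# "a~b" without spaces is left intact.
ALTERNATIVE_SEPARATOR = re.compile(r"\s+~\s+")


def expand_alternatives(primary, gloss, translation):
    """Turn 'x ~ y' into one example per form, duplicating gloss and translation."""
    forms = [form for form in ALTERNATIVE_SEPARATOR.split(primary.strip()) if form]
    return [(form, gloss, translation) for form in forms]
```

For UT053 this would give two examples, bårå-dak and bårå-k, each with the gloss eat-ABE and the translation "without eating".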
I fixed a couple more examples in #10. So I'd propose we merge #10 and then iron out the remaining issues?
Ok, great, you can push the button :)
@JakeJing , @xrotwang There are actually two issues but I thought I could write in the same place as they are about restructuring some things in the tables:
1) Sources - When going through the sources, I started wondering whether it would be better to have all the sources used for answering the questions in one and the same cell. At the moment they are scattered between the columns titled Sources, Examples, and Comments. Maybe I would still keep the source after the example (?), but I could move the sources used in the comments cell to the source cell in the csv-files. Would this be reasonable?
2) Hidden examples in comments (I copied Robert's e-mail below) - I will go through the examples and either move them to the example row or delete them. I need to delete them in cases where the comment is about an answer with the value 0. I think it would only confuse people to have an example if the value is 0. This is why we used the comments section, but I think it was a bad idea.
I think all this is doable as there are 30 tables (not 3000), but I thought I would ask for a second opinion before I start :)
Robert's e-mail from 20 October: I just realized that quite a few examples seem to be hidden in comments, too. E.g. https://uralic.clld.org/languages/25
I think it would be useful to extract these into proper examples,