Strange four-digit COGID-s

Alexei-Kassian commented 10 years ago

Here is 'lez_indep_develop.xlsx', converted into 'lez_indep_develop.qlc': https://yadi.sk/d/djzv-9SibmcF7 Please note such COGID-s as 4079 or 3190. Is it ok?

LinguList commented 10 years ago

Okay, this is a bug, thanks to JavaScript!

For the record: the bug is due to wrong incrementation. The increment is done with a "+", but in this case it does not increment an integer; it appends to a string instead. So we need to refresh the type checking for integers in the spreadsheet converter.
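A minimal sketch of the failure mode described here, assuming the cell value arrives as a string. The variable names and the `nextCogid` helper are hypothetical and are not taken from the actual converter code:

```javascript
// Hypothetical values: a COGID read from a cell as a string.
const cogidFromCell = "3999";

// "+" with a string operand concatenates instead of adding.
const broken = cogidFromCell + 1; // "39991"

// Hypothetical helper: convert explicitly and check the type before incrementing.
function nextCogid(value) {
  const n = Number(value);
  if (!Number.isInteger(n)) {
    throw new TypeError("COGID is not an integer: " + String(value));
  }
  return n + 1;
}

const fixed = nextCogid(cogidFromCell); // 4000
```

Converting once at the point where the value is read keeps every later "+" an arithmetic operation.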

LinguList commented 10 years ago

Well, it turns out the bug is NOT from the spreadsheet converter, which did a good job of checking integers and the like, but from the XLSX file! Please check:

Keren Aghul, concept "cold"

It has COGID 3999 in the xlsx file.

Since we use a simple incrementation procedure, the output simply follows from that value.

Note, however, that using things like "3999" is not safe with the current spreadsheet converter, since the converter assumes that cogids are assigned for each meaning in an increasing fashion (this is what the current STARLING format in GLD looks like). So cogids such as 3999 should not be used in this context, but should rather be replaced by temporary negative IDs or by increasing integers.

The spreadsheet converter needs to trust the STARLING xlsx to some extent. Otherwise, we need artificial intelligence, and not just a simple JS program...

Alexei-Kassian commented 10 years ago

COGID 3999 is valid for Starling. Starling COGID-s can be very different. I can merge two databases, or I can extract a part from a database. Finally, I can use such numbers as 3999 for a specific purpose. But if Spreadsheet converts them properly and LingPy handles such COGID-s properly, ok.

LinguList commented 10 years ago

This depends on what starling actually does, and here, you need to be specific!

In earlier versions, all COGIDs were unique, no matter what concept they denoted, they would be different.

Now, they are unique only per CONCEPT (=word).

LingPy still follows the old paradigm of concept-independent cognate ids, since in the long run, we want to model cross-conceptual cognacy as well.

If there is no consecutive order for cognate ids in GLD input files, that is, if you use 1, 2, 10, 11, but not the numbers in between for each conceptual slot, then I need to rewrite the conversion of spreadsheet, since this is a security problem which might yield wrong cognate judgments in some analyses (not in MLN analyses, but in other applications).
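The condition in question can be tested per conceptual slot. The following is only a sketch, assuming the cognate ids of one concept are available as an array of integers; `hasGaps` is a hypothetical helper, not a function of the converter:

```javascript
// Hypothetical helper: true if the positive cognate ids of one concept
// do not form the consecutive range 1..n (e.g. 1, 2, 10, 11).
function hasGaps(cogids) {
  // Distinct positive ids only; negative (temporary) ids are ignored here.
  const positive = [...new Set(cogids.filter((id) => id > 0))];
  if (positive.length === 0) return false;
  // n distinct positive integers equal 1..n exactly when their maximum is n.
  return Math.max(...positive) !== positive.length;
}

hasGaps([1, 2, 10, 11]); // true
hasGaps([1, 2, 1, 3]);   // false
```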

Before rewriting this, however, it would be useful to have some account of the current practice in STARLING for handling these things. It has notably changed compared to earlier approaches, but so far, I couldn't find any description of this here.

Please re-open this issue, and provide the important information. I can then work on a solution.

Alexei-Kassian commented 10 years ago

Mattis, Starling COGID-s in lexicostatistical databases are unique only per concept. E.g., I can have

ALL 1 2 1 998 -1 -98 1 6576 1
BARK 2 2 1 4 -10 -9 215 54 8546

"ALL 1" can be the same, but can be not the same as "BARK 1" !

Starling notation is similar to that of Dyen's IE database.

I mean specifically lexicostatistical databases. In ordinary etymological databases, COGID-s refer to the entries of the head db. E.g., the Germanic db has references to the IE db.

Sometimes lexicostatistical db-s have the same structure, i.e., lexicostatistical COGID-s simply refer to the full etymological db. It means that the compiler of the lexicostatistical db gives total credence to the etymological solutions of the etymological db.

If so, "ALL 1" and "BARK 1" refer to the same record in the etymological db (if ALL and BARK could be etymologically related).

Nevertheless, when reconstructing phylogeny, Starling only compares numbers within one concept.

So my proposal is that

ALL 1 2 1 998 -1 -98 1 6576 1
BARK 2 2 1 4 -10 -9 2 54 8546

should be converted into qlc-format as

ALL 1 2 1 3 -1 -2 1 4 1
BARK 5 5 6 7 -3 -4 5 8 9
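The proposed conversion can be sketched as follows. This is only an illustration under stated assumptions: the input is a list of `[concept, cogids]` pairs, positive and negative ids get separate running counters, and new ids are handed out in order of first appearance. `renumberCogids` is a hypothetical name, not the converter's actual function:

```javascript
// Hypothetical helper: map per-concept COGIDs to concept-independent ones.
function renumberCogids(rows) {
  const mapping = new Map(); // key: concept + separator + old id -> new id
  let nextPositive = 1;      // running counter for ordinary ids (0 is counted here too)
  let nextNegative = -1;     // running counter for negative ids
  return rows.map(([concept, cogids]) => [
    concept,
    cogids.map((oldId) => {
      const key = concept + "\u0000" + oldId;
      if (!mapping.has(key)) {
        // First time this (concept, id) pair is seen: assign a fresh id.
        mapping.set(key, oldId < 0 ? nextNegative-- : nextPositive++);
      }
      return mapping.get(key);
    }),
  ]);
}

const converted = renumberCogids([
  ["ALL", [1, 2, 1, 998, -1, -98, 1, 6576, 1]],
  ["BARK", [2, 2, 1, 4, -10, -9, 2, 54, 8546]],
]);
// converted[0][1] -> [1, 2, 1, 3, -1, -2, 1, 4, 1]
// converted[1][1] -> [5, 5, 6, 7, -3, -4, 5, 8, 9]
```

Keying the mapping on the concept together with the old id is what keeps "ALL 1" and "BARK 1" apart after conversion.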

LinguList commented 10 years ago

All right. Thanks for the clarification. This is doable, but requires a bit more computation from my side. So please re-open this issue (unless you have done so already), and I will deal with it.

For the moment, however, the analyses will be consistent with MLN, no matter how we convert, since MLN also calculates on a concept basis (the strange "PAP" column in the output of MLN analyses). For this reason, analyses done with the current settings will be fine and will not produce any errors due to the potential security hole in the conversion.

In the future, however, I should fix this bug, in order to guarantee the most consistent of all possible conversions.

I need two weeks to fix this, since I will be traveling next week.

Alexei-Kassian commented 10 years ago

OK, great. Thanks.

P.S. How can I reopen this thread?

LinguList commented 10 years ago

If you read this thread in the web browser and do not respond to it via email (as I just did), just look below, next to the "comment" button, where it says: "reopen and comment". No need to bother right now, because I already reopened it myself with this comment (just thought it was also useful for you to know of this feature)))).

Alexei-Kassian commented 10 years ago

I've checked my closed threads: the only thing I see is the green "Comment" button. No "Reopen" option is available. Apparently I don't have the rights for that operation.

LinguList commented 10 years ago

I just invited you to the "team", so now you should have full access to the repository.

dighl / spreadsheet

Strange four-digit COGID-s #2