PhonologicalCorpusTools / SLPAA


side script: minimally-viable merge before LREC 2024 #299

Closed kvesik closed 6 months ago

kvesik commented 6 months ago

We do have a proper issue #275 for merging corpora, but Kathleen and I were discussing the fact that it would be useful to have even a super-basic merge function available by the end of April, so that we could combine corpora and curate signs that show particular characteristics that would be useful to talk about at LREC 2024.

kvesik commented 6 months ago

@kchall could you please try out the Merge corpora function (in the Analysis functions menu) in branch 299 and let me know if you have any concerns about the current interface, functionality, or behaviour?

FYI, in this implementation, entryIDs are increased only as far as necessary to avoid collisions. For example, if Corpus A has IDs 1, 2, 3, ..., 20, and Corpus B has its minimum ID set to 10, with IDs 10, 11, 12, ..., 25, then the merged corpus will have IDs 1, 2, 3, ..., 20 (from A, unchanged) and 21, 22, 23, ..., 36 (from B, each shifted up by 11).
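For anyone wanting to check the arithmetic, here is a minimal sketch of one way the renumbering could work. It is not the code in branch 299: the function name `shift_entry_ids` and the plain lists of integer IDs are hypothetical, and it assumes Corpus B is shifted as a block by a single constant, which matches the example above.

```python
def shift_entry_ids(ids_a, ids_b):
    """Renumber ids_b so that none collides with ids_a.

    Hypothetical helper, not SLPAA's API. Assumes B is shifted as a
    block by the smallest constant that places its minimum ID just
    after A's maximum ID. If B already starts above A, nothing moves.
    """
    if not ids_a or not ids_b:
        # Nothing to collide with, so keep both as they are.
        return list(ids_a), list(ids_b)
    shift = max(0, max(ids_a) + 1 - min(ids_b))
    return list(ids_a), [entry_id + shift for entry_id in ids_b]


corpus_a = list(range(1, 21))   # IDs 1..20
corpus_b = list(range(10, 26))  # IDs 10..25
merged_a, merged_b = shift_entry_ids(corpus_a, corpus_b)
print(merged_a[0], merged_a[-1])  # 1 20
print(merged_b[0], merged_b[-1])  # 21 36
```

With these inputs the shift is 20 + 1 - 10 = 11, so B's 16 entries land on 21 through 36, as described above.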

kchall commented 6 months ago

Looks great, thanks, @kvesik! This currently does not check for duplicated gloss names, correct? And just keeps both entries as separate entries with the same gloss? That's probably fine for our purposes, but if it's easy to have it tell us a list of the duplicated glosses, that might be useful so we can back-check. Thanks!

kvesik commented 6 months ago

Once the merge is complete, the status window shows not only the names of the successfully merged corpora, but also (if applicable) any duplicated glosses and/or duplicated lemmas.
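A rough sketch of how such a duplicate report could be computed, for illustration only. The entry representation (dicts with `gloss` and `lemma` keys) and the function name `find_duplicates` are assumptions, not SLPAA's actual data model, and matching here is exact and case-sensitive.

```python
from collections import Counter


def find_duplicates(entries, field):
    """Return the sorted values of `field` that occur more than once.

    Hypothetical helper: `entries` is assumed to be a list of dicts.
    Entries with a missing or empty value for `field` are ignored.
    """
    counts = Counter(entry[field] for entry in entries if entry.get(field))
    return sorted(value for value, n in counts.items() if n > 1)


merged = [
    {"gloss": "HOUSE", "lemma": "house"},
    {"gloss": "TREE", "lemma": "tree"},
    {"gloss": "HOUSE", "lemma": "home"},
]
print(find_duplicates(merged, "gloss"))  # ['HOUSE']
print(find_duplicates(merged, "lemma"))  # []
```

Both entries stay in the merged corpus in this sketch; the lists are only a report to back-check against.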
