elexis-eu / lexonomy

A cloud-based, open-source system for writing and publishing dictionaries.
http://www.lexonomy.eu/
MIT License

Auto-numbering feature #273

Closed iztokkosem closed 2 years ago

iztokkosem commented 2 years ago

We have been looking at this feature and we are unsure how exactly it works (it would be good to have some documentation), namely:

I would think the most useful approach is for the user to choose whether to use auto-numbering for their dictionary, after which it would apply automatically to every newly added element.

One related thing: Carole said the auto-numbering feature might have caused a bug where a dictionary would not reindex.

iztokkosem commented 2 years ago

We urgently need to discuss this feature as I believe it is very important for dictionary linking in Lexonomy, and hence for ELEXIS Linking. Namely, as Adam explained, the current IDs for auto-numbering are composed of a headword and sense number, which means they might not be unique (in the case of homonyms).

Furthermore, there is a problem with how this numbering works, and integrates with externally imported IDs.

In any case, this has become a rather urgent matter, to be discussed at the next meeting.

mjakubicek commented 2 years ago

It is only a problem for homonyms if you have them as separate entries, if you keep them within one entry (which I think we should advise people to do generally), it is ok.

rambousek commented 2 years ago

Alternatively, we can use "entry ID + sense ID" or "headword + pos + sense ID".

On Mon, 23 May 2022 at 11:24, Milos Jakubicek @.***> wrote:

It is only a problem for homonyms if you have them as separate entries, if you keep them within one entry (which I think we should advise people to do generally), it is ok.


mjakubicek commented 2 years ago

Wait -- "headword + pos + sense ID" (sense ID being the sense number) is the current status, or not? It should be possible to set this up generally (I think we discussed this), using the same notation that is used elsewhere, the default being "%(headword)-%(pos)-%(sense_nr)" or something like that.
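For illustration, a minimal sketch of how such a template could be expanded. The function name `expand_id_template`, the sample entry values and the error handling are assumptions made for this example, not Lexonomy's actual implementation; only the template string comes from the comment above.

```python
import re

# Default template as quoted in the comment above.
DEFAULT_TEMPLATE = "%(headword)-%(pos)-%(sense_nr)"


def expand_id_template(template, values):
    """Replace each %(name) placeholder with the matching value (hypothetical helper)."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than emit an ID with a hole in it.
            raise KeyError(f"no value for placeholder {name!r}")
        return str(values[name])

    return re.sub(r"%\((\w+)\)", substitute, template)


# Sample values are invented for the example.
sense_id = expand_id_template(
    DEFAULT_TEMPLATE, {"headword": "bank", "pos": "noun", "sense_nr": 2}
)
print(sense_id)  # bank-noun-2
```

A dictionary that needs an extra disambiguator could, under the same assumption, use a template with one more placeholder.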

KCMertens commented 2 years ago

In the new system I've made it so that whenever an entry is saved (both new and existing entry update) the entire backend process is run from scratch (this was already mostly the case, but now it's exactly the same for new entry, updating entry, re-indexing entry [when config or subentries changed]), this means:

We can add an extra step for autonumbers and linking here. A step that checks and fixes invalid autonumbers and linkables is easy to add; I just need to know the constraints of the system.

iztokkosem commented 2 years ago

This is great, Koen. My main worry remains how the IDs are formed. I can already think of several examples where we have the same headword (form) and the same part of speech. This was actually one of the main reasons we moved to numbers only for the Slovenian data. Plus, using the sense number makes little sense, as you can change the sense order and then the number is only misleading.

I would use numbers for any element selected for auto-numbering. That is much easier to control, and can be integrated with external solutions.

Perhaps we should also ask ELEXIS partners on how they form IDs for dictionary entries and senses. I think Lexonomy should also support the option to adopt the external IDs, and in that way simply having numbers in ids works easier - you pick the number that the numbering should start with.

mjakubicek commented 2 years ago

I strongly disagree with this approach, sorry. Using natural IDs must be enforced whenever possible. We discussed this issue of homographs multiple times and I really think this should be avoided -- and we did avoid it in the LEXIDMA model, by the way. On the presentation layer, things may be visualized arbitrarily, but in the data model/database, same headword and same PoS = same entry; and if not, one shall provide another key as a disambiguator, so that there is always a unique natural key.

The autonumbering feature for senses is a poor man's solution we added for the cases where you've got some sort of rudimentary senses that do not provide any disambiguators (explanations, glosses, definitions, whatever...). It is completely fine that the numbers are unrelated to ordering and potentially incomplete. A lemma-pos combination may have sense numbers 3, 7 and 10, visualized in exactly the opposite order, e.g. by frequency (and numbered 1, 2 and 3 for the end user, if anyone thinks users would appreciate the numbers).
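A small sketch of that separation between stored sense numbers and what the end user sees. The field names and the frequency values are invented for the example; only the sense numbers 3, 7 and 10 come from the paragraph above.

```python
# Stored sense numbers are stable identifiers; they say nothing about order.
# The frequency values are made up purely for illustration.
senses = [
    {"sense_nr": 3, "frequency": 12},
    {"sense_nr": 7, "frequency": 45},
    {"sense_nr": 10, "frequency": 300},
]


def display_numbering(senses):
    """Return (display number, stored sense number) pairs, most frequent first."""
    ranked = sorted(senses, key=lambda s: s["frequency"], reverse=True)
    return [(position, s["sense_nr"]) for position, s in enumerate(ranked, start=1)]


print(display_numbering(senses))  # [(1, 10), (2, 7), (3, 3)]
```

Links would keep pointing at the stored numbers, so re-ranking the senses changes only the display numbers.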

I'm not quite sure what you mean by:

I would use numbers for any element selected for auto-numbering. That is way easier to control, and is integratable with external solutions.

But keep in mind that those numbers might be referenced in other dictionaries, so they shouldn't be changed unless strictly necessary. Generally, they should be seen as last-resort unique identifiers and that's it.

mjakubicek commented 2 years ago

  • check flagging (flag is now a column in database for faster searching)

Great, this was long on the agenda. Can you please reference the commit that fixes this?

rambousek commented 2 years ago

TEI dictionaries and Elexified dictionaries use the xml:id attribute with a "dictionary - headword - pos - sense number" combination, e.g. xml:id='SLB_absurdnost_1_noun_1', which is unique enough.

iztokkosem commented 2 years ago

I'm summarizing the conclusions of our discussion at the meeting:

Others, please add any info I might have forgotten.

mjakubicek commented 2 years ago

Reading this raises lots of questions:

I'm summarizing the conclusions of our discussion at the meeting:

  • auto-numbering needs to be made automatic, so that once activated, it adds a number to the selected id for each newly added element.

This was always meant as one-off auto-numbering. Making this a continuous process is extremely dangerous and almost never desirable, and should certainly be, if implemented at all, only optional. The only way this is currently used is linking, and before linking, the users should be given this option (when needed).

The most critical aspect, though, is that numbers are never reused (so always auto-increment from the last generated number, not from the last one used in the data).
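A minimal sketch of the "never reuse" rule, assuming the last generated number is stored separately from the entry data. The class name `SenseNumberAllocator` is hypothetical, and a real implementation would need to persist the counter.

```python
class SenseNumberAllocator:
    """Hands out numbers based on the last one generated, so none is ever reused."""

    def __init__(self, last_generated=0):
        # Assumption: this value would be persisted alongside the dictionary.
        self.last_generated = last_generated

    def next_number(self):
        self.last_generated += 1
        return self.last_generated


allocator = SenseNumberAllocator()
numbers_in_data = [allocator.next_number() for _ in range(3)]  # [1, 2, 3]

# The user deletes sense 3; the data now only contains 1 and 2.
numbers_in_data.remove(3)

# Incrementing the highest number still present in the data would hand out 3 again.
naive_next = max(numbers_in_data) + 1

# Incrementing the last generated number gives 4, so old links to 3 stay dead.
new_number = allocator.next_number()
print(naive_next, new_number)  # 3 4
```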

  • we provide both options so just a number and a combination of different elements. Of course, the headword+pos+sense-number option goes for senses only, so to make it generic, the user needs to predefine which element values to use. As we discussed, there are many potential problems if we want to keep the id stable, i.e. non-Latin characters, problems with adjusting for changes in the entries (e.g. changing sense order, changing pos info) etc., so this human-friendly form comes with some caveats for the users.

The problem here is that what people think is a benefit of artificial (non-natural) IDs is actually their biggest disadvantage. If you have a link based on PoS and you change the PoS, you do want the link to break, so that someone checks that it still makes sense (chances are very high that it will not). Working with datasets like wordnets that use artificial IDs all over the place is a good nightmare example of where this leads: people think everything is ok while underneath the data deteriorates without anyone noticing (been there, seen that, sorry).

Non-Latin characters are just a technical issue that is easy to handle (via URL encoding or some form of transliteration).
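As a sketch of the URL-encoding option, using Python's standard `urllib.parse`. The helper name `encode_id_part` and the sample headword are assumptions for the example. Percent-encoding is reversible, which transliteration generally is not.

```python
from urllib.parse import quote, unquote


def encode_id_part(text):
    """Percent-encode one component of an ID (hypothetical helper).

    safe="" also encodes "/", so the result cannot be mistaken for a path.
    """
    return quote(text, safe="")


headword = "любовь"  # sample non-Latin headword, chosen for illustration
encoded = encode_id_part(headword)

print(encoded.isascii())             # True
print(unquote(encoded) == headword)  # True
```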

I'm not sure what you mean by "both options"

  • If choosing the numbering option, the user can also enter the value from which the numbering in a certain element starts.
  • based on all these shortcomings, we need to use numbers for ELEXIS linking purposes.

Reading this makes me wonder whether you actually intend the numbers to work across senses -- this is certainly NOT what this feature was meant for. The number should be local for (e.g.) a given headword-pos, not across the whole dictionary; but always local with regard to the closest parent.
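A sketch of numbering that is local to the closest parent, assuming the parent is identified by a (headword, pos) pair. The class name `ScopedNumbering` and the sample keys are hypothetical.

```python
from collections import defaultdict


class ScopedNumbering:
    """Keeps one independent counter per parent instead of one per dictionary."""

    def __init__(self):
        # Maps a parent key to the last number generated under that parent.
        self._last_generated = defaultdict(int)

    def next_number(self, parent_key):
        self._last_generated[parent_key] += 1
        return self._last_generated[parent_key]


numbering = ScopedNumbering()
first_noun = numbering.next_number(("bank", "noun"))
second_noun = numbering.next_number(("bank", "noun"))
first_verb = numbering.next_number(("bank", "verb"))
print(first_noun, second_noun, first_verb)  # 1 2 1
```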

Not sure if I got it right, but if you mean that the ELEXIS link will be based only on numbers (so a link is a pair of two random "IDs"), that is a very, very, very bad decision that will bounce back eventually if it is actually implemented so.

simonkrek commented 2 years ago

I'm not sure I understand all of Milos' arguments, but I believe that in the world of the semantic web (and (ELEXIS) linking) unique entry and sense identification is rather important. As far as I know, Wikidata uses IDs (Lexeme, Item), while Wikipedia and DBpedia use character strings. On the other hand, the Slovenian portal Fran uses a combination of a numbered ID and a string, e.g. ljubezen. The 133 represents a resource ID which is followed by a human-readable string, followed by the ID of the entry and a human-readable headword. Perhaps a combination can ensure stability on the one hand, and flexibility and readability on the other.