Implement unique IDs for corpus entries

kvesik commented 2 years ago

Currently the unique key for each entry is its gloss. However, since we might have a lemma, several actual productions, etc., of the same sign, the gloss will not be sufficient going forward. Sign-level info for each corpus entry should display a unique, auto-generated, non-editable ID assigned to the entry. Users can specify in global settings whether the ID should follow a particular format, number of digits, etc.

kvesik commented 2 years ago

@kchall should users be able to specify a particular format (e.g. SIGN0002) for unique IDs, or just the number of digits used?

kchall commented 2 years ago

Yes, I had been thinking that we'd allow quite a bit of flexibility, so that e.g. users can ensure that their IDs are also likely to be unique across corpora and useful for other purposes, like identifying the contents.

I was thinking of global options along the following lines.

Auto-generate based on:

[ ] current date

[ ] coder [taken from metadata]
    [ ] Name
    [ ] Initials
    [ ] Identifier 
    [ ] Other; enter text: 

[ ] signer [taken from metadata]
    [ ] Name
    [ ] Initials
    [ ] Identifier 
    [ ] Gender
    [ ] Date of Birth
    [ ] Age
    [ ] Language
    [ ] Other; enter text: 

[ ] source [taken from metadata]
    [ ] Name
    [ ] Initials 
    [ ] Identifier
    [ ] Other; enter text: 

[ ] recording [taken from metadata]
    [ ] Signer
    [ ] Location
    [ ] Source
    [ ] Date 
    [ ] Age
    [ ] Language
    [ ] Other; enter text: 

[ ] sequential number in corpus
    [ ] enter number of digits to include (system will add leading zeros): ____
    [ ] manually enter starting number: ____

[ ] additional text: enter _____

Select element delimiter:

( ) - (hyphen)
( ) _ (underscore)
( ) . (period)
( ) (none)

Select date format:

( ) YYYY-MM-DD
( ) YYYYMMDD
( ) YYYY-MM
( ) YYYYMM
( ) YYYY

Not sure if there's an easy way for people to also indicate the order they want the elements in, but maybe they could type in a number giving each element's position instead of ticking checkboxes?

e.g. if _ is the delimiter and the elements are (1) sequential number with 4 digits, (2) current full date, and (3) source: 0001_2022-02-11_CD-ASL
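
The example above can be sketched in Python. This is only an illustration of the proposed options, not the SLPAA implementation: the function `build_entry_id`, its parameters, and the `DATE_FORMATS` table are hypothetical names made up for this sketch.

```python
from datetime import date

# Hypothetical mapping from the date-format options listed above
# to the equivalent strftime patterns.
DATE_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYYMMDD": "%Y%m%d",
    "YYYY-MM": "%Y-%m",
    "YYYYMM": "%Y%m",
    "YYYY": "%Y",
}


def build_entry_id(counter, created, source, order,
                   delimiter="_", digits=4, date_format="YYYY-MM-DD"):
    """Join the selected elements, in the user's chosen order, into one ID string."""
    elements = {
        "counter": str(counter).zfill(digits),  # pad with leading zeros
        "date": created.strftime(DATE_FORMATS[date_format]),
        "source": source,
    }
    # `order` lists only the elements the user selected, in display order.
    return delimiter.join(elements[name] for name in order)


print(build_entry_id(1, date(2022, 2, 11), "CD-ASL",
                     order=["counter", "date", "source"]))
# prints 0001_2022-02-11_CD-ASL
```

Passing the order as a list of element names covers both "which elements are included" and "in what order", which is roughly what typing position numbers instead of ticking checkboxes would give.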

kvesik commented 2 years ago

From 20220321 theory meeting:

There are two purposes for this kind of identifier:

1. For the software to have a unique way to identify each entry (in a corpus).
2. For the user to have a way to label/identify each entry - e.g. a fieldwork type of system.

If a user is designing special formatting for their labels/IDs, the resulting IDs are likely to be unique (within one corpus as well as across many/all), but this is not guaranteed.

If we use something like a to-the-microsecond timestamp (either on its own or in combination with a user-defined format), then the IDs are all but guaranteed to be unique across all corpora and/or the universe. :)

I suspect the best approach would be to have a visible (but non-editable) ID whose format is user-defined, along with an invisible underlying ID that the software can use as a unique key for each entry.
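
A minimal sketch of that two-identifier idea, under stated assumptions: the `CorpusEntry` class and its field names are hypothetical, and a random UUID stands in for the microsecond timestamp mentioned above (either would do as the hidden key).

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    """Hypothetical entry carrying two identifiers."""
    # User-formatted label: shown in the UI but not editable.
    display_id: str
    # Internal key: generated once, never shown, independent of user settings.
    key: str = field(default_factory=lambda: uuid.uuid4().hex)


# Two entries may (accidentally) share a display ID and still be told apart.
a = CorpusEntry("0001_2022-02-11_CD-ASL")
b = CorpusEntry("0001_2022-02-11_CD-ASL")
entries = {entry.key: entry for entry in (a, b)}
print(len(entries))  # prints 2
```

Because the software only ever looks entries up by `key`, the user-facing format can change, or even collide, without affecting how entries are stored.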

kvesik commented 7 months ago

Note for documentation: only the "counter" (sequential numbering) attribute of EntryID will be stored with the entry id object. All other contentful attributes (date created, coder info, etc) and settings (delimiter, date format, etc) will be pulled on the fly from either the parent Sign (for attributes) or the app QSettings (for settings).

In particular, this means that if a user changes (e.g.) the coder field for a particular sign entry, then the EntryID will change as well, assuming that coder is in fact one of the visible attributes in the EntryID display string.
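
The behaviour described in this note can be illustrated roughly as follows. This is a simplified sketch, not the actual SLPAA classes: the `Sign` stand-in, the `SETTINGS` dict (used here in place of the app QSettings), and the `display_string` method name are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sign:
    """Hypothetical stand-in for the parent sign; only the coder is modelled."""
    coder: str


# Hypothetical stand-in for the app settings (the real app reads QSettings).
SETTINGS = {"delimiter": "_", "digits": 4, "show_coder": True}


class EntryID:
    """Stores only the counter; everything else is looked up when displayed."""

    def __init__(self, counter, parent_sign):
        self.counter = counter
        self.parent_sign = parent_sign

    def display_string(self):
        # Settings and sign attributes are read at call time, not cached.
        parts = [str(self.counter).zfill(SETTINGS["digits"])]
        if SETTINGS["show_coder"]:
            parts.append(self.parent_sign.coder)
        return SETTINGS["delimiter"].join(parts)


sign = Sign(coder="KV")
entry_id = EntryID(7, sign)
print(entry_id.display_string())  # prints 0007_KV

sign.coder = "KCH"  # editing the sign changes the displayed ID
print(entry_id.display_string())  # prints 0007_KCH
```

Only `counter` would need to be saved with the entry; the rest of the string is rebuilt on each call, which is why it tracks later edits to the sign or to the settings.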

PhonologicalCorpusTools / SLPAA

Implement unique IDs for corpus entries #18