[Feature Request]: Unique ID for each `SequenceInterval` / `SequencePoint`

JoFrhwld commented 11 months ago

What feature would you like added?

Each SequenceInterval and SequencePoint should get assigned a unique id. I've briefly looked at the modules hashlib and uuid, and I think uuid.uuid4() might be the way to go.

hashlib possibility

We could add a new property to the PrecedenceMixins mixin (which is inherited by both SequenceInterval and SequencePoint) like this:

import hashlib
import pickle

@property
def id(self):
    return hashlib.sha1(pickle.dumps(self)).hexdigest()

Upside: easy to add into the mixins.
Downsides:
- The returned hash is deterministic from the contents of self as serialized by pickle.dumps. There is a possibility that more than one Sequence* could have the same content, but it's a little unlikely, as timing information (.start and .end for intervals and .time for points) is part of the content.
- The hash would change as other changes are made to the interval...
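
Both downsides can be seen in a minimal sketch. ToyInterval is a hypothetical stand-in, not the real SequenceInterval, and it hashes a pickled tuple of its fields rather than self so that the sketch does not depend on where the class is defined:

```python
import hashlib
import pickle


class ToyInterval:
    """Hypothetical stand-in for SequenceInterval, for illustration only."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label

    @property
    def id(self):
        # Assumption: hash the pickled field values instead of pickling self.
        contents = (self.start, self.end, self.label)
        return hashlib.sha1(pickle.dumps(contents)).hexdigest()


a = ToyInterval(0.0, 0.5, "AE1")
b = ToyInterval(0.0, 0.5, "AE1")        # separate object, identical contents

same_content_collides = a.id == b.id    # the hash cannot tell a and b apart

before = a.id
a.label = "AA1"                         # an ordinary edit to the interval
id_changed_after_edit = a.id != before  # the id did not survive the edit
```

Two intervals with identical contents share an id, and a relabelled interval gets a new one.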

uuid

We could add the following to the __init__ for both SequenceInterval and SequencePoint.

def __init__(self, ...):
   ...
   self.id = uuid.uuid4()

Upsides:
- The code looks a lot nicer and is straightforward to understand.
- The resulting uuids are also kind of nicer to look at.
- Returns a UUID object which has additional properties and methods. (str(self.id) for output)
Downsides:
- I'm unsure if there's a significant time cost to running uuid.uuid4() on every __init__ versus the hashlib @property approach.
- We definitely don't want to incorporate it as a @property, since it would generate a new uuid every single time .id is called!
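
A small sketch of the difference between storing the uuid in __init__ and generating it in a property. ToyPoint and regenerated_id are hypothetical names used only for this comparison:

```python
import uuid


class ToyPoint:
    """Hypothetical stand-in for SequencePoint, for illustration only."""

    def __init__(self, time, label):
        self.time = time
        self.label = label
        self.id = uuid.uuid4()      # generated once, when the object is created

    @property
    def regenerated_id(self):
        # Anti-pattern: a fresh UUID is produced on every access.
        return uuid.uuid4()


p = ToyPoint(1.25, "H*")
first_id = p.id
p.label = "L*"                                   # editing the point ...
stable_after_edit = p.id == first_id             # ... leaves the stored id alone
property_is_unstable = p.regenerated_id != p.regenerated_id
id_as_text = str(p.id)                           # 36-character form for output
```

The time cost could be measured with something like timeit.timeit(uuid.uuid4, number=100_000). No timing is claimed here.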

Additional thoughts

The uuid approach doesn't have immediate implications for the fuse_* methods, since any properties not directly modified by the fuse methods remain in place. The uuid of the new interval will be the uuid of the interval which called .fuse_*.
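
As a rough illustration, assuming a fuse operation that mutates the calling interval in place: fuse_with below is a hypothetical, simplified stand-in and does not reproduce the signature of the real .fuse_* methods.

```python
import uuid


class ToyInterval:
    """Hypothetical stand-in for SequenceInterval, for illustration only."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label
        self.id = uuid.uuid4()

    def fuse_with(self, other):
        # Simplified fusion: extend self over `other` and join the labels.
        # self.id is never touched, so the caller's uuid survives.
        self.end = other.end
        self.label = " ".join([self.label, other.label])


left = ToyInterval(0.0, 0.4, "the")
right = ToyInterval(0.4, 0.9, "cat")
left_id = left.id
left.fuse_with(right)
id_kept = left.id == left_id
```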

What would the use case be for this feature?

When thinking about methods that might return data frames (#50), having an up-front unique id for each sequence would be useful for any down-the-line data frame operations (joining, etc).
Looking ahead to an eventual fave-extract, which may produce multiple output files (formant tracks, point values, etc), having a uuid available will definitely be useful.
aligned-textgrid seems like a good place to put a uuid, rather than replicating it more than once in down-stream workflows.

Would you like to help add this feature?

Yes, and I will submit a pull request soon.

Code of Conduct

[X] I agree to follow this project's Code of Conduct

JoFrhwld commented 11 months ago

@chrisbrickhouse I'm not going to start working on this today. If you have opinionated thoughts about it, please share!

chrisbrickhouse commented 11 months ago

Well, Python guarantees that id(<instance>) is a persistent and unique ID across the lifetime of an instance (for CPython this is the memory address). Would this be sufficient? I think using the internal ID is better than rolling our own system unless we need it to be semantic or user-friendly.

Presumably we could do something like:

@property
def id(self):
    return id(self)

This should guarantee a unique ID regardless of content and without needing additional imports. The only time we'd get overlapping id() for two objects is if one is deleted before the other is created. That may be bad depending on the use case, but it seems from the use cases listed that the id is needed mostly for instantaneous operations/uniqueness conditions, not long term tracking.
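
A minimal sketch of what id() does and does not give us. ToyInterval is a hypothetical stand-in, and copy.deepcopy stands in for a write-then-reload round trip:

```python
import copy


class ToyInterval:
    """Hypothetical stand-in for SequenceInterval, for illustration only."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label

    @property
    def id(self):
        return id(self)


intervals = [ToyInterval(i, i + 1, "x") for i in range(1000)]
# All 1000 objects are alive at once, so their ids must be distinct.
all_distinct = len({iv.id for iv in intervals}) == len(intervals)

original = intervals[0]
before = original.id
original.label = "y"
stable_after_edit = original.id == before     # edits do not change the id

# A copy is a new object, so it cannot share the id of the live original.
reloaded = copy.deepcopy(original)
survives_round_trip = reloaded.id == original.id
```

The id is unique among live objects and survives edits, but a reloaded copy of the same interval gets a different value.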

JoFrhwld commented 11 months ago

That's a good point, and looking at the values that id() returns, they're much shorter, which is good for reducing visual overwhelm.

> mostly for instantaneous operations/uniqueness conditions, not long term tracking.

I was thinking a bit more generally. For example, fave-classic currently outputs two files if --tracks is set:

XYZ.txt : The point measurements
XYZ.tracks : The full formant tracks

fave-classic doesn't provide any cross-identifying ID for entries across the two files, which is what I'm aiming for.

I'm also thinking about possibilities further down the line, like if people wanted to do multiple different analyses, each written to a different output but cross-referenceable. There might even be a longer term goal of making the ids stable outside of an individual session (like writing to json or yaml of some kind), but that's further down the road.
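
To make the cross-referencing idea concrete, here is a sketch with made-up rows. The column names and values are assumptions for illustration, not fave-classic's actual output format:

```python
import uuid

# Hypothetical ids, one per vowel token.
vowel_ids = {"AE1": str(uuid.uuid4()), "IY1": str(uuid.uuid4())}

# Rows destined for a point-measurement file (one row per vowel).
point_rows = [
    {"id": vowel_ids["AE1"], "label": "AE1", "F1": 700.0},
    {"id": vowel_ids["IY1"], "label": "IY1", "F1": 300.0},
]

# Rows destined for a tracks file (several samples per vowel).
track_rows = [
    {"id": vowel_ids["AE1"], "time": 0.10, "F1": 690.0},
    {"id": vowel_ids["AE1"], "time": 0.11, "F1": 705.0},
    {"id": vowel_ids["IY1"], "time": 0.50, "F1": 310.0},
]

# Join the two outputs on the shared id.
points_by_id = {row["id"]: row for row in point_rows}
joined = [
    {"label": points_by_id[t["id"]]["label"], "time": t["time"], "F1": t["F1"]}
    for t in track_rows
]
```

With the id written to both outputs, the same join works in any downstream tool (for example, a data frame merge on the id column).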

chrisbrickhouse commented 11 months ago

So, for clarity, I've been considering this as largely a graph problem. Interval and point objects are leaf nodes and tiers are root nodes above them (there are more connections, but for the moment it's a sufficient model). The goal is, for every leaf node, to have a Universally Unique IDentifier which is:

Universal: persistent within a session and across sessions
Unique: the probability that two leaf nodes have the same ID is functionally 0
IDentifier: the symbol indexes a particular piece of data

The UUID library gets us close, but because of how our software works, the UUIDs will never be persistent across sessions. Hashing the contents is potentially universal, but there's a major failure mode: the contents are mutable even within a session. If a user changes, e.g., the label or start/end times, we don't want the ID to change but the hash of the contents will change.

Instead, what if the responsibility for assigning UUIDs was put on the root nodes rather than the leaf nodes? We can guarantee uniqueness because every root will have knowledge of every leaf it commands. We can guarantee universality by creating an assignment scheme that is reproducible. As a simple example, roots could assign sequential integers to their children. If a child is added, its UUID is the next integer. If a child is removed, the integer will already have been moved past and won't get assigned again. As long as the TGs are parsed the same across sessions, the IDs should reproduce. We could add in, for example, the input file name hash as an offset to ensure that the integer sequences start at unique positions, or simply not guarantee uniqueness across sessions.
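
A rough sketch of that scheme. ToyTier, ToyLeaf, and the offset parameter are hypothetical names, not part of the existing code:

```python
import itertools


class ToyLeaf:
    """Hypothetical interval/point; it receives its id from the tier."""

    def __init__(self, label):
        self.label = label
        self.id = None


class ToyTier:
    """Hypothetical root node that hands out ids to its children."""

    def __init__(self, labels, offset=0):
        # `offset` could be derived from a hash of the input file name.
        self._counter = itertools.count(offset)
        self.children = []
        for label in labels:
            self.add(label)

    def add(self, label):
        leaf = ToyLeaf(label)
        leaf.id = next(self._counter)   # next unused integer
        self.children.append(leaf)
        return leaf

    def remove(self, leaf):
        # The counter is not rewound, so the removed id is never reissued.
        self.children.remove(leaf)


tier = ToyTier(["DH", "AH0", "K"])
ids_after_parse = [leaf.id for leaf in tier.children]
tier.remove(tier.children[1])
new_leaf = tier.add("AE1")              # gets the next integer, not the freed one
reparsed_ids = [leaf.id for leaf in ToyTier(["DH", "AH0", "K"]).children]
```

Parsing the same labels twice yields the same ids, and the id freed by the removal is skipped rather than reused.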

JoFrhwld commented 11 months ago

I like that! In the spirit of ids as tree locations, maybe there should be a "relative" id (integer underneath a branch) and an "absolute" id (concatenation of the full path).
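
Something like the following, as a sketch. ToyNode is hypothetical, the "-" separator is an arbitrary choice, and removal is ignored here for brevity:

```python
class ToyNode:
    """Hypothetical tree node (textgrid, tier, or interval)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        # Relative id: position under the parent at the time of insertion.
        self.relative_id = len(parent.children) if parent is not None else 0
        if parent is not None:
            parent.children.append(self)

    @property
    def absolute_id(self):
        # Absolute id: relative ids joined along the path from the top node.
        parts = []
        node = self
        while node is not None:
            parts.append(str(node.relative_id))
            node = node.parent
        return "-".join(reversed(parts))


textgrid = ToyNode("textgrid")
words = ToyNode("words", parent=textgrid)
phones = ToyNode("phones", parent=textgrid)
first_phone = ToyNode("DH", parent=phones)
second_phone = ToyNode("AH0", parent=phones)
```

Here second_phone.relative_id is 1 and second_phone.absolute_id is "0-1-1".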

The one place it might run into trouble, and we might just have to decide this is out of scope, is if a TextGrid is read in, a .fuse_*() is done, and then the TextGrid is saved. When that TextGrid is re-parsed, the ids after the fused interval will all be off-by-one within the containing root. Maybe the only thing to do is add a caveat in the docs, or a warning.
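
A toy version of the problem, where assign_ids is a hypothetical stand-in for the parse step and the labels are made up:

```python
def assign_ids(labels):
    """Hypothetical parse step: number the intervals of one tier in order."""
    return {i: label for i, label in enumerate(labels)}


first_session = assign_ids(["the", "big", "cat", "sat"])
# Fuse "big" and "cat" into one interval, save, then parse the saved file again.
second_session = assign_ids(["the", "big cat", "sat"])

sat_before = [i for i, label in first_session.items() if label == "sat"][0]
sat_after = [i for i, label in second_session.items() if label == "sat"][0]
```

Everything before the fused interval keeps its id, while "sat" moves from 3 to 2.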

chrisbrickhouse commented 11 months ago

I hadn't thought about fuse... and that actually would apply to any insert or remove operation. I think what we're running up against is that unless we create our own file format to store metadata, we can't guarantee that IDs will persist across sessions because there's no invariant object to base it on (or at least I can't identify one).

> In the spirit of ids as tree locations, maybe there should be a "relative" id (integer underneath a branch) and an "absolute" id (concatenation of the full path)

Well, my larger thought is that a lot of these relationships can be abstracted into an adjacency matrix. Basically, if we shunt ID assignment high enough, we can just store these instances as values with IDs as keys in a global hash table, and all the relationship information and operations can be done on the matrix rather than the instances themselves.
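
Very roughly, and only as a sketch of the idea (Registry and its methods are hypothetical, and only parent-child edges are modelled):

```python
import itertools


class Registry:
    """Hypothetical global table: ids map to objects, edges live beside them."""

    def __init__(self):
        self._next_id = itertools.count()
        self.objects = {}       # id -> instance
        self.children = {}      # id -> list of child ids (sparse adjacency)

    def register(self, obj, parent_id=None):
        new_id = next(self._next_id)
        self.objects[new_id] = obj
        self.children[new_id] = []
        if parent_id is not None:
            self.children[parent_id].append(new_id)
        return new_id

    def adjacency_matrix(self):
        # Dense form: row i, column j is 1 when j is a child of i.
        ids = sorted(self.objects)
        return [[1 if j in self.children[i] else 0 for j in ids] for i in ids]


registry = Registry()
tier_id = registry.register("phones tier")
dh_id = registry.register("DH", parent_id=tier_id)
ah_id = registry.register("AH0", parent_id=tier_id)
matrix = registry.adjacency_matrix()
```

Here the instances are plain strings for brevity. In practice they would be the tier, interval, and point objects, and other relationships (precedence, for example) could get their own matrices.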

Forced-Alignment-and-Vowel-Extraction / alignedTextGrid