dominikbraun commented 3 years ago

The problem

When renaming a project, the project file - whose name corresponds to the project key - will be renamed as well (see #81). The project key is also stored for each record associated with that project. At the moment, when renaming the project, all project keys in the record files have to be renamed. In other words, renaming a project involves traversing and editing all record files since the project key is used to associate the record withe the project!

This is bad and doesn't scale well. I strive for a solution with O(1) runtime complexity for common scenarios.

The solution

The obvious solution to this problem is to separate the reference stored in the records from the project name. Such a separation can be accomplished by generating an ID for each project, which could be a simple hash value like a3c03f8f. Instead of using the project name in the record files, this ID would be used.

I've found three approaches to implement this.

ID as filename, ID as reference

The easiest approach is to not store a project like make-coffee as make-coffee.json and refer to it as make-coffee, but to store it as a3c03f8f.json and refer to it as a3c03f8f instead.

Advantages

This approach is simple to implement.

Disadvantages

When starting tracking using timetrace start make-coffee, the project ID for make-coffee has to be determined in order to store it in the record file. This involves traversing all project files, opening them and checking their key. With a runtime complexity of O(n), this approach is out of question.

Project key as filename, ID as reference

Another approach is to store the project make-coffee as make-coffee.json, refer to it as a3c03f8f in record files and maintain a mapping of the project ID against the project key. For example, there would be a file called a3c03f8f which would contain make-coffee as project key.

Advantages

When displaying a record that stores a project reference like a3c03f8f, looking up the project data is an O(1) operation.

Disadvantages

On the other hand, when starting tracking using timetrace start make-coffee, looking up the project ID for make-coffee in order to refer to it in the record file still is an O(n) operation. This is also true for other commands like create record or get project.

ID as filename, project key as reference

The previous approach can also be reversed: The make-coffee project can be stored as a3c03f8f.json and referred as make-coffee in the record files while maintaining a mapping of the project key against the project ID. For example, there would be a file called make-coffee containing the ID a3c03f8f.

Advantages

Looking up the project ID for a given key, for example when starting tracking or displaying a project, is an O(1) operation. Additionally, timetrace internally could exclusively use generated IDs, and project keys would merely be an "alias" or human-readable pointer to a project - just like a Git tag is a human-readable pointer to a commit.

In theory, this approach would still result in an O(n) complexity when projects have to be looked up the other way around, i.e. when finding the project name for a given ID. The timetrace list records command would be an example for this, as the project name for each reference stored in the records has to be displayed. In practice however, under the above mentioned assumption that timetrace internally exclusively uses IDs, the project can be loaded directly by its ID without any indirection.

The result is that this approach scales with O(1) for all operations.

Disadvantages

At this time, I don't see any drawbacks.

Conclusion

My suggestion is to implement project references as explained in the third approach.

PTAL @aligator @joshuaherrera

jwnpoh commented 2 years ago

Hello, with respect to to the disadvantage of approach 1, has a SQL-like implementation been considered? Without necessarily migrating to a database, we could implement something similar by maintaining a projectsDB.json file that contains a single json object of key:value mapping all project keys with their IDs. So instead of traversing all project files, opening them to determine their project key, we just need to open one file, unmarshal the json into a map in Go from which we can retrieve the project key.

projectsDB.json:

{
    "make-coffee" : "a3c03f8f",
    "make-tea" : "b4d14g9g",
    "another-project" : "c5e25h0h"
}

timetrace start make-coffee would thus have our program look into projectsDB.json and retrieve a3c03f8f using the make-coffee project key.

dominikbraun commented 2 years ago

Hello, with respect to to the disadvantage of approach 1, has a SQL-like implementation been considered? Without necessarily migrating to a database, we could implement something similar by maintaining a projectsDB.json file that contains a single json object of key:value mapping all project keys with their IDs. So instead of traversing all project files, opening them to determine their project key, we just need to open one file, unmarshal the json into a map in Go from which we can retrieve the project key.

@jwnpoh Yes, this is what I was thinking of. Would there be any difference between your approach and approach #3?

The previous approach can also be reversed: The make-coffee project can be stored as a3c03f8f.json and referred as make-coffee in the record files while maintaining a mapping of the project key against the project ID. For example, there would be a file called make-coffee containing the ID a3c03f8f.

Advantages

Looking up the project ID for a given key, for example when starting tracking or displaying a project, is an O(1) operation. Additionally, timetrace internally could exclusively use generated IDs, and project keys would merely be an "alias" or human-readable pointer to a project - just like a Git tag is a human-readable pointer to a commit.

In theory, this approach would still result in an O(n) complexity when projects have to be looked up the other way around, i.e. when finding the project name for a given ID. The timetrace list records command would be an example for this, as the project name for each reference stored in the records has to be displayed. In practice however, under the above mentioned assumption that timetrace internally exclusively uses IDs, the project can be loaded directly by its ID without any indirection.

The result is that this approach scales with O(1) for all operations.

jwnpoh commented 2 years ago

@dominikbraun It does seem to be the same. Would I be correct in understanding that with approach #3, renaming a project would still require going through all the records and changing the project key in the record files, thereby still resulting in an O(n) complexity?

dominikbraun commented 2 years ago

@jwnpoh In theory, no, because we'd store the project IDs in the records, not the project keys. Renaming a project would just mean to remove the key from the index and create a new key that points to the project ID.

In practice however, I'm not sure how this could be done backwards-compatible so that old records with keys in them still work... 🤔

dominikbraun commented 2 years ago

In practice however, I'm not sure how this could be done backwards-compatible so that old records with keys in them still work... 🤔

We could simply treat current project keys as those new numeric IDs and add the project key to the index.

For new projects:

{
    "new-project": "abc123"
}

For existing projects:

{
    "existing-project": "existing-project"
}

That way, the project ID lookup is the same.

jwnpoh commented 2 years ago

In theory, no, because we'd store the project IDs in the records, not the project keys

This makes sense. I was confused by this line in the original proposal:

The make-coffee project can be stored as a3c03f8f.json and referred as make-coffee in the record files...

Can I clarify that this disadvantage:

In theory, this approach would still result in an O(n) complexity when projects have to be looked up the other way around, i.e. when finding the project name for a given ID. The timetrace list records command would be an example for this, as the project name for each reference stored in the records has to be displayed.

results from the complexity of doing a "reverse lookup" in Go's map structure? Having to range through all the keys in the map to find a match for the value?

If so, would it make sense, to maintain a two-way map in projectsDB.json? Other than it being somewhat cumbersome and maybe requiring more code during the project initialisation and project renaming functions.

{
    "make-coffee" : "a3c03f8f",
    "a3c03f8f": "make-coffee",
    "make-tea" : "b4d14g9g",
    "b4d1g9g" : "make-tea",
    "another-project" : "c5e25h0h",
    "c5e25h0h": "another-project"
}

When renaming a project, we first lookup the ID using the reference as the key, store the ID in a variable and set the new key, then use the ID stored in the variable as the key to lookup the reference and set the new value. For example, renaming project make-coffee to make-latte:

id := projectsDB["make-coffee"]
delete(projectsDB, "make-coffee")
projectsDB["make-latte"] = id
projectsDB[id] = "make-latte"

There'll probably need to be an if statement before this to handle the existing projects:


id := projectsDB["make-coffee"]
if id == "make-coffee" {
    // some other code 
    return
}

delete(projectsDB, "make-coffee")
projectsDB["make-latte"] = id
projectsDB[id] = "make-latte"

dominikbraun commented 2 years ago

Can I clarify that this disadvantage:

In theory, this approach would still result in an O(n) complexity when projects have to be looked up the other way around, i.e. when finding the project name for a given ID. The timetrace list records command would be an example for this, as the project name for each reference stored in the records has to be displayed.

results from the complexity of doing a "reverse lookup" in Go's map structure? Having to range through all the keys in the map to find a match for the value?

Yes. And your approach where a two-way mapping stored would work fine - but I think it would be ok to load the individual projects by ID and display their project key.

A two-way mapping would indeed remove the need for this as the project key would be directly available. However, this two-way mapping has to be taken into account by every command that adds, removes or changes a project key, which is a slight management overhead. I think loading all projects by ID for reading their key would suffice as long as we don't run into performance issues.

dominikbraun / timetrace

Proposal: Project references #156

The problem

The solution

ID as filename, ID as reference

Advantages

Disadvantages

Project key as filename, ID as reference

Advantages

Disadvantages

ID as filename, project key as reference

Advantages

Disadvantages

Conclusion

Advantages