Only show recents from the current set of characters

will-lynas commented 8 months ago

Before this change, when using rofimoji with different sets of characters, the recent characters may include characters which aren't available in the current set.

After this change:

fdw commented 8 months ago

Hey, that's a very valid point, thank you!

I haven't looked too deeply into the PR yet, but do I read it right that there is no limit anymore in the recent file? Instead (if I didn't miss something, which is very possible), we'd read all lines from there, compare each with the current character set and then take the first n from there? That does sound a bit inefficient, doesn't it? And I'd like to avoid a file growing limitless, if possible.

Unfortunately, I don't have a good solution now, but I'm wondering if it would help if we split the "recent" file up per character set. What do you think?

will-lynas commented 8 months ago

Yes, that's correct, and yes it is quite inefficient. We could split up the "recent" file into separate files based on the hashes of the character sets. Is that an acceptable overhead for each run?

will-lynas commented 8 months ago

If this change is included, is it worth adding something to convert the old recents file to the new format?

fdw commented 7 months ago

The way I read it (and please correct me if I'm wrong), it's now hashing the whole character set and using that as a filename. I think the general idea is good, but the hashing itself probably takes some time as well, and it breaks when the data changes (which is not that seldom).

I'm wondering if we could write one recents file per data file, so for example recents/emojis_smileys_emotion. That way, it's easy to pick the recent characters for the current run. The problem is finding out which file the current selection should go to...

If we do something like this, and if it's possible, it would be quite nice to migrate the old file. But it's not necessary, I'd say.

will-lynas commented 7 months ago

I've run some tests, and with the biggest character set included in rofimoji (~42000 characters) hashing only takes ~2ms, so I don't think that on its own is an issue.

I did think about just using the filenames, which would probably be fine for the internal data files (I assume their contents don't change too much?), but someone using their own file may change it significantly. Just using the filenames means that we have to check the characters in the recents file each time, to check if they are still part of the character set, which adds some amount of complexity.

There is also the question that you bring up of whether a character should be considered recent with respect to the whole character set or wrt the file it came from. Keeping track of which file each character comes from is tricky. And I presume there is nothing stopping a character being defined in multiple files, if someone is inclined to do so.

My preference would be to keep the simplicity of hashing whole character sets, given that performance doesn't seem to be an issue. And accept the tradeoff that the recents list will occasionally go blank, and need repopulating.

fdw commented 7 months ago

Thanks for the test regarding performance 🙂 That does help.

The other problem I have with hashing the characters is that at least the distributed data files are updated more or less regularly with new characters. Meaning the old hash doesn't apply anymore, the user starts with a blank one and we have a file lying around that won't be used again. I find that at least as bad as showing the wrong recent characters.

I'm not sure there is a good solution... maybe an acceptable alternative would be to let the user specify the recent file and let them take care of it?

will-lynas commented 7 months ago

Ok, in that case I think it is best to just leave it as it is then.

fdw commented 6 months ago

Hey, I'm sorry I vanished there; I never meant to ghost you! I just had a lot of other things to do.

But I'm still interested in the PR. Just had no time to think about what to do next.

fdw commented 6 months ago

Your closing this PR finally made me work on this problem 🙈

Storing one "recents" file per character set is not feasible, because we can't merge them - was the character from this set more recently used than the first one from that set? Hashing the queried characters and using that as an identifier, while it's only small overhead (as you've shown), is impractical because the contents of one character set can and do change every once in a while, and then you'd have no recent characters, but a file that isn't used anymore. Instead, I've now opted to have one file for each "fileset", i.e. a list of all the passed files. This still has the problem that the information isn't shared between different filesets with some shared characters. However, solving that would require us to have a complete list of the recently used characters of all sets and comparing each one with the currently queried set, and that is slow. I've wondered if a real SQLite DB would help with this (and it probably would solve this part of the problem), but it's another dependency for a quite simple program, and it smells like over-engineering to me.

Hope you like it 🙂 If you'd like me to list you as co-author on this (because you've done some work here), or if you have any other ideas, feel free to write something.

fdw / rofimoji

Only show recents from the current set of characters #173