bepaald / signalbackup-tools

Tool to work with Signal Backup files.
GNU General Public License v3.0
790 stars · 38 forks

Generate searchpage to search messages in HTML export #141

Closed bepaald closed 7 months ago

bepaald commented 1 year ago

An early version of a searchpage can now be generated when exporting to HTML by adding the --searchpage option. This will create searchpage.html and searchidx.js. The latter contains all messages in the database (in threads that were exported) to enable the search. The searchpage itself requires javascript.

Issues/things that need work/notes:

This is mostly a proof of concept: I expect the search to actually find the requested messages and link to the correct page. Feedback is appreciated (I may be busy the next few weeks, so development might be slow).

bepaald commented 1 year ago

And of course, right as I create this issue I find the first problem in the search results... I'll look at it later, late for work now.

Meteor0id commented 1 year ago

Do you expect the cutoff of long messages, and also the cutoff of the number of search results, to be overcome in the future? Because for me, a search is only useful if I am actually searching everything. Just curious what you expect to be possible.

bepaald commented 1 year ago

Yes to both. I just didn't want to spend the time on it if none of this turned out to be usable. But I have fixed/implemented both now.

The long messages are just something I had forgotten about; they are technically attachments in the database. (Up until a few days ago, all --export* functions also cut them off, but apparently nobody noticed :) ). Anyway, fixed now.

I also added 'next' and 'previous' buttons to navigate through all the search results. The cutoff was just a quick solution to the fact that when you search for something silly, like 'e' or '$' with regex enabled, the browser will hang and likely crash.

I'm still not sure about this feature. Maybe it's a problem in my javascript (I'm really learning as I go here), maybe it's the size of my database (~14 MB for the search index), but the page feels slow and CPU usage is high even on my fairly recent hardware.

kohms commented 1 year ago

Wow, thanks for implementing this, it's quite a nice addition to the previous HTML representation. I tried loading my ~7 GB backup, which produced a 10.4 MB searchidx.js file. It was surprisingly fast considering that the file contained 72,554 lines. I checked the heap usage in Chrome, and the heap snapshot only took 11.1 MB. To me that's quite a nice feature for searching a single backup archive, and it covers most of my use cases already, as I can search through the entire HTML pages, even if they are paginated.

One minor thing I noticed is that German umlauts are not represented correctly on the search page. Umlauts and emoticons are not displayed correctly, and you can't search for them either (e.g., searching for "zurück" won't work, while "zur" returned the right results).

(screenshot) Displaying does work on the original page, and the text is also stored correctly in the searchidx.js file. (screenshot)

It is just a missing UTF-8 charset setting in the HTML; it works when I set it manually on the resulting HTML page. I just opened a PR at https://github.com/bepaald/signalbackup-tools/pull/142, but feel free to change/rearrange/add it to your code yourself 🙂

After the adjustment it looks like this: (screenshot)
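For reference, the missing piece is just the standard encoding declaration in the generated page's head. A minimal sketch (the actual markup produced by the export may differ):

```html
<head>
  <!-- Declare the encoding explicitly; without this the browser may fall
       back to a locale-dependent default and mis-render umlauts and emoji. -->
  <meta charset="utf-8">
  <title>Search</title>
</head>
```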

I had a look into the index itself. Sure, you can always optimize such things, e.g. deduplicate repeated strings via IDs, a bit like database normalization, but I'm not sure whether that is worth it or just overcomplicates the implementation. Search engines typically apply techniques like tokenizing and stemming, and store the positions of hits rather than the entire message (you might already know this, but lunr.js outlines the concepts at https://lunrjs.com/guides/core_concepts.html if you are interested).
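To illustrate, the "tokenize and store hit positions" idea could look something like the sketch below. This is hypothetical, not the tool's actual index format; the names and structure are invented:

```javascript
// Hypothetical inverted index: token -> list of { msgId, pos } occurrences,
// so a lookup returns hit positions instead of whole message bodies.
const messages = ["walked home", "walking the dog", "home again"];

function buildIndex(msgs) {
  const index = new Map();
  msgs.forEach((body, msgId) => {
    body.toLowerCase().split(/\s+/).forEach((token, pos) => {
      if (!index.has(token)) index.set(token, []);
      index.get(token).push({ msgId, pos });
    });
  });
  return index;
}

const idx = buildIndex(messages);
// Looking up a token returns its occurrences:
console.log(idx.get("home")); // occurrences in messages 0 and 2
```

Note that without stemming, 'walked' and 'walking' remain separate tokens, which is exactly the trade-off discussed below.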

But as mentioned, I am quite happy with the results already; quite a user-experience improvement over grep 😉.

One thing that could be considered is adding the search box to the conversation views. You could load the index file via AJAX once someone clicks on the search box, to work around freezing browser windows on really large datasets. If it really becomes a problem, you could also think about splitting the index into multiple files that are loaded independently; e.g., if you only search in a given conversation, you could have a separate index file for just that conversation.
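A rough illustration of the per-conversation splitting idea (the field names here are invented, not the actual searchidx.js schema):

```javascript
// Hypothetical sketch: split one flat message index into per-conversation
// slices, so a search scoped to one conversation only scans (or loads)
// that slice. Field names are assumed: c = conversation id, b = body.
const message_idx = [
  { c: 1, b: "see you tomorrow" },
  { c: 2, b: "tomorrow works for me" },
  { c: 1, b: "running late" },
];

// Group the flat index by conversation id; each slice could be written
// out as its own index file and loaded on demand.
function splitByConversation(idx) {
  const slices = new Map();
  for (const msg of idx) {
    if (!slices.has(msg.c)) slices.set(msg.c, []);
    slices.get(msg.c).push(msg);
  }
  return slices;
}

// Search only within one conversation's slice.
function searchConversation(slices, convId, term) {
  return (slices.get(convId) || []).filter(m => m.b.includes(term));
}

const slices = splitByConversation(message_idx);
console.log(searchConversation(slices, 1, "tomorrow")); // only the hit in conversation 1
```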

I quite like the simplicity of having a bunch of files on disk, but if you are willing to add a server component, you might be more flexible: you could use sqlite natively (e.g., a small nodejs backend which interacts with sqlite) or even use it in the browser via wasm (https://sqlite.org/wasm/doc/trunk/demo-123.md). Optionally, you could maintain your own schema with optimized indices. This would probably be a good way to go if a single archive of all messages should be created, with backup files just being the transport mechanism for offloading new messages from your phone into the archive.

bepaald commented 1 year ago

Thanks for the feedback. Glad it is mostly functional.

I was working on another issue, while I also had a couple of changes to this code lined up, so a little update:

> I had a look into the index itself. Sure, you can always optimize such things, e.g. deduplicate repeated strings via IDs, a bit like database normalization, but I'm not sure whether that is worth it or just overcomplicates the implementation. Search engines typically apply techniques like tokenizing and stemming, and store the positions of hits rather than the entire message (you might already know this, but lunr.js outlines the concepts at https://lunrjs.com/guides/core_concepts.html if you are interested).

That's an interesting read, but indeed I don't know if it's worth the trouble in this case. As for stemming: maybe it's not even desirable to get results for 'walking' when you search for 'walked'. And if you do want that, you can just search for 'walk' yourself: manual stemming :)

> One thing that could be considered is adding the search box to the conversation views. You could load the index file via AJAX once someone clicks on the search box, to work around freezing browser windows on really large datasets. If it really becomes a problem, you could also think about splitting the index into multiple files that are loaded independently; e.g., if you only search in a given conversation, you could have a separate index file for just that conversation.

This is not a bad idea. I might do that at some point in the future, but I'm taking a short break from larger changes for the moment (I'm swamped at work currently).

> I quite like the simplicity of having a bunch of files on disk, but if you are willing to add a server component, you might be more flexible: you could use sqlite natively (e.g., a small nodejs backend which interacts with sqlite) or even use it in the browser via wasm (https://sqlite.org/wasm/doc/trunk/demo-123.md). Optionally, you could maintain your own schema with optimized indices. This would probably be a good way to go if a single archive of all messages should be created, with backup files just being the transport mechanism for offloading new messages from your phone into the archive.

Again, it seems overly complicated to me. Also, all these options require loading remote content, while I very much like the fact that all the generated pages currently work offline. Especially considering the personal content, and the fact that javascript is running (when --themeswitching or --searchpage are supplied), I think being able to see no activity on your network is an advantage.

Meteor0id commented 1 year ago

Print width could be better, otherwise looking great.

bepaald commented 1 year ago

Thanks. I didn't think printing the search page would be useful; should I add a whole @media CSS section to this page as well? I could probably just copy it from one of the other pages...

kohms commented 1 year ago

> Thanks for the feedback. Glad it is mostly functional.

> I was working on another issue, while I also had a couple of changes to this code lined up, so a little update:

> • I've adjusted the look of the results slightly (the red highlighting on hover is gone); it looks a tiny bit better. But overall, I still think the results look too much like a conversation, while they could be completely unrelated messages from different conversations.
> • Added links to the search page on the index and all conversations. When going to the search page from a conversation, the search should automatically be limited to that conversation.

I just tried the latest binary, that's quite cool, thanks 🙂

> • Added the charset in the meta tag, like on all other generated pages. (Good catch; in my defense: it all works perfectly fine without the meta tag on my system.)

😆 Yeah, not sure why Chrome is not using UTF-8 by default; I guess it's due to my German locale 🤔

>> I had a look into the index itself. Sure, you can always optimize such things, e.g. deduplicate repeated strings via IDs, a bit like database normalization, but I'm not sure whether that is worth it or just overcomplicates the implementation. Search engines typically apply techniques like tokenizing and stemming, and store the positions of hits rather than the entire message (you might already know this, but lunr.js outlines the concepts at https://lunrjs.com/guides/core_concepts.html if you are interested).

> That's an interesting read, but indeed I don't know if it's worth the trouble in this case. As for stemming: maybe it's not even desirable to get results for 'walking' when you search for 'walked'. And if you do want that, you can just search for 'walk' yourself: manual stemming :)

Fair point. I guess the default search-engine rules don't necessarily apply to the Signal search use case, and in the end a regex can probably express everything with enough fine-tuning 😅

>> One thing that could be considered is adding the search box to the conversation views. You could load the index file via AJAX once someone clicks on the search box, to work around freezing browser windows on really large datasets. If it really becomes a problem, you could also think about splitting the index into multiple files that are loaded independently; e.g., if you only search in a given conversation, you could have a separate index file for just that conversation.

> This is not a bad idea. I might do that at some point in the future, but I'm taking a short break from larger changes for the moment (I'm swamped at work currently).

Totally understandable; judging by my GitHub notifications, you were quite active the last few weeks, and I guess this is also in the "don't fix it if it's not broken" category 😄 I haven't hit performance issues yet. Others might have bigger archives, but even those could be broken down by limiting the HTML export to a given timeframe and just doing multiple exports from the same backup.

>> I quite like the simplicity of having a bunch of files on disk, but if you are willing to add a server component, you might be more flexible: you could use sqlite natively (e.g., a small nodejs backend which interacts with sqlite) or even use it in the browser via wasm (https://sqlite.org/wasm/doc/trunk/demo-123.md). Optionally, you could maintain your own schema with optimized indices. This would probably be a good way to go if a single archive of all messages should be created, with backup files just being the transport mechanism for offloading new messages from your phone into the archive.

> Again, it seems overly complicated to me. Also, all these options require loading remote content, while I very much like the fact that all the generated pages currently work offline. Especially considering the personal content, and the fact that javascript is running (when --themeswitching or --searchpage are supplied), I think being able to see no activity on your network is an advantage.

Agreed. I also like the fact that HTML is probably a stable format which will still be readable years from now. Once you have a server component, you don't necessarily know whether the server will still start on the OS available 10 years from now. Also (and I hope that never happens 😉), if for some reason this project is not continued, you don't have a kind of vendor lock-in.

Huntingtor commented 11 months ago

I just discovered the --searchpage option. Thank you for this improvement, it works flawlessly.

I don't want to be greedy, but how about additional search options "Match case" and "Whole words only"?

bepaald commented 11 months ago

> I don't want to be greedy, but how about additional search options "Match case" and "Whole words only"?

Whole words only: the implementation would get a bit messy, I think, and the functionality is already there through the regex option. In regular expressions, the special metacharacter \b matches a word boundary. So to search for the whole word 'test', you enable the regex option and search for '\btest\b'.
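The word-boundary trick can be checked in any JavaScript console (the search page presumably relies on JavaScript's RegExp for its regex mode):

```javascript
// \b is a zero-width word-boundary assertion, so /\btest\b/ matches the
// standalone word "test" but not "testing" or "attest".
const wholeWord = /\btest\b/;
console.log(wholeWord.test("this is a test")); // true
console.log(wholeWord.test("testing 1 2 3"));  // false
console.log(wholeWord.test("attest"));         // false

// Combined with the case-insensitive flag (the equivalent of leaving the
// new "Match case" option off):
console.log(/\btest\b/i.test("Test passed"));  // true
```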

Match case: sure, consider it done! (Seriously, I just added it a few minutes ago; let me know if it doesn't work as expected.)

Huntingtor commented 11 months ago

Match case works as expected. Thanks for the regex hint.

bepaald commented 7 months ago

After more than 6 months I finally added the --searchpage option to the README and --help output. :1st_place_medal:

Quickly reading back I believe all issues and suggestions in this thread were dealt with (though correct me if I'm wrong). I believe this issue can be closed. If there are any more important suggestions or problems found, feel free to let me know of course. This issue, or a new one, can always be (re)opened.

Thanks to everyone here for your help!