bepaald / signalbackup-tools

Tool to work with Signal Backup files.
GNU General Public License v3.0
755 stars 36 forks source link

[Bug] --searchpage leading to endless loop of "search_idx query failed" and ever-growing HTML file #215

Closed bentolor closed 3 months ago

bentolor commented 3 months ago

First of all many thanks for this invaluable tool! I am very attached to my old conversations out of personal nostalgia and after a 4-year break I would like to export again my Signal collection spanning over many years for long-term archiving purposes. After revisiting all the other, mostly unmaintained alternatives like signal-back, sigbak and others, I stumbled over this gem.

Issue

I initially started my very first --exporthtml with all available bells'n'whistles including --searchpage running into this issue already on a very first thread. The export never stops. The affected HTML file grows and grows with the same message repeated over and over again.

signalbackup-tools (signalbackup-tools) source version 20240604.083452
BACKUPFILE VERSION: 1
BACKUPFILE SIZE: 5697670524
COUNTER: 4267989572
Reading backup file: 100.0%... done!
Database version: 231
Clearing contents of directory 'html'...
Dealing with thread 1
Dealing with thread 2
[Warning]: search_idx query failed 2 results
[Warning]: search_idx query failed 2 results
[Warning]: search_idx query failed 2 results
…

Trying to identify the issue

I'm a total C++ noob, but eventually I was able to spot the location and replicate the relevant SQL query executed into an additional Logger::warning issue.

This is the output on the relevant query

sqlite3 database.sqlite                                                                                                                                                                6.1m  Sa 08 Jun 2024 09:56:20 CEST
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> SELECT json_object('id', message._id, 'b', message.body, 'f', message.from_recipient_id, 'tr', thread.recipient_id, 'o', (message.type & 0x1F) IN (2,11,21,22,23,24,25,26), 'd', message.date_received, 'p', "<<<REDACTED_RECIPIENT>>> (_id2)/<<<REDACTED_RECIPIENT>>>.html") AS line, attachment._id AS rowid, -1 AS uniqueid FROM message LEFT JOIN thread ON thread._id IS message.thread_id LEFT JOIN attachment ON attachment.message_id IS message._id AND attachment.content_type = 'text/x-signal-plain' WHERE message._id = 7280
   ...> ;
{"id":7280,"b":"Im Volltext: \n\nSehr geehrte Frau <<<REDACTED_VERBATIM>>> oder die\nKombination einer routinemäßigen ","f":6,"tr":19,"o":1,"d":1618074357741,"p":"<<<REDACTED_RECIPIENT>>> (_id2)/<<<REDACTED_RECIPIENT>>>.html"}|4769|-1
{"id":7280,"b":"Im Volltext: \n\nSehr geehrte Frau <<<REDACTED_VERBATIM>>> oder die\nKombination einer routinemäßigen ","f":6,"tr":19,"o":1,"d":1618074357741,"p":"<<<REDACTED_RECIPIENT>>> (_id2)/<<<REDACTED_RECIPIENT>>>.html"}|4770|-1

My understanding

I'm not sure what actually causes this issue leading to a duplicate query result. But I'd like to suggest to

  1. Change the corresponding code to not loop indefinitely on an failure in a single row, but instead fail or skip gracefully
  2. If possible identify the root cause for this duplicate. I'm happy to provide additional debugging SQL statement outputs if that helps.

Again: Thank you very much for your great work and this awesome tool!

bepaald commented 3 months ago

Thank you so much for your detailed report. I will look into this shortly.

Change the corresponding code to not loop indefinitely on an failure in a single row, but instead fail or skip gracefully

That was certainly the intention, but I screwed up the logic. This will be quite easy to fix I believe.

If possible identify the root cause for this duplicate. I'm happy to provide additional debugging SQL statement outputs if that helps.

Yes, that is the real problem. I might indeed need some more output from you, however looking into the code and the information you already provided I have a theory. (I'm thinking the special circumstance is the message is an extremely long one (one that is not displayed fully, but says "Read more..." at the bottom) and has another such extremely long message as a quote)

I will be working on this now, it shouldn't take too long, I'll keep you updated. Thanks!

bentolor commented 3 months ago

Hey @bepaald ! Thank you for your quick and supportive response!

(I'm thinking the special circumstance is the message is an extremely long one (one that is not displayed fully, but says "Read more..." at the bottom) and has another such extremely long message as a quote)

I can confirm your guess: Indeed this is a very long message and the message text is cut-off. I couldn't find a quote of that message in that thread. But I think I forwarded the message to multiple other chats. Could this be the culprit?. Interesting edge-case.

Meanwhile I'm already more than happy, as the output I received w/o the --searchpage argument already provides me more value than ever. First time a fully accessible message archive on desktop.

bepaald commented 3 months ago

I've attempted a fix.

First of all, I fixed the logic, so if the query still fails, you should only see the warning (search_idx query failed 2 results) once, but the process should continue properly.

Second, I think I fixed the root cause (simply by adding another constraint to the query (AND attachment.quote = 0).

If you have some time, please test and let me know.

Thanks again for reporting!

EDIT Haha, we were typing simultaneously it seems.

I can confirm your guess: Indeed this is a very long message and the message text is cut-off. I couldn't find a quote of that message in that thread. But I think I forwarded the message to multiple other chats. Could this be the culprit?. Interesting edge-case.

Good, that makes it more likely it is now fixed. Though I hadn't considered the quoted message might not exist anymore. I don't think it makes a difference in this case though.

Meanwhile I'm already more than happy, as the output I received w/o the --searchpage argument already provides me more value than ever. First time a fully accessible message archive on desktop.

Thank you for your kind words. And thanks to bug reports like these, hopefully the program slowly gets better over time!

bentolor commented 3 months ago

I've attempted a fix.

First of all, I fixed the logic, so if the query still fails, you should only see the warning (search_idx query failed 2 results) once, but the process should continue properly.

It seems that part is now working as expected!

*** Starting log: 2024-06-08 14:00:32 *** 
signalbackup-tools source version 20240608.134549
BACKUPFILE VERSION: 1
BACKUPFILE SIZE: 5697670524
COUNTER: 4267989572
Reading backup file: 100.0%... done!
Database version: 231
Clearing contents of directory 'html'...
Dealing with thread 1
Dealing with thread 2
[Warning]: search_idx query failed 2 results
Dealing with thread 3

Second, I think I fixed the root cause (simply by adding another constraint to the query (AND attachment.quote = 0).

Obviously it seems to have a different root cause. While I'm typing this, the export is still running and I an additional warning scrolling by:

Dealing with thread 25
Dealing with thread 26
[Warning]: search_idx query failed 3 results
Dealing with thread 27

Hmmmm…

bepaald commented 3 months ago

Obviously it seems to have a different root cause. While I'm typing this, the export is still running and I an additional warning scrolling by:

Ok, that's a shame. In that case, I'm not sure what is causing the extra results in this case (the query should only have 1 result, hence the warning). Maybe you could run a query for me?

SELECT _id,message_id,content_type,remote_incremental_digest,remote_incremental_digest_chunk_size,cdn_number,transfer_state,transfer_file,data_size,file_name,fast_preflight_id,voice_note,borderless,video_gif,quote,width,height,caption,sticker_pack_id,sticker_pack_key,sticker_id,sticker_emoji,blur_hash,transform_properties,display_order,upload_timestamp,archive_cdn,archive_media_name,archive_media_id,archive_transfer_file,archive_transfer_state,thumbnail_file,thumbnail_random,archive_thumbnail_cdn,archive_thumbnail_media_id FROM attachment WHERE message_id = 7280;

It's probably a but overkill, but it might save me from having to ask again if I miss something. Also, if you're able to identify the other message (the one in thread 26), you could run the same query with 7280 replaced with that message's _id. It might not be necessary (it might simply be the same issue), but since it's returning 3 results instead of 2, it might be something else.

Thanks!

bentolor commented 3 months ago

Here we go. The other hit was the subsequent message. Seems to seconds my assumption of a cross-thread forwarded message?

sqlite3 plain/database.sqlite                                                                                                                                                                                        Sa 08 Jun 2024 14:43:01 CEST
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> SELECT _id,message_id,content_type,remote_incremental_digest,remote_incremental_digest_chunk_size,cdn_number,transfer_state,transfer_file,data_size,file_name,fast_preflight_id,voice_note,borderless,video_gif,quote,width,height,caption,sticker_pack_id,sticker_pack_key,sticker_id,sticker_emoji,blur_hash,transform_properties,display_order,upload_timestamp,archive_cdn,archive_media_name,archive_media_id,archive_transfer_file,archive_transfer_state,thumbnail_file,thumbnail_random,archive_thumbnail_cdn,archive_thumbnail_media_id FROM attachment WHERE message_id = 7280;
4769|7280|text/x-signal-plain||0|2|0||2241|signal-2021-04-10-190557.txt|-2550449821634623864|0|0|0|0|0|0||||-1|||{"skipTransform":true,"videoTrim":false,"videoTrimStartTimeUs":0,"videoTrimEndTimeUs":0,"videoEdited":false}|0|1618074359201|0||||0|||0|
4770|7280|text/x-signal-plain||0|2|0||2241|signal-2021-04-10-190557.txt|-2550449821634623864|0|0|0|0|0|0||||-1|||{"skipTransform":true,"videoTrim":false,"videoTrimStartTimeUs":0,"videoTrimEndTimeUs":0,"videoEdited":false}|0|1618074359201|0||||0|||0|
sqlite> SELECT _id,message_id,content_type,remote_incremental_digest,remote_incremental_digest_chunk_size,cdn_number,transfer_state,transfer_file,data_size,file_name,fast_preflight_id,voice_note,borderless,video_gif,quote,width,height,caption,sticker_pack_id,sticker_pack_key,sticker_id,sticker_emoji,blur_hash,transform_properties,display_order,upload_timestamp,archive_cdn,archive_media_name,archive_media_id,archive_transfer_file,archive_transfer_state,thumbnail_file,thumbnail_random,archive_thumbnail_cdn,archive_thumbnail_media_id FROM attachment WHERE message_id = 7281;
4771|7281|text/x-signal-plain||0|2|0||2241|signal-2021-04-10-190557.txt|-2550449821634623864|0|0|0|0|0|0||||-1|||{"skipTransform":true,"videoTrim":false,"videoTrimStartTimeUs":0,"videoTrimEndTimeUs":0,"videoEdited":false}|0|1618074359201|0||||0|||0|
4772|7281|text/x-signal-plain||0|2|0||2241|signal-2021-04-10-190557.txt|-2550449821634623864|0|0|0|0|0|0||||-1|||{"skipTransform":true,"videoTrim":false,"videoTrimStartTimeUs":0,"videoTrimEndTimeUs":0,"videoEdited":false}|0|1618074359201|0||||0|||0|
4773|7281|text/x-signal-plain||0|2|0||2241|signal-2021-04-10-190558.txt|-2550449821634623864|0|0|0|0|0|0||||-1|||{"skipTransform":true,"videoTrim":false,"videoTrimStartTimeUs":0,"videoTrimEndTimeUs":0,"videoEdited":false}|0|1618074359201|0||||0|||0|
sqlite> 
bentolor commented 3 months ago

BTW: In the signalbackup Output the message is rendered with an attachment. But not in the Signal app itself. Compare the two screenshots below:

Signalbackup-tool output

Hej - wow? Do you even adopt the configured chat colors? Nice! grafik

In the Signal App

grafik

bepaald commented 3 months ago

Thanks!

I was able to reproduce the issue by forwarding a long text to several contacts.

I strongly feel you have discovered a bug in Signal. With each additional contact a long message is forwarded to, an additional copy of the text is added to the message (note: these long messages are actually treated as attachments by Signal). I forwarded a long message to three contacts, and upon export I see this:

[...]
Dealing with thread 6
[Warning]: search_idx query failed 4 results
[...]
Dealing with thread 9
[Warning]: search_idx query failed 3 results
[...]
Dealing with thread 31
[Warning]: search_idx query failed 2 results

I will need to think about how to deal with it, I could very simply ignore any extra attachments of this type (text/x-signal-plain) as any message should only ever have 1 such attachment at most. But I need to think about whether it has any unwanted side effects in other circumstances. Not sure how quickly I get around to it, but I'll keep you informed.

I will at least also file a bug report at Signal, but in my recent experience they don't often respond very quickly.

Hej - wow? Do you even adopt the configured chat colors? Nice!

I try, but I think this is mostly coincidence: it doesn't often work. Maybe I'll dive into that when I have some time to spare.

Thanks again. I'll keep you updated.

EDIT bug report here: https://github.com/signalapp/Signal-Android/issues/13593

bentolor commented 3 months ago

Thanks for the research and very interesting analysis results!

I'm puzzled that this bug still seems to be present in a current Signal app installation and wonder why it does not render as attachment there. This is probably a rare border case.

I'm not sure if it really should be tackled as signalbackup-tool. Unfortunately I share the experience that Signal bug reports rarely get any feedback.

I'm good for now. Again: Thank you very much for your support and I'll also subscribe to your new upstream bug out of pure curiosity.

bepaald commented 3 months ago

I'm puzzled that this bug still seems to be present in a current Signal app installation and wonder why it does not render as attachment there. This is probably a rare border case.

Yes, I think so, it only happens with long text messages that are forwarded to more than 1 contact. That is probably very rare. But even more important, it seems to have no user facing visual effect in Signal. The only way to see this bug is to inspect the database. So that is probably why nobody has found it before.

I'm not sure if it really should be tackled as signalbackup-tool. Unfortunately I share the experience that Signal bug reports rarely get any feedback.

I'm good for now. Again: Thank you very much for your support and I'll also subscribe to your new upstream bug out of pure curiosity.

I think I worked around this bug fairly well now. Building the search_idx will now only fail (gracefully) if the unexpected number of results is zero. In other cases it will simply use the first result, though it will still print a warning if there is more than 1. In the HTML output (and TXT output) I'm now simply skipping all text/x-signal-plain type files during attachment handling.

If you could test again (hopefully for the last time), that would be very much appreciated. What I expect:

  1. There is still a warning printed for those messages (but now it reads: [Warning]: Unexpected number of results from search_idx query (X results, using first))
  2. No attachments are shown for these messages in the HTML output.
  3. The messages are indexed for the search page, and thus searchable. So before, using the searchpage.html with the proper search terms, only the original and possibly the first forwarded messages would show up. Now, I expect all forwarded long messages to appear in the results.

If you could confirm this now works as expected, I'd be happy to close this and just let the Signal developers deal with the bug if/when they can.

Thanks!

bentolor commented 3 months ago

Hey @bepaald : I can confirm all of the expected outcomes. So many thanks for your love and support!

bepaald commented 3 months ago

Thanks again for all your quick and useful feedback!

bentolor commented 2 months ago

Hej cool! They fixed it quite promptly! Kudos for your 1st class upstream bug report!

bepaald commented 2 months ago

Yeah, I'm pleasantly surprised. I guess they probably read all the issues even if it takes them too much time to respond.

When this hits a beta I'll test it, but I'm sure it's probably fixed now. Thanks!