carderne / signal-export

Export your Signal chats to markdown files with attachments
Other
481 stars 50 forks source link

Markdown style is ambiguous: add newline after each entry, after each blockquote #105

Closed huyz closed 6 months ago

huyz commented 8 months ago

Desktop (please complete the following information):

Describe the bug

Because there's no newline after each entry or after a blockquote, when viewing the index.md in common Markdown viewers, there's too much ambiguity and different renderers will assume that all the lines are part of the same block.

[2023-03-22 08:34] Friend: 
>
> https://github.com/carderne/signal-export
finally found what i was looking for, for ages
>
Thanks for that link  
[2023-03-22 09:17] Friend: You using it?
(- huyz: đź‘Ť -)
[2023-03-22 09:38] Me: love it!  

could be rendered as (standard Markdown rules):

15092

or as (if the renderer is configured to render all new lines as <br> inside paragraphs)

59089

when the markdown could be made less ambiguous by adding newlines:

[2023-03-22 08:34] Friend:
>
> https://github.com/carderne/signal-export
finally found what i was looking for, for ages
>

Thanks for that link

[2023-03-22 09:17] Friend: You using it?
(- huyz: đź‘Ť -)

[2023-03-22 09:38] Me: love it!  

so that they are rendered either as (standard Markdown rules):

59143

or (if the renderer is configured to render all new lines as <br> inside paragraphs)

47161

Here is the relevant style guide rule: https://www.markdownguide.org/basic-syntax/#blockquotes-best-practices

To reproduce Steps to reproduce the behavior. Please include the exact commands tried.

  1. Create a message in Signal that quotes a previous message
  2. Export
  3. Open the message in MacDown or IntelliJ or VS Code
carderne commented 8 months ago

Thanks for raising this, it's a really good point. I'll probably put this behind a flag for now, as it's quite a big change to the output...

Have just pushed the needed change to main: https://github.com/carderne/signal-export/commit/4816f1b3480a5785b9fddfa7b44841333ee3f755

Haven't yet released (i.e. won't be available to pip install yet.

Will get to it in the next day or two.

carderne commented 8 months ago

I've just released a new untested version. if you pip install signal-export==1.8.0 and use the flag --newlines you can give it a whirl

huyz commented 8 months ago

Great, thanks for responding so quickly!

But I have a weird issue now. As soon as I update signal-export as above, I get this error now:

Using Docker to extract data, this may take a while the first time!
Docker process failed, see logs below:
Command '['docker', 'run', '--rm', '--volume=/Users/huyz/Library/Application Support/Signal:/Signal', 'carderne/sigexport:v1.8.0', '--no-use-docker', '--print-data', '--include-empty']' returned non-zero exit status 125.

It used to say:

Copying and renaming attachments
Creating markdown files
Merging old at snapshot.2024-02-03T151753Z into output directory
No existing files will be deleted or overwritten!
Done!

My installation shouldn't be using Docker, so I don't know why this happens as soon as upgrade signal-export

carderne commented 8 months ago

Nothing should have changed whether or how Docker is called, are you sure it wasn't using it before?

You can try downgrade to an earlier version pip install signal-export==1.7.1

or I just pushed a version that will provide more details about why Docker failed pip install signal-export==1.8.1 (possibly the Docker daemon isn't running.)

huyz commented 8 months ago

Oh I was upgrading from 1.6.0 and I see that you've changed the default to --use-docker. I just needed to add --no-use-docker to my invocation. I was confused for a while because I didn't remember ever setting Docker to be used.

huyz commented 7 months ago

I'm still checking it out. I'm a bit confused right now cause some messages are repeated out of order. I think this is Signal's fault, though as I've noticed similar weird things. Bear with me…

carderne commented 7 months ago

About the --no-use-docker: I changed it because on Linux/Windows, it's not straightforward to get it to work properly.

Did you find it simple on macos to just brew install openssl sqlcipher and then the script worked? Maybe I should add a check like if on macos and openssl/sqlcipher binaries available -> skip docker...

huyz commented 7 months ago

Did you find it simple on macos to just brew install openssl sqlcipher and then the script worked?

Yep it worked just fine. I don't need Docker at all.

I wrote:

I'm still checking it out. I'm a bit confused right now cause some messages are repeated out of order. I think this is Signal's fault, though as I've noticed similar weird things.

Actually, it looks like, because I'm running signal-export with --old and with --newlines flag, the old messages aren't recognized so that's why I had everything in duplicate.

Not a big deal if this is fixed, as this is a one-time issue which wouldn't affect anyone running --newlines from the get-go. The current changes here are already good enough for me, as I can clean up the merged output manually one time.

Thanks for your help. Feel free to close this issue.

mitar commented 7 months ago

I think this flag should be the default.

carderne commented 7 months ago

Just need to have a look at @huyz issue around merging with old, then will make it default.

huyz commented 6 months ago

For those who need to retroactively fix the markdown chat history, you can either:

  1. run the entire export without the --old flag
  2. in the case where the past chat history is lost due to auto-delete settings in Signal, you'd have to fix the currently-export markdown.

For the latter, I wrote a perl script to fix things: https://gist.github.com/huyz/f8af7941ffc1c961f325f204b8367017 Just edit with the timestamp of the second message that uses the new format (a format I call number v2) after applying https://github.com/carderne/signal-export/commit/4816f1b3480a5785b9fddfa7b44841333ee3f755 and run it on the index.md in the current file.

This does 2 things on all the events in format 1:

  1. add the newline at the end of the event
  2. try to fix the blockquotes, which is still an issue as I explained in https://github.com/carderne/signal-export/commit/4816f1b3480a5785b9fddfa7b44841333ee3f755#r139671078
    • Not guaranteed to work 100% of the time because format v1 is ambiguous in rare cases. But it should be fine most of the time.

It will not touch events in format v2 (the ones that start with the date that you've configured above) yet. So those untouched events will have correct newlines in between events but the blockquotes will still be wrong. Once the fix for https://github.com/carderne/signal-export/commit/4816f1b3480a5785b9fddfa7b44841333ee3f755#r139671078 is in, we'll have a format v3 defined. At that point, I'll update the above perl script to fix the blockquotes from in the events in format v2 so that everything is in format v3.

huyz commented 6 months ago

@carderne

Just need to have a look at @huyz issue around merging with old, then will make it default.

This issue is pretty clear after I looked at this line of code, which handles the merging and deduplication: https://github.com/carderne/signal-export/blob/649f9f54e98ef1679a99d0fff2bfbb939b1cfeb3/sigexport/main.py#L437

This confirms that the de-duplication requires messages to match exactly. So whenever the output format of signal-export changes, this creates problems as duplicates of those specific messages whose format has changed won't be detected anymore.

Now, if one has the entire chat history still in the Signal DB (e.g., you don't make use of Disappearing messages for any conversation), then it's simple: just don't run with --old and run with --newlines (after https://github.com/carderne/signal-export/pull/112 is released, which I call format v3). And you're done.

If however one or more of your chats uses the Disappearing messages setting and Signal has already auto-disappeared some messages from the local DB, then you have no choice but to try to reuse the existing exports.

To handle the fact that we now have 3 export formats (original v1, v2 with newlines but incorrect blockquoting, and v3 with both newlines and corrected blockquoting), I updated my perl script https://gist.github.com/huyz/f8af7941ffc1c961f325f204b8367017 to handle all cases (as much as possible). This should actually be run as early as possible, before the next signal export is invoked with --old. This script will reformat old messages in index.md (whether in v1 or v2) so that they're compatible with v3. That way signal-export will properly detect duplicates on the next run.

Note that there is a bit of ambiguity with the old messages as we can't always be 100% sure when a blockquote actually ends in formats v1 and v2. But the perl script should work in most, if not all, cases. You'd know if you there are duplicates if you see some messages that are seemingly out of date order. (I actually wrote some script to extracts timestamps and plotted them on observablehq.com to visually check that the dates were monotonously increasing—that gave me a reasonable idea that there aren't any more spurious duplicates; i.e. that the retroactive markdown fixes match the new format v3).

Hope this helps for anyone looking to transition.

huyz commented 6 months ago

Perhaps if someone doesn't use the --old flag, then --newlines can become the default. Backward compatibility of the format only makes sense for --old.

Oh btw, since the format of blockquote has changed, regardless of the --newlines flag, back in v1.8.0, people who have been using --old will most likely already have duplicate issues.

Hmm I wonder what a long term solution for handling new format changes is. Not an easy problem to solve.

carderne commented 6 months ago

Hey @huyz thanks for all the input.

I've updated the merge function so it should handle differences in formats (for now basically just ignore \n and > symbols) and I've made the extra newlines default.

This is all currently in a pre-release version, would be great if you can give it a whirl!

pip install signal-export>=2.0.0

If I don't hear from you I'll assume this is all sorted and close the issue, and then I'll release 2.0.0 so everyone gets the change.