undefined character error

da5nsy commented 2 years ago

Desktop (please complete the following information):

OS and version: Windows 10
Terminal/shell used: Miniconda

Describe the bug: An undefined character causes an error.

│ C:\Users\cege-user\miniconda3\envs\signal_backup\lib\encodings\cp1252.py:19 in encode            │
│                                                                                                  │
│    16                                                                                            │
│    17 class IncrementalEncoder(codecs.IncrementalEncoder):                                       │
│    18 │   def encode(self, input, final=False):                                                  │
│ >  19 │   │   return codecs.charmap_encode(input,self.errors,encoding_table)[0]                  │
│    20                                                                                            │
│    21 class IncrementalDecoder(codecs.IncrementalDecoder):                                       │
│    22 │   def decode(self, input, final=False):                                                  │
│                                                                                                  │
│ ┌────────────────────────────────── locals ──────────────────────────────────┐                   │
│ │ final = False                                                              │                   │
│ │ input = '[2021-01-07 15:48] JamesDrane: 👍🏼  '                            │                   │
│ │  self = <encodings.cp1252.IncrementalEncoder object at 0x00000204679442B0> │                   │
│ └────────────────────────────────────────────────────────────────────────────┘                   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31-32: character maps to <undefined>

(signal_backup) C:\Users\cege-user>

carderne commented 2 years ago

I need a little bit more of the error trace, specifically the part that points out on which line of sigexport/main.py the error happened.

da5nsy commented 2 years ago

🤔 I don't think I have that - above the paginate=100 line (top line in the screenshot) there are just hundreds of lines of data. Any pointers?

carderne commented 2 years ago

OK, so the way signal-export works, assuming you used the Docker method, is that it runs the extraction inside Docker, dumps the data out as a JSON string, and then processes that data outside Docker.

When it loads the text, it is trying to do so with cp1252 encoding (Windows US default, as far as I can tell), whereas it should be using utf-8.
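To illustrate the mismatch, here is a minimal sketch that is independent of signal-export. It encodes the message from the traceback above with both codecs. The helper name `try_encode` is hypothetical and only exists for this illustration; the emoji is written with escape sequences so the snippet does not depend on the source file's encoding.

```python
# Message taken from the traceback: thumbs-up emoji plus a skin-tone modifier.
text = "[2021-01-07 15:48] JamesDrane: \U0001F44D\U0001F3FC"


def try_encode(s, encoding):
    """Hypothetical helper: return the encoded bytes, or the error if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


cp1252_result = try_encode(text, "cp1252")  # cp1252 has no mapping for emoji
utf8_result = try_encode(text, "utf-8")     # UTF-8 can represent any code point

# Print only ASCII summaries so this also runs on a cp1252 console.
print(type(cp1252_result).__name__, "at position", cp1252_result.start)
print(type(utf8_result).__name__, "of length", len(utf8_result))
```

The failure position reported for cp1252 should line up with the start of the range in the error message, since the emoji begins right after the `JamesDrane: ` prefix.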

Are you able to edit the file sigexport/main.py on your computer? It's probably somewhere near this file: C:\Users\cege-user\miniconda3\envs\signal_backup\lib\encodings\cp1252.py. If so, edit this line: https://github.com/carderne/signal-export/blob/15b892833599043bf068166b372f0bcbbd4af396/sigexport/main.py#L519

As follows:

             p = subprocess.run(cmd, capture_output=True, text=True, check=True)
-            data = json.loads(p.stdout)
+            data = json.loads(p.stdout.encode("utf-8"))
             convos, contacts = data["convos"], data["contacts"]

And see if that fixes the problem!

da5nsy commented 2 years ago

Thank you @carderne!

For our future reference, the file was at C:\Users\cege-user\miniconda3\envs\signal_backup\Lib\site-packages\sigexport\main.py

I changed the line, but unfortunately, the result was the same.

Blacklands commented 1 year ago

I have the same issue. I'm also on Windows (10), using Docker through WSL (in case it matters).

The failure first happens in

│ main.py:571 in main
│                                                                                                  
│   568 │   secho("Creating markdown files")                                                       
│   569 │   for md_path, md_text in create_markdown(dest, convos, contacts, quote):                
│   570 │   │   with md_path.open("a") as md_file:                                                 
│ ❱ 571 │   │   │   print(md_text, file=md_file)

I've changed the encoding for open() to "utf-8". That fixed that, but it next breaks in

│ main.py:266 in create_html
│                                                                                                  
│   263 │   │   │   # touch first                                                                  
│   264 │   │   │   open(path, "a")                                                                
│   265 │   │   │   with path.open() as f:                                                         
│ ❱ 266 │   │   │   │   lines_raw = f.readlines()

After adding UTF-8 encoding to both of these lines as well, the next time it breaks in

│ main.py:582 in main
│                                                                                                  
│   579 │   │   │   paginate = int(1e20)                                                           
│   580 │   │   for ht_path, ht_text in create_html(dest, msgs_per_page=paginate):                 
│   581 │   │   │   with ht_path.open("w") as ht_file:                                             
│ ❱ 582 │   │   │   │   print(ht_text, file=ht_file)

After that, it works.
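For reference, the kind of change described above can be sketched roughly like this. It is not the actual signal-export code: the temporary directory and the file name `chat.md` are assumptions made for the example, and only the pattern of passing `encoding="utf-8"` to every `open()` call is the point.

```python
import tempfile
from pathlib import Path

# Sample message containing characters that cp1252 cannot encode.
md_text = "[2021-01-07 15:48] JamesDrane: \U0001F44D\U0001F3FC"

with tempfile.TemporaryDirectory() as tmp:
    md_path = Path(tmp) / "chat.md"  # hypothetical output file

    # Write with an explicit encoding instead of relying on the platform default.
    with md_path.open("a", encoding="utf-8") as md_file:
        print(md_text, file=md_file)

    # Read back with the same explicit encoding.
    with md_path.open(encoding="utf-8") as f:
        lines_raw = f.readlines()

print(len(lines_raw), "line(s) read back")
```

Without the `encoding` argument, both the write and the read would use whatever encoding the platform reports as its default, which is where cp1252 comes in on this Windows setup.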

The output seems to be correct, too. I checked the place in my chats where it first breaks, and it seems it's either an ö (German umlaut) or an emoji that comes a couple characters after that.

So it seems that everywhere Python falls back to the cp1252 encoder, it breaks on the first "special" character.

Shouldn't the encoding set in the code be UTF-8, by default? Why use any other encoding? Is there a reason to not just change it to that?
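As far as I understand it, when `open()` is called in text mode without an `encoding` argument, Python uses the locale's preferred encoding rather than a fixed one, unless UTF-8 mode is enabled (for example with `-X utf8` or the `PYTHONUTF8=1` environment variable). A small sketch to check what a given machine would use; the variable names are just for illustration:

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

# True when Python was started in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", utf8_mode)
```

On the Windows setups in this thread, the first line would presumably report cp1252, which matches the encoder shown in the tracebacks.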

carderne commented 1 year ago

Hi @Blacklands, thanks for looking into that. I haven't changed anything yet because I wasn't sure whether this would fix it, and I'm generally unsure how encoding works on Windows.

Could you please submit a PR with the changes you made?

carderne / signal-export

undefined character error #78