Philipp91 / picasa2digikam

Script to migrate Picasa metadata to digiKam
GNU General Public License v3.0

[bug] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position (...): invalid continuation byte #14

Closed UtopianElectronics closed 2 years ago

UtopianElectronics commented 2 years ago

After applying this patch and running python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv, I get this error message (excerpt from the whole output):

INFO: ===========================================================================================
INFO: Now migrating D:\gallery\2019
Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 138, in migrate_directory
    ini.read(ini_file, encoding='utf8')
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\configparser.py", line 712, in read
    self._read(fp, filename)
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\configparser.py", line 1035, in _read
    for lineno, line in enumerate(fp, start=1):
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 4451: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 140, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".

I have no idea what <frozen codecs> means, and why it's mentioned as a file. Looks like the RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini". error is related to the patch.

Philipp91 commented 2 years ago

If you open up the file D:\gallery\2019\.picasa.ini, what data is there around position 4451? (An editor like Notepad++ allows you to jump to a certain byte position, but you can also post the entire file contents here if it's not sensitive and not super long.)

I'm not sure why the tool so far assumes that the encoding is UTF-8. Sadly none of my own files (i.e. none of my contacts) contain any non-ASCII characters, so I can't distinguish UTF-8 from ISO encodings, for instance. If you find special characters in one of your files, it would be interesting to know what encoding those were using. E.g. in Notepad++, you can change the encoding with which the file is loaded, until the characters are rendered correctly.

UtopianElectronics commented 2 years ago

what data is there around position 4451.

It's the first semicolon at the end of a contact's name (in the [Contacts2] section), same as the lines before and after. The characters in that line have been repeated in the file multiple times earlier. However, some Arabic/Persian characters increment the position number by 2. Double-checked with HxD, and it also shows it to be the ; character.

I'm not sure why the tool so far assumes that the encoding is UTF-8.

Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.

Philipp91 commented 2 years ago

Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.

And the Arabic characters are displayed correctly? Then it should indeed be utf-8.

Double-checked with HxD, and it also shows it to be the ; character.

Then the position (4451) is somehow off. Because if it were a plain ASCII character, then ; would be 0x3b, but the error message complains about a 0xd8 value. And because it says "invalid continuation byte", it might actually be confused by the 1 or 2 bytes before (because some bytes in UTF-8 are a whole character, whereas others need to be continued in the next byte, up to 4 in total I believe).
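To illustrate the point about continuation bytes, here is a minimal sketch. The byte string is made up for the example and not taken from the actual file. 0xd8 is the lead byte of a 2-byte sequence, so UTF-8 expects the next byte to look like 10xxxxxx. If a plain ASCII ; (0x3b) follows instead, Python raises the same kind of error as in the traceback, and the reported position is that of the 0xd8 byte itself.

```python
# Hypothetical sample: a 2-byte lead byte (0xd8) followed by ASCII ';' (0x3b).
sample = b'name\xd8;'

# 0xd8 is 0b11011000: the 110xxxxx prefix marks the start of a 2-byte sequence.
lead_bits = format(0xd8, '08b')

try:
    sample.decode('utf-8')
    reason = None
    position = None
except UnicodeDecodeError as err:
    reason = err.reason    # why decoding failed
    position = err.start   # byte offset of the offending lead byte

print(lead_bits, reason, position)
```

So when the error names byte 0xd8, the byte at the reported offset should be 0xd8 itself, and the byte right after it is the one that is not a valid continuation.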

Philipp91 commented 2 years ago

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

UtopianElectronics commented 2 years ago

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

Yes!

And the Arabic characters are displayed correctly?

Yes.

Then the position (4451) is somehow off.

HxD shows the binary (8-bit) value of position 4451 as 00111011, and 10001100 for position 4450.

Philipp91 commented 2 years ago

0xd8==11011000

Do you see that anywhere around there?

I wonder if we should just commit this change to ISO-8859-1 for everyone. Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++? Or does something else look odd then?

UtopianElectronics commented 2 years ago

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

--dry_run gets executed, but it makes the UTF-8 characters (Persian text) in some people's face tags during INFO: Creating digiKam person tag (...) unreadable, but strangely some other tags also containing Farsi text are fine.

When reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml, the Persian text is always readable.

Also, after running the same command without --dry_run , I noticed this:

INFO: Creating database backup at %s

Where's %s?

And it gets terminated by this error, after a DEBUG: self_contact_to_tag={(...)}:

Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 154, in migrate_directory
    assert contact_to_tag[contact_id] == tag_id
AssertionError

DougRogers commented 2 years ago

Can the .picasa.ini file be posted here?

UtopianElectronics commented 2 years ago

Do you see that anywhere around there?

The nearest one at position 4445.

Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++?

No. It only looks correct with UTF-8.

DougRogers commented 2 years ago

Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?

UtopianElectronics commented 2 years ago

Can the .picasa.ini file be posted here?

Unfortunately no, unless I put some dummy text in there which would make it useless to post.

Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?

If you mean to load the .picasa.ini file, UTF-8 is selected by default, but other options are ANSI, UTF-16 LE, UTF-16 BE, and UTF-8 with BOM.

DougRogers commented 2 years ago

@UtopianElectronics Can you create a sharable .picasa.ini file that has the same issues?

Yes, I was referring to the .picasa.ini file. Notepad lists the encoding of the current file when saving, so the file is UTF-8.

DougRogers commented 2 years ago

When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?

Philipp91 commented 2 years ago

Also, after running the same command without --dry_run , I noticed this: INFO: Creating database backup at %s

That was already fixed: https://github.com/Philipp91/picasa2digikam/commit/c941174a139cab19cbce7790724e786423fc4f4c

Philipp91 commented 2 years ago

The fact that loading with ISO-8859-1 in Python works but then some other characters are messed up can only mean one of two things, I believe: Either the file legitimately contains multiple different encodings, which would be quite the hassle to deal with, or it's meant to be UTF-8 but somehow a few invalid characters ended up in there. I think we should find out what happens around that 0xd8 byte. Does that byte make sense in ISO-8859-1 encoding, or is it a garbage byte no matter how one would interpret it?
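For context, a small sketch of why ISO-8859-1 "works" yet garbles text. The sample string is made up. Python's ISO-8859-1 codec maps each of the 256 byte values to a character, so decoding never fails, but every UTF-8 multi-byte character turns into two or more Latin-1 characters. In that encoding 0xd8 is simply Ø.

```python
# Hypothetical sample text: two Persian letters, each 2 bytes in UTF-8.
original = 'دی'
raw = original.encode('utf-8')

# Decoding as ISO-8859-1 cannot fail, but yields one character per byte.
garbled = raw.decode('iso-8859-1')

# Every possible byte value decodes without error in ISO-8859-1.
all_bytes_ok = bytes(range(256)).decode('iso-8859-1')

# The damage is reversible as long as the garbled text is not modified.
restored = garbled.encode('iso-8859-1').decode('utf-8')

print(raw, len(garbled), restored == original)
```

This is why a successful load with ISO-8859-1 says little about the real encoding of the file.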

The nearest one at position 4445.

That's pretty close actually. The discrepancy could be caused by one system counting bytes and the other counting characters. If no other 0xd8 byte is in the vicinity, it's safe to assume it's that one. So what's the context there, i.e. what do the surrounding bytes mean in ASCII? Is it important information and can we deduce something about the meaning of the 0xd8 byte from that?
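Rather than hunting for the byte by hand, something like the following sketch could list every offset where UTF-8 decoding fails. find_invalid_utf8 is a hypothetical helper written for this thread, not part of picasa2digikam, and the sample bytes are invented.

```python
def find_invalid_utf8(data):
    """Return the byte offsets at which UTF-8 decoding of `data` fails."""
    bad_offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            bad_offsets.append(pos + err.start)
            pos += err.end
    return bad_offsets


# Hypothetical sample: valid Persian text, a stray 0xd8 before ';', more valid text.
sample = 'دی'.encode('utf-8') + b'\xd8;' + 'ک'.encode('utf-8')
offsets = find_invalid_utf8(sample)
print(offsets)
```

For a real file, the data would come from open(path, "rb").read(), and data[offset - 20:offset + 20] shows the bytes around each hit.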

DougRogers commented 2 years ago

I am new to encoding, but it looks like this is not a straightforward issue. It looks like detecting the actual encoding is non-trivial. This file is probably not UTF-8, but is being reported as such.

UtopianElectronics commented 2 years ago

When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?

UTF-8.

the file legitimately contains multiple different encodings

Not sure about that, but I don't think it's the case.

Does that byte make sense in ISO-8859-1 encoding

When I select ISO-8859-1 in Notepad++ (Encoding > Character sets > Western European > ISO 8859-1), position 4451 changes place and moves to the beginning of the 16-character string at the beginning of another line.

UtopianElectronics commented 2 years ago

I deleted the line in .picasa.ini that had the faulty byte at position 4451, plus the two lines before and after it, but still it says UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 4451: invalid continuation byte. Is it really about .picasa.ini, or is it referring to position 4451 somewhere else?

UtopianElectronics commented 2 years ago

Is this doable here in this code? How do I test it?

Philipp91 commented 2 years ago

Just to double-check, the file in question should be "D:\gallery\2019\.picasa.ini".

Is this doable here in this code?

Well, maybe.

picasa2digikam uses the configparser library. You can open a python shell and hopefully reproduce that same error with these few lines:

import configparser
ini = configparser.ConfigParser(strict=False)
ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')

To plug in the codecs package with that errors='ignore' workaround, try this:

import configparser
import codecs
ini = configparser.ConfigParser(strict=False)
with codecs.open("D:\\gallery\\2019\\.picasa.ini", 'r', encoding='utf-8', errors='ignore') as fdata:
    ini.read_file(fdata)

Philipp91 commented 2 years ago

This service promises client-side (i.e. privacy-preserving) UTF-8 validation: https://onlineutf8tools.com/validate-utf8

Philipp91 commented 2 years ago

position 4451 changes place

Then Notepad++ is clearly counting characters, not bytes. Whereas the error message from Python is most likely based on counting bytes. That explains the discrepancy.
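A tiny sketch of that discrepancy, using a made-up string: each Persian letter counts as one character but takes two bytes in UTF-8, so the same ; sits at different offsets depending on what is being counted.

```python
# Hypothetical sample: two Persian letters followed by ASCII text.
text = 'دی;rest'

char_index = text.index(';')                   # offset counted in characters
byte_index = text.encode('utf-8').index(b';')  # offset counted in bytes

print(char_index, byte_index)
```

The gap grows by one for every 2-byte character that precedes the position, which fits the earlier observation that Arabic/Persian characters increment the position by 2.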

You can try to snip out a section around the byte in question like this:

with open("D:\\gallery\\2019\\.picasa.ini", "rb") as f:
    d = f.read()
print(d[4400:4500])  # Print 100 bytes around the problem byte. If this turns out non-sensitive, you can post it here.
assert d[4451] == 0xd8  # Make sure we understood the offset right

Or decode it like this, which would presumably fail with an error similar to the one raised when the data is decoded during file loading:

d.decode('utf-8')
d[4400:4500].decode('utf-8')

UtopianElectronics commented 2 years ago

ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')

It just outputs ['D:\\gallery\\2019\\.picasa.ini'] and no errors. Not sure what it means. Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.

To plug in the codecs package with that errors='ignore' workaround, try this

Tried it and it shows nothing.

This service promises client-side (i.e. privacy-preserving) UTF-8 validation

It says it's valid.

as f

Shouldn't it be as fh? Because it gives an error: "NameError: name 'fh' is not defined. Did you mean: 'f'?" I tried running it with as fh and it prints some characters and an AssertionError at the end.

d[4400:4500].decode('utf-8')

It decodes everything smoothly without any problem, and it showed the same characters as Notepad++. I used it like this:

with open("D:\\gallery\\2019\\.picasa.ini", "rb") as fh:
    d = fh.read()
print(d[4400:4500].decode('utf-8')) 

There are multiple subdirectories (folders) inside 2019. Could it be causing any problem?

A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.

Philipp91 commented 2 years ago

Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.

Shouldn't it be as fh?

Yeah, those were just some typos on my part, sorry.

It just outputs ['D:\\gallery\\2019\\.picasa.ini'] and no errors. Not sure what it means.

Tried it and it shows nothing.

After this, the file has been read, so apparently it did succeed in loading the file. You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded.

It's plausible that the attempt with the codecs package and errors='ignore' went through, but I'm surprised that apparently the attempt with just ini.read("D:\\gallery\\2019\\.picasa.ini", encoding='utf8') threw no errors either. That's pretty much what picasa2digikam also runs (or so I believed) when it runs into this 4451 error. Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?

A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.

Yeah, something is fishy here. Perhaps picasa2digikam doesn't load the ini file as intended. How about:

import configparser
import pathlib
ini = configparser.ConfigParser(strict=False)
ini.read(pathlib.Path("D:\\gallery\\2019\\.picasa.ini"), encoding='utf8')
print(list(ini.items()))

This should really be 100% what picasa2digikam calls when that 4451 error happens.


Perhaps that file has some restrictions on it that make it impossible for picasa2digikam to access (it is a hidden file after all) and then it instead receives some error message that has a non-ASCII character at 4451? You could try patching the following into migrator.py above the ini.read(... line:

with open(ini_file, "rb") as f:
    print(f'Here comes {ini_file}:')
    print(f.read())

I'd expect this to print the whole file's contents onto the console, but perhaps we get something else (like the supposed error message) instead.

UtopianElectronics commented 2 years ago

You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded.

Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?

Again, it outputs nothing, or maybe I'm doing it wrong? What should be in the code before them?

This should really be 100% what picasa2digikam calls when that 4451 error happens.

It works fine and without any error, also no unreadable characters in the output.

You could try patching the following into migrator.py above the ini.read(... line:

There are two ini.read(... instances, which one do you mean? Also, what should the indentation be exactly? I'm getting some indentation errors and tried fixing them, but I don't know if I changed the meaning of the code. Regardless of that, I still get that old error message.

Philipp91 commented 2 years ago

I meant like this: https://github.com/Philipp91/picasa2digikam/pull/15

UtopianElectronics commented 2 years ago

So I ran gh pr checkout 15 and then python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv. Here's the output:

INFO: Now migrating D:\gallery\2019
Here comes D:\gallery\2019\.picasa.ini in binary:

It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess is because it's binary? Then:

That was D:\gallery\2019\.picasa.ini.
Here comes D:\gallery\2019\.picasa.ini in UTF-8:
Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
    print(f.read())
          ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 602467: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".

Philipp91 commented 2 years ago

Huh, what's up with that position suddenly jumping to 602467? Wasn't it 4451 before? Is the file even that long (0.6MB)?

It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess is because it's binary?

Yes, that's okay, as long as the other characters (I assume most of the ini file is regular ASCII stuff) are output normally. How does the end look, i.e. shortly before the That was D:\gallery\2019\.picasa.ini. bit? Does it actually output precisely the end of your ini file too?


I've updated the patch. I guess you can get it with git pull or so, perhaps with -f. Now it also prints the length of the string and it decodes it after reading it as binary, let's see if that also fails or succeeds.

UtopianElectronics commented 2 years ago

602467

Checked it with Notepad++. It was a part of a file name, and it was actually shown as one of those strange symbols that Notepad++ shows if you open for example an image file. I deleted that single character and the code now seems to work fine.

Strangely enough, I saved the file to another location and when I opened it, that strange symbol had changed to a readable character. So it was probably an encoding bug or something in Picasa, because the actual file doesn't have that extra character in its name (I might have renamed it outside Picasa).

Wasn't it 4451 before?

I think 4451 was for another .picasa.ini file, the one at D:\gallery.

Is the file even that long (0.6MB)?

Yes, it's 595 KB.

Does it actually output precisely the end of your ini file too?

Yes.

I've updated the patch.

As the previous one seems to be working, I'm now going to check further if it's really working. I'll share the findings here.

UtopianElectronics commented 2 years ago

When I want to export the output of the command to a text file using both > log.txt and | echo > log.txt, the program gets terminated with errors. The log file ends with Here comes D:\gallery\.picasa.ini in UTF-8: and the non-ASCII UTF-8 characters in the log file are again in the format of "UTF-8 (in literal)" as shown here too.

Here's one of the many errors in the output (not in the log file):

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\logging\__init__.py", line 1113, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 63-70: character maps to <undefined>
Call stack:
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 74, in migrate_directories_under
    logging.debug(f'{contact.attrib}')
Message: "{'id': 'e5ce9e6c386f84fa', 'name': '[REDACTED]', 'modified_time': '2022-01-18T13:03:09+03:30', 'local_contact': '1'}"
Arguments: ()
Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
    print(f.read())
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\.picasa.ini".

Philipp91 commented 2 years ago

Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?

When I want to export the output of the command to a text file

And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?

cp1252.py

Looks like it's trying to log with non-UTF-8 too. Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.
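For illustration only, and not necessarily what the linked fix does: the cp1252 codec has no mapping for Persian letters, so writing them to a cp1252 stream fails, while a UTF-8 stream accepts them. The sketch below assumes an in-memory stream and a logger named encoding_demo, both made up for the example.

```python
import io
import logging

# cp1252 cannot represent Persian letters, so encoding raises an error.
try:
    'ی'.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# A stream that encodes as UTF-8 accepts the same text.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding='utf-8')
handler = logging.StreamHandler(stream)

logger = logging.getLogger('encoding_demo')  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info('name: ی')
handler.flush()
logged = buffer.getvalue().decode('utf-8')
print(cp1252_ok, repr(logged))
```

Outside of the code, setting the PYTHONIOENCODING=utf-8 environment variable before redirecting output is another way to make Python write redirected output as UTF-8.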

UtopianElectronics commented 2 years ago

Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?

Yes. However, I suggest a workaround that would automatically ignore those characters without having to manually remove them. Here it mentions errors='ignore' but I'm not sure if it could also be an option for picasa2digikam.
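A minimal sketch of what such a lenient read might look like. read_ini_lenient is a hypothetical helper, not something picasa2digikam provides, and the ini content is invented. It uses errors='replace' rather than errors='ignore' so that a damaged byte shows up as U+FFFD instead of silently disappearing.

```python
import configparser
import os
import tempfile


def read_ini_lenient(path):
    """Parse an ini file as UTF-8, replacing undecodable bytes with U+FFFD."""
    ini = configparser.ConfigParser(strict=False)
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        ini.read_file(f)
    return ini


# Hypothetical file content with one stray 0xd8 byte inside a value.
raw = b'[Contacts2]\nabc=Name\xd8;;\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.ini') as tmp:
    tmp.write(raw)

ini = read_ini_lenient(tmp.name)
os.remove(tmp.name)

value = ini['Contacts2']['abc']
print(repr(value))
```

The trade-off is that a file name or contact name containing the replacement character will no longer match the real one, so the affected entry may still be skipped or mismatched later on.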

And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?

Well, I'd like to debug anything and help to make this program as flawless as it could be! But isn't the encoding issue in the dry-run already fixed? Yes, I want to carefully examine the log.

Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.

It's a shame that I'm not very familiar with git. How can I exactly keep the current code and try the new commit without losing the previous version?

Philipp91 commented 2 years ago

How can I exactly keep the current code

So I assume you don't want to lose it. Then it's best to give it a name, which in Git is a branch (or a tag). If you've made local modifications (git status has non-empty output), you need to commit them first. Then you can do git checkout -b thisworks to create a branch, or git tag thisworks to create a tag, with thisworks being a name you'll understand in the future. You can find those again with git branch or git tag. Then to apply the patch on top, download all the new commits (git fetch) so that it becomes known locally, and then git cherry-pick 676e50f9064a3e308532a926d21711a6138b0c94.

UtopianElectronics commented 2 years ago

Thanks a lot! After applying this patch, I could successfully export the output to a text file. The log file seems fine, but just a small issue with \u200c instead of real half space, as also mentioned here and here. But it's not such a big deal. Fixing it, however, would be nice.

Philipp91 commented 2 years ago

I can't reproduce this. Which log output is this referring to? The one you (only) get from #15 (which I don't intend to merge ever)? Or could you identify another logging.info() or logging.debug() statement that produces this log output?

And why do you care? Besides reading the log file, do you have another use case where you need the characters to be output correctly? When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?

UtopianElectronics commented 2 years ago

I had just exported the command output to a text file, and noticed this:

INFO: Traversing input directories
DEBUG: Reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml
DEBUG: {'id': '2c112b01a7d580c5', 'name': '[REDACTED] ي\u200cک [REDACTED]', 'modified_time': '2022-01-24T16:27:25+03:30', 'local_contact': '1'}

It's after the recent patch. I don't know if the older ones would result in this as well.

And why do you care?

I don't. I just thought maybe it would result in errors later at final steps.

When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?

It's more than 71000 lines and I can't scroll back to it on the terminal despite changing the buffer size to 100000. However, pausing the output with the Pause/Break key shows that it's the same on the terminal as well.

Philipp91 commented 2 years ago

I think it's intended that log output uses \u200c instead of the proper representation. After all, it's meant for debugging purposes and not for end-user output. So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.
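A short sketch of where the \u200c comes from, using a made-up attribute dict: the traceback above shows logging.debug(f'{contact.attrib}'), and formatting a dict uses repr() on its values. repr() escapes non-printable characters such as the zero-width non-joiner (U+200C) but leaves printable Arabic letters alone. The string itself is unchanged.

```python
# Hypothetical attribute dict with a zero-width non-joiner in the name.
attrib = {'name': 'ي\u200cک'}

as_logged = f'{attrib}'    # dict formatting applies repr() to the value
as_text = attrib['name']   # the underlying string still holds the real character

print(as_logged)
print(len(as_text))
```

Logging the value itself, e.g. attrib['name'], instead of the whole dict would write the real character, at the cost of making invisible characters harder to spot in the log.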

I suggest a workaround that would automatically ignore those characters without having to manually remove them.

How many characters in total were affected in your case, and across how many different files? If just a single bit flipped on your disk, I'm inclined to call that a random coincidence and wouldn't change the code.

UtopianElectronics commented 2 years ago

So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.

That's fine. Thank you.

How many characters in total were affected in your case, and across how many different files?

Just one character and one file.

This issue seems to be fixed by now. I'm closing it for now.