Closed UtopianElectronics closed 2 years ago
If you open up the file D:\gallery\2019\.picasa.ini, what data is there around position 4451? (An editor like Notepad++ allows you to jump to a certain byte position, but you can also post the entire file contents here if it's not sensitive and not super long.)
I'm not sure why the tool so far assumes that the encoding is UTF-8. Sadly none of my own files (i.e. none of my contacts) contain any non-ASCII characters, so I can't distinguish UTF-8 from ISO encodings, for instance. If you find special characters in one of your files, it would be interesting to know what encoding those were using. E.g. in Notepad++, you can change the encoding with which the file is loaded, until the characters are rendered correctly.
what data is there around position 4451.
It's the first semicolon at the end of a contact's name (in the [Contacts2] section), same as the lines before and after. The characters in that line have been repeated in the file multiple times earlier. However, some Arabic/Persian characters increment the position number by 2. Double-checked with HxD, and it also shows it to be the ; character.
I'm not sure why the tool so far assumes that the encoding is UTF-8.
Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.
Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.
And the Arabic characters are displayed correctly? Then it should indeed be utf-8.
Double-checked with HxD, and it also shows it to be the ; character.
Then the position (4451) is somehow off. If it were a plain ASCII character, ; would be 0x3b, but the error message complains about a 0xd8 value. And because it says "invalid continuation byte", it might actually be confused by the 1 or 2 bytes before (because some bytes in UTF-8 are a whole character, whereas others need to be continued in the next byte, up to 4 in total I believe).
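A minimal sketch of how this exact error message can arise, using made-up sample bytes rather than anything from the real file: 0xd8 announces a two-byte sequence, so a following ; (0x3b) cannot be its continuation byte.

```python
# Made-up sample data, not taken from the real .picasa.ini.
# ';' is plain ASCII and encodes to the single byte 0x3b.
semicolon = ";".encode("utf-8")

# The Arabic letter dal (U+062F) encodes to two bytes: 0xd8 0xaf.
dal = "\u062f".encode("utf-8")

# A 0xd8 lead byte followed by ';' instead of a continuation byte is invalid.
broken = b"abc\xd8;"
try:
    broken.decode("utf-8")
    error = None
except UnicodeDecodeError as err:
    error = err
    print(err)  # names byte 0xd8, its position, and the reason
```

Note that the reported position is that of the lead byte (0xd8), not of the byte that failed to continue it.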
By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?
By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?
Yes!
And the Arabic characters are displayed correctly?
Yes.
Then the position (4451) is somehow off.
HxD shows the binary (8-bit) value of position 4451 as 00111011, and 10001100 for position 4450.
0xd8==11011000
Do you see that anywhere around there?
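To read those bit patterns, here is a small sketch that classifies a byte by its leading bits. The helper name classify_utf8_byte is made up for illustration, and it deliberately ignores finer rules such as overlong or out-of-range lead bytes.

```python
def classify_utf8_byte(value: int) -> str:
    """Describe the role a single byte can play in UTF-8, judged by its top bits."""
    if value < 0x80:
        return "ascii"            # 0xxxxxxx
    if value < 0xC0:
        return "continuation"     # 10xxxxxx
    if value < 0xE0:
        return "lead of 2 bytes"  # 110xxxxx
    if value < 0xF0:
        return "lead of 3 bytes"  # 1110xxxx
    if value < 0xF8:
        return "lead of 4 bytes"  # 11110xxx
    return "invalid"              # 11111xxx never occurs in UTF-8


# The three bit patterns mentioned in this thread.
for bits in ("00111011", "10001100", "11011000"):
    value = int(bits, 2)
    print(bits, hex(value), classify_utf8_byte(value))
```

So 00111011 is the ASCII semicolon, 10001100 is a continuation byte that closes a preceding multi-byte character, and 11011000 (0xd8) opens a two-byte character.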
I wonder if we should just commit this change to ISO-8859-1 for everyone. Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++? Or does something else look odd then?
By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?
With that change, --dry_run runs, but the UTF-8 characters (Persian text) in some people's face tags become unreadable during INFO: Creating digiKam person tag (...). Strangely, some other tags that also contain Farsi text are fine.
When reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml, the Persian text is always readable.
Also, after running the same command without --dry_run, I noticed this:
INFO: Creating database backup at %s
Where's %s?
And it gets terminated by this error, after a DEBUG: self_contact_to_tag={(...)}:
Traceback (most recent call last):
File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 65, in <module>
main()
File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 55, in main
migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 107, in migrate_directories_under
contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 154, in migrate_directory
assert contact_to_tag[contact_id] == tag_id
AssertionError
Can the .picasa.ini file be posted here?
Do you see that anywhere around there?
The nearest one at position 4445.
Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++?
No. It only looks correct with UTF-8.
Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?
Can the .picasa.ini file be posted here?
Unfortunately no, unless I put some dummy text in there which would make it useless to post.
Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?
If you mean to load the .picasa.ini file, UTF-8 is selected by default, but other options are ANSI, UTF-16 LE, UTF-16 BE, and UTF-8 with BOM.
@UtopianElectronics Can you create a sharable .picasa.ini file that has the same issues?
Yes, I was referring to the .picasa.ini file. Notepad lists the encoding of the current file when saving, so the file is UTF-8.
When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?
Also, after running the same command without --dry_run , I noticed this:
INFO: Creating database backup at %s
That was already fixed: https://github.com/Philipp91/picasa2digikam/commit/c941174a139cab19cbce7790724e786423fc4f4c
The fact that loading with ISO-8859-1 in Python works but then some other characters are messed up can only mean one of two things, I believe: Either the file legitimately contains multiple different encodings, which would be quite the hassle to deal with, or it's meant to be UTF-8 but somehow a few invalid characters ended up in there. I think we should find out what happens around that 0xd8 byte. Does that byte make sense in ISO-8859-1 encoding, or is it a garbage byte no matter how one would interpret it?
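As background on why ISO-8859-1 "works": Python's ISO-8859-1 codec maps every one of the 256 byte values to a character, so decoding never raises, even when the result is nonsense. A minimal sketch with a made-up sample (one Arabic letter):

```python
# Arabic dal (U+062F) as UTF-8 bytes: b"\xd8\xaf".
utf8_bytes = "\u062f".encode("utf-8")

# Decoding as ISO-8859-1 never fails, but yields two unrelated Latin characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(repr(as_latin1))

# The mix-up is reversible as long as the original bytes were valid UTF-8.
restored = as_latin1.encode("iso-8859-1").decode("utf-8")

# Every possible byte value decodes to exactly one character.
decoded_all = bytes(range(256)).decode("iso-8859-1")
```

This would be consistent with the run getting past the read step while the Persian names come out unreadable, though it does not explain why some tags still looked fine.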
The nearest one at position 4445.
That's pretty close actually. The discrepancy could be caused by one system counting bytes and the other counting characters. If no other 0xd8 byte is in the vicinity, it's safe to assume it's that one. So what's the context there, i.e. what do the surrounding bytes mean in ASCII? Is it important information and can we deduce something about the meaning of the 0xd8 byte from that?
I am new to encoding, but it looks like this is not a straightforward issue. It looks like detecting the actual encoding is non-trivial. This file is probably not UTF-8, but is being reported as such.
When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?
UTF-8.
the file legitimately contains multiple different encodings
Not sure about that, but I don't think it's the case.
Does that byte make sense in ISO-8859-1 encoding
When I select ISO-8859-1 in Notepad++ (Encoding > Character sets > Western European > ISO 8859-1), position 4451 changes place and moves to the start of the 16-character string at the beginning of another line.
I deleted the line in .picasa.ini that had the faulty byte at position 4451, plus the two lines before and after it, but still it says UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 4451: invalid continuation byte. Is it really about .picasa.ini, or is it referring to position 4451 somewhere else?
Is this doable here in this code? How do I test it?
Just to double-check, the file in question should be "D:\gallery\2019\.picasa.ini".
Is this doable here in this code?
Well, maybe.
picasa2digikam uses the configparser library. You can open a Python shell and hopefully reproduce that same error with these few lines:
import configparser
ini = configparser.ConfigParser(strict=False)
ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')
To plug in the codecs package with that errors='ignore' workaround, try this:
import configparser
import codecs
ini = configparser.ConfigParser(strict=False)
with codecs.open("D:\\gallery\\2019\\.picasa.ini", 'r', encoding='utf-8', errors='ignore') as fdata:
    ini.read_file(fdata)
This service promises client-side (i.e. privacy-preserving) UTF-8 validation: https://onlineutf8tools.com/validate-utf8
position 4451 changes place
Then Notepad++ is clearly counting characters, not bytes. Whereas the error message from Python is most likely based on counting bytes. That explains the discrepancy.
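To make the two ways of counting concrete, here is a sketch that converts a byte offset into the number of characters before it. The helper name byte_to_char_offset and the sample string are made up, and editors may additionally differ in 0-based versus 1-based counting or in how they count line endings.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Return how many characters precede the given byte offset in UTF-8 data."""
    # errors="replace" keeps this usable even if the prefix is not valid UTF-8.
    return len(data[:byte_offset].decode("utf-8", errors="replace"))


# Made-up sample: two Arabic letters (2 bytes each), then a semicolon.
sample = "\u062f\u064a;abc".encode("utf-8")
byte_pos = sample.index(b";")                     # offset when counting bytes
char_pos = byte_to_char_offset(sample, byte_pos)  # offset when counting characters
print(byte_pos, char_pos)
```

Each preceding two-byte character widens the gap between the two offsets by one, which fits the observation that Arabic/Persian characters advance the position by 2.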
You can try snip a section around the byte in question like this:
with open("D:\\gallery\\2019\\.picasa.ini", "rb") as f:
    d = f.read()
print(d[4400:4500])  # Print 100 bytes around the problem byte. If this turns out non-sensitive, you can post it here.
assert d[4451] == 0xd8  # Make sure we understood the offset right
Or decode it like this, which would presumably fail with a similar error to the one raised when the data is decoded during file loading:
d.decode('utf-8')
d[4400:4500].decode('utf-8')
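Instead of guessing at a slice, one could also list every offending byte offset in the data. This is a sketch only: find_invalid_utf8 is a hypothetical helper (not part of picasa2digikam), shown here on made-up bytes. It re-slices the data after each error, which is fine for a file with a handful of bad bytes.

```python
def find_invalid_utf8(data: bytes) -> list[int]:
    """Return the byte offsets at which UTF-8 decoding of data fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            bad.append(pos + err.start)
            pos += err.end
    return bad


# Made-up sample: a dangling 0xd8 lead byte and a stray 0xff.
print(find_invalid_utf8(b"ab\xd8;cd\xff"))
```

For a real file, read it in binary mode first (as in the snippet above) and pass the resulting bytes to the helper.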
ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')
It just outputs ['D:\\gallery\\2019\\.picasa.ini'] and no errors. Not sure what it means.
Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.
To plug in the codecs package with that errors='ignore' workaround, try this
Tried it and it shows nothing.
This service promises client-side (i.e. privacy-preserving) UTF-8 validation
It says it's valid.
as f
Shouldn't it be as fh? Because it gives an error: "NameError: name 'fh' is not defined. Did you mean: 'f'?" I tried running it with as fh and it gives some characters and an AssertionError at the end.
d[4400:4500].decode('utf-8')
It decodes everything smoothly without any problem, and it showed the same characters as Notepad++. I used it like this:
with open("D:\\gallery\\2019\\.picasa.ini", "rb") as fh:
    d = fh.read()
print(d[4400:4500].decode('utf-8'))
There are multiple subdirectories (folders) inside 2019. Could that be causing a problem?
A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.
Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.
Shouldn't it be as fh?
Yeah, those were just some typos on my part, sorry.
It just outputs ['D:\gallery\2019\.picasa.ini'] and no errors. Not sure what it means.
Tried it and it shows nothing.
After this, the file has been read, so apparently it did succeed in loading the file. You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded.
It's plausible that the attempt with the codecs package and errors='ignore' went through, but I'm surprised that apparently the attempt with just ini.read("D:\\gallery\\2019\\.picasa.ini", encoding='utf8') threw no errors either. That's pretty much what picasa2digikam also runs (or so I believed) when it runs into this 4451 error. Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?
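For reference, ConfigParser.read() returns the list of files it parsed successfully, so getting the path back means the file was read. In a script (unlike the interactive shell) nothing is displayed unless it is printed. A self-contained sketch with made-up contents standing in for a real .picasa.ini:

```python
import configparser

ini = configparser.ConfigParser(strict=False)

# Made-up stand-in for file contents; the key and name are hypothetical.
sample = "[Contacts2]\n0123456789abcdef=Example Name;;\n"
ini.read_string(sample)

sections = ini.sections()
contacts = list(ini["Contacts2"].items())

# In a script, results only show up when printed explicitly.
print(sections)
print(contacts)
```

With the real file, the read_string call would be replaced by the ini.read(...) call from above, followed by the same print statements.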
A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.
Yeah, something is fishy here. Perhaps picasa2digikam doesn't load the ini file as intended. How about:
import configparser
import pathlib
ini = configparser.ConfigParser(strict=False)
ini.read(pathlib.Path("D:\\gallery\\2019\\.picasa.ini"), encoding='utf8')
print(list(ini.items()))
This should really be 100% what picasa2digikam calls when that 4451 error happens.
Perhaps that file has some restrictions on it that make it impossible for picasa2digikam to access (it is a hidden file after all) and then it instead receives some error message that has a non-ASCII character at 4451? You could try patching the following into migrator.py above the ini.read(... line:
with open(ini_file, "rb") as f:
    print(f'Here comes {ini_file}:')
    print(f.read())
I'd expect this to print the whole file's contents onto the console, but perhaps we get something else (like the supposed error message) instead.
You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded. Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?
Again, it outputs nothing, or maybe I'm doing it wrong? What should be in the code before them?
This should really be 100% what picasa2digikam calls when that 4451 error happens.
It works fine and without any error, also no unreadable characters in the output.
You could try patching the following into migrator.py above the ini.read(... line:
There are two ini.read(... instances, which one do you mean? Also, what should the indentation be exactly? Because I'm getting some indentation errors and tried fixing them, but I don't know if I changed the meaning of the code. Regardless of that, I still get that old error message.
I meant like this: https://github.com/Philipp91/picasa2digikam/pull/15
So I ran gh pr checkout 15 and then python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv. Here's the output:
INFO: Now migrating D:\gallery\2019
Here comes D:\gallery\2019\.picasa.ini in binary:
It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess is because of it being binary?
Then:
That was D:\gallery\2019\.picasa.ini.
Here comes D:\gallery\2019\.picasa.ini in UTF-8:
Traceback (most recent call last):
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
print(f.read())
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 602467: invalid continuation byte
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
main()
File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".
Huh, what's up with that position suddenly jumping to 602467? Wasn't it 4451 before? Is the file even that long (0.6MB)?
It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess it's because of being binary?
Yes, that's okay, as long as the other characters (I assume most of the ini file is regular ASCII stuff) are output normally.
How does the end look, i.e. shortly before the That was D:\gallery\2019\.picasa.ini. bit? Does it actually output precisely the end of your ini file too?
I've updated the patch. I guess you can get it with git pull or so, perhaps with -f. Now it also prints the length of the string and it decodes it after reading it as binary, let's see if that also fails or succeeds.
602467
Checked it with Notepad++. It was a part of a file name, and it was actually shown as one of those strange symbols that Notepad++ shows if you open for example an image file. I deleted that single character and the code now seems to work fine.
Strangely enough, I saved the file to another location and when I opened it, that strange symbol was changed to a readable character. So it was probably an encoding bug or something by Picasa because the actual file doesn't have that extra character in its name (I might have renamed it outside Picasa).
Wasn't it 4451 before?
I think 4451 was for another .picasa.ini file, the one at D:\gallery.
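Since more than one .picasa.ini turned out to be involved, a quick scan of the whole tree can show which files fail UTF-8 decoding and where. The sketch below is not picasa2digikam code: find_bad_ini_files is a hypothetical helper, and it is demonstrated on a temporary directory with made-up contents.

```python
import os
import tempfile


def find_bad_ini_files(root: str) -> list[tuple[str, int]]:
    """Return (path, byte offset of first invalid byte) for each bad .picasa.ini."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".picasa.ini" not in filenames:
            continue
        path = os.path.join(dirpath, ".picasa.ini")
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((path, err.start))
    return bad


# Demo on a throwaway tree: one broken file at the root, one valid file below it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2019"))
    with open(os.path.join(root, ".picasa.ini"), "wb") as f:
        f.write(b"[Contacts2]\nabc=Name\xd8;;\n")
    with open(os.path.join(root, "2019", ".picasa.ini"), "wb") as f:
        f.write("[Contacts2]\nabc=\u062f\u064a;;\n".encode("utf-8"))
    results = [(os.path.relpath(p, root), pos) for p, pos in find_bad_ini_files(root)]

print(results)
```

Pointing find_bad_ini_files at the photo root (for example "D:\\gallery") would report each affected file with its own byte position, which avoids attributing a position to the wrong file.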
Is the file even that long (0.6MB)?
Yes, it's 595 KB.
Does it actually output precisely the end of your ini file too?
Yes.
I've updated the patch.
As the previous one seems to be working, I'm now going to check further if it's really working. I'll share the findings here.
When I want to export the output of the command to a text file using either > log.txt or | echo > log.txt, the program gets terminated with errors. The log file ends with Here comes D:\gallery\.picasa.ini in UTF-8: and the non-ASCII UTF-8 characters in the log file are again in the format of "UTF-8 (in literal)" as shown here too.
Here's one of the many errors in the output (not in the log file):
--- Logging error ---
Traceback (most recent call last):
File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\logging\__init__.py", line 1113, in emit
stream.write(msg + self.terminator)
File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 63-70: character maps to <undefined>
Call stack:
File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
main()
File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 74, in migrate_directories_under
logging.debug(f'{contact.attrib}')
Message: "{'id': 'e5ce9e6c386f84fa', 'name': '[REDACTED]', 'modified_time': '2022-01-18T13:03:09+03:30', 'local_contact': '1'}"
Arguments: ()
Traceback (most recent call last):
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
print(f.read())
File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
main()
File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\.picasa.ini".
Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?
When I want to export the output of the command to a text file
And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?
cp1252.py
Looks like it's trying to log with non-UTF-8 too. Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.
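For illustration only, and not necessarily what the fix does: the cp1252 codec has no mapping for Persian letters, so a stream that uses it cannot encode such log messages, whereas a UTF-8 stream can. The sketch uses an in-memory stream and a hypothetical logger name (encoding_demo); on a real console, sys.stdout.reconfigure(encoding="utf-8") is one known way to switch the encoding.

```python
import io
import logging

# Made-up sample text: two Persian letters joined by a zero-width non-joiner.
persian = "\u06cc\u200c\u06a9"

# cp1252 cannot represent these characters at all.
try:
    persian.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# A UTF-8 text stream (here backed by memory) accepts the same message.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
handler = logging.StreamHandler(stream)

logger = logging.getLogger("encoding_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.propagate = False

logger.debug("name=%s", persian)
handler.flush()
written = buffer.getvalue().decode("utf-8")
print(cp1252_ok, repr(written))
```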
Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?
Yes. However, I suggest a workaround that would automatically ignore those characters without having to manually remove them. Here it mentions errors='ignore' but I'm not sure if it could also be an option for picasa2digikam.
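One possible shape for such a workaround, as a sketch rather than a proposal for the actual code base: decode with errors='replace' instead of errors='ignore', so that damaged bytes stay visible as U+FFFD and can be counted. parse_ini_lenient is a hypothetical helper, and the input bytes are made up.

```python
import configparser


def parse_ini_lenient(raw: bytes) -> tuple[configparser.ConfigParser, int]:
    """Parse ini bytes as UTF-8, replacing undecodable bytes instead of failing."""
    text = raw.decode("utf-8", errors="replace")
    # Counts U+FFFD characters; this would also count any already present in the file.
    replaced = text.count("\ufffd")
    ini = configparser.ConfigParser(strict=False)
    ini.read_string(text)
    return ini, replaced


# Made-up sample with one stray 0xd8 byte inside a value.
ini, replaced = parse_ini_lenient(b"[Contacts2]\nabc=Name\xd8;;\n")
print(replaced, repr(ini["Contacts2"]["abc"]))
```

Reporting the count gives the user a chance to notice damage, which silently dropping bytes with errors='ignore' would hide.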
And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?
Well, I'd like to debug anything and help to make this program as flawless as it could be! But isn't the encoding issue in the dry-run already fixed? Yes, I want to carefully examine the log.
Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.
It's a shame that I'm not very familiar with git. How exactly can I keep the current code and try the new commit without losing the previous version?
How can I exactly keep the current code
So I assume you don't want to lose it. Then it's best to give it a name, which in Git is a branch (or a tag). If you've made local modifications (git status has non-empty output), you need to commit them first. Then you can do git checkout -b thisworks to create a branch, or git tag thisworks to create a tag, with thisworks being a name you'll understand in the future. You can find those again with git branch or git tag. Then to apply the patch on top, download all the new commits (git fetch) so that they become known locally, and then run git cherry-pick 676e50f9064a3e308532a926d21711a6138b0c94.
I can't reproduce this. Which log output is this referring to? The one you (only) get from #15 (which I don't intend to merge ever)? Or could you identify another logging.info() or logging.debug() statement that produces this log output?
And why do you care? Besides reading the log file, do you have another use case where you need the characters to be output correctly? When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?
I had just exported the command output to a text file, and noticed this:
INFO: Traversing input directories
DEBUG: Reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml
DEBUG: {'id': '2c112b01a7d580c5', 'name': '[REDACTED] ي\u200cک [REDACTED]', 'modified_time': '2022-01-24T16:27:25+03:30', 'local_contact': '1'}
It's after the recent patch. I don't know if the older ones would result in this as well.
And why do you care?
I don't. I just thought maybe it would result in errors later at final steps.
When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?
It's more than 71000 lines and I can't scroll back to it on the terminal despite changing the buffer size to 100000. However, by pressing the Pause/Break key, it shows that it's the same on the terminal as well.
I think it's intended that log output uses \u200c instead of the proper representation. After all, it's meant for debugging purposes and not for end-user output. So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.
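A short sketch of where the \u200c in the log comes from, using a made-up name: formatting a dict applies repr() to its values, and repr() escapes non-printable characters such as the zero-width non-joiner (U+200C) while leaving printable Arabic letters as they are.

```python
# Made-up sample: two letters joined by a zero-width non-joiner (U+200C).
name = "\u064a\u200c\u06a9"
attrib = {"name": name}

# Formatting a dict uses repr() for the values, which escapes U+200C.
logged = f"{attrib}"
print(logged)

# The string itself is unchanged; only its debug representation differs.
print(len(name), "\u200c".isprintable())
```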
I suggest a workaround that would automatically ignore those characters without having to manually remove them.
How many characters in total were affected in your case, and across how many different files? If just a single bit flipped on your disk, I'm inclined to call that a random coincidence and wouldn't change the code.
So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.
That's fine. Thank you.
How many characters in total were affected in your case, and across how many different files?
Just one character and one file.
This issue seems to be fixed by now. I'm closing it for now.
After applying this patch and by running python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv, I get this error message (excerpt from the whole output):
I have no idea what <frozen codecs> means, and why it's mentioned as a file. Looks like the RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini". error is related to the patch.