Exception when returning non-UTF-8 model names

ghost commented 4 years ago

I have a device with model name b"MMC H8G4a\x92", this is an 8GB Hynix eMMC module on a Dell/Wyse 3040 thin client.

https://github.com/dcantrell/pyparted/blob/96213843607faa3919b7a7fdd0b6194e9687a23d/src/parted/device.py#L78

This raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 9: invalid start byte when retrieved.

I'm not sure if there is some spec this is violating, or if the UTF-8 assumption is wrong.
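
For reference, the failure can be reproduced without the hardware. This is a minimal sketch that assumes only the byte string quoted above; the variable names are illustrative and nothing here is pyparted code. It also shows why the message says "invalid start byte": 0x92 has the bit pattern of a UTF-8 continuation byte, which cannot begin a character.

```python
# Model name bytes as reported for the eMMC module above.
raw_model = b"MMC H8G4a\x92"

failed_at = None
try:
    raw_model.decode("utf-8")  # strict decoding, the default
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first undecodable byte.
    failed_at = exc.start

bad_byte = raw_model[failed_at]

# Bytes of the form 0b10xxxxxx are continuation bytes in UTF-8,
# so one of them cannot start a new character.
is_continuation = (bad_byte & 0b1100_0000) == 0b1000_0000

print(failed_at, hex(bad_byte), is_continuation)  # 9 0x92 True
```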

dcantrell commented 3 years ago

All strings in Python 3 are Unicode, which can be kind of a pain when coming from Python 2. In this case, the code you are looking at returns the strings as libparted gives them to us, which is not necessarily what Python is expecting. I may need to wrap things in str() or something like that.

What happens if you wrap self.__device.model in str()?

ghost commented 3 years ago

Same error, seems to be deeper than the line I linked, but that means it's in C?

Python 3.6.8 (default, Apr 16 2020, 01:36:27)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import parted
>>> parted.getDevice('/dev/mmcblk0')
<parted.device.Device object at 0x7f83d58dc240>
>>> parted.getDevice('/dev/mmcblk0').model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/parted/device.py", line 77, in model
    return str(self.__device.model)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 9: invalid start byte

I just installed pyparted via yum, so this is not the same version as the original error, but it seems to reproduce fine. I'm currently using python3-pyparted-3.11.0-13.el8.x86_64 on CentOS 8, not sure what the git commit of that is. I can use a pypi version or source if that would be helpful at all.

>>> parted.getDevice('/dev/mmcblk0')._Device__device
<_ped.Device object at 0x7fd10d4d2488>
>>> parted.getDevice('/dev/mmcblk0')._Device__device.model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 9: invalid start byte

Definitely looks like a C error.

Maybe here: https://github.com/dcantrell/pyparted/blob/f8a04f70c9015784071ca829dd94c8486f23e278/src/pydevice.c#L258 Should this be assuming unicode or return bytes instead (or catch the exception, not sure how that's done in C)? My C skills are fairly basic though, so not sure how to make that happen.

dcantrell commented 3 years ago

The version you're using is fine. I wrote pyparted and continue to be upstream. And I maintain the packages for it in Fedora, RHEL, and by extension CentOS. Installing from pypi actually makes it harder for me to debug.

I think you're right in that the _ped module in C needs to be doing some string manipulation here before passing to Python. libpython may actually provide some utility functions to handle this. Let me check...

So from the documentation it looks like PyUnicode_FromString() is expecting the argument to be a UTF-8 encoded string. We are not getting that from libparted. We get a plain char * string back, with no guaranteed encoding. So pyparted probably needs to use PyUnicode_FromFormat() instead. Something like:

return PyUnicode_FromFormat("%s", self->model);

If you can, try changing that in the pyparted source (and see if there are other occurrences of PyUnicode_FromString to change), rebuild, and give it a try again.

ghost commented 3 years ago

Yep, that's sufficient.

Using pyparted master and this change:

diff --git a/src/pydevice.c b/src/pydevice.c
index 20f492e..31f4612 100644
--- a/src/pydevice.c
+++ b/src/pydevice.c
@@ -255,7 +255,7 @@ PyObject *_ped_Device_get(_ped_Device *self, void *closure) {

     if (!strcmp(member, "model")) {
         if (self->model != NULL)
-            return PyUnicode_FromString(self->model);
+            return PyUnicode_FromFormat("%s", self->model);
         else
             return PyUnicode_FromString("");
     } else if (!strcmp(member, "path")) {

# python3
Python 3.6.8 (default, Apr 16 2020, 01:36:27)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import parted
>>> parted.getDevice('/dev/mmcblk0').model
'MMC H8G4a�'
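
The U+FFFD replacement character in that output suggests the %s conversion decodes leniently rather than strictly. A rough Python analogy follows, assuming PyUnicode_FromString behaves like a strict UTF-8 decode and PyUnicode_FromFormat("%s", ...) like a decode with errors="replace". The two helper names are made up for illustration and are not part of pyparted or the C API.

```python
def from_string_like(raw: bytes) -> str:
    """Hypothetical stand-in for PyUnicode_FromString: strict UTF-8."""
    return raw.decode("utf-8")


def from_format_like(raw: bytes) -> str:
    """Hypothetical stand-in for PyUnicode_FromFormat("%s", ...):
    undecodable bytes become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


raw_model = b"MMC H8G4a\x92"

try:
    from_string_like(raw_model)
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)                          # False
print(repr(from_format_like(raw_model)))  # 'MMC H8G4a\ufffd'
```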

The hardest part was figuring out that the PowerTools repository was a thing in CentOS 8.

paridoth commented 2 years ago

Hi, sorry to comment on this after so long. I picked up a Wyse 3040 off eBay with the intent to use it as a Pi-hole and NAS for my home network. While attempting to install Linux, I ran into issues every time it got to the formatting step. From what I can understand of your comments back and forth, it looks related to this bug. Is there something I can do to fix it? I don't have a huge amount of knowledge on the subject, but was hoping you could share some insight.

ghost commented 2 years ago

@paridoth As I remember, I ended up dropping into the CLI and editing /usr/lib64/python3.6/site-packages/parted/device.py before attempting partitioning in the setup, and then it seemed to work fine.

Basically, I changed:

    @property
    def model(self):
        """Model name and vendor of this device."""
        return self.__device.model

To:

    @property
    def model(self):
        """Model name and vendor of this device."""
        try:
            return self.__device.model
        except UnicodeDecodeError:
            return "MMC H8G4a"

Definitely not the proper way to fix it, but it'll get you through the installer at least and seems to work fine without the device identifier being correct.
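
A variant that avoids hard-coding the model name is sketched below. It assumes the UnicodeDecodeError raised by the C layer carries the original bytes in its object attribute, as that exception type normally does; this was not verified against _ped. FakeCDevice is a hypothetical stand-in for _ped.Device so the snippet runs without the hardware, and the Device class here is a cut-down illustration, not the real parted.Device.

```python
class FakeCDevice:
    """Hypothetical stand-in for _ped.Device with an undecodable model."""

    @property
    def model(self):
        raise UnicodeDecodeError(
            "utf-8", b"MMC H8G4a\x92", 9, 10, "invalid start byte"
        )


class Device:
    """Cut-down illustration of the wrapper in parted/device.py."""

    def __init__(self, c_device):
        self.__device = c_device

    @property
    def model(self):
        """Model name and vendor of this device."""
        try:
            return self.__device.model
        except UnicodeDecodeError as exc:
            # Assumption: exc.object holds the bytes that failed to decode.
            return exc.object.decode("utf-8", errors="replace")


print(repr(Device(FakeCDevice()).model))  # 'MMC H8G4a\ufffd'
```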

If that still isn't working for you, let me know, I can reinstall and document my process a bit better.

paridoth commented 2 years ago

Awesome I'll give that a shot tonight! I appreciate it!

paridoth commented 2 years ago

Alright, so I am using the Linux Mint 20 Xfce edition live installer. My device.py is under "/usr/lib/python3/dist-packages/parted"; I assume this is still the correct one to edit. I found the above-mentioned entry and replaced it, but I get this error:

Traceback (most recent call last):
  File "/usr/lib/ubiquity/ubiquity/frontend/gtk_ui.py", line 843, in <lambda>
    lambda: self.dbfilter.start(auto_process=True))
  File "/usr/lib/ubiquity/ubiquity/filteredcommand.py", line 103, in start
    prep = self.prepare()
  File "/usr/lib/ubiquity/plugins/ubi-prepare.py", line 478, in prepare
    self.setup_sufficient_space()
  File "/usr/lib/ubiquity/plugins/ubi-prepare.py", line 503, in setup_sufficient_space
    free = self.free_space()
  File "/usr/lib/ubiquity/plugins/ubi-prepare.py", line 517, in free_space
    devices = proc.communicate()[0].rstrip('\n').split('\n')
  File "/usr/lib/python3.8/subprocess.py", line 1015, in communicate
    stdout = self.stdout.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 83: invalid start byte

Do I need to find a different device.py perhaps, or maybe edit something in the installer itself?

ghost commented 2 years ago

It doesn't appear to be a pyparted issue at this point; the traceback shows the installer failing while decoding the output of a shell command. It is still caused by the eMMC device name not being valid UTF-8, and the code handling it just happens to be in Python as well.

Try replacing:

            proc = subprocess.Popen(
                ['parted_devices'],
                stdout=subprocess.PIPE, universal_newlines=True)

With:

            # Raw string, so that tr (not Python) interprets the octal escape \222.
            proc = subprocess.Popen(
                r'parted_devices | tr -d "\222"',
                stdout=subprocess.PIPE, universal_newlines=True, shell=True)

tr -d will strip characters, and \222 is octal for \x92 in hex, since tr only accepts special characters in octal.
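
An alternative to piping through tr is to let Python decode the output leniently. This is a minimal sketch, assuming Python 3.6 or later, where Popen accepts the encoding and errors arguments. read_command_output is a hypothetical helper, and the fake command with its printed line is invented test data standing in for parted_devices, not its real output format.

```python
import subprocess
import sys


def read_command_output(argv):
    """Run argv and return stdout as text, replacing undecodable bytes."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        encoding="utf-8",
        errors="replace",  # bad bytes become U+FFFD instead of raising
    )
    return proc.communicate()[0]


# Stand-in for ['parted_devices']: a child process that writes one line
# containing the problematic 0x92 byte.
fake_parted_devices = [
    sys.executable,
    "-c",
    r"import sys; sys.stdout.buffer.write(b'/dev/mmcblk0\tMMC H8G4a\x92\n')",
]

devices = read_command_output(fake_parted_devices).rstrip("\n").split("\n")
print(devices)
```

Unlike the tr approach, this needs no shell and keeps a visible marker where the bad byte was, at the cost of passing U+FFFD on to whatever consumes the device list.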

paridoth commented 2 years ago

Thanks for sticking with me and helping out! I am trying to figure out which file contains the code I should replace. I tried device.py and codecs.py but couldn't find any matching code.

Edit: I found it in "ubi-prepare.py" and it worked! Until it didn't: partman crashed with error code 141, but at least it's something different. I am trying to get the full error message now.

Edit 2: I wasn't able to copy the error, so I took a screenshot: https://ibb.co/7pjhv8t

jstasiak commented 1 month ago

Hey all, #108 addresses the issue with the PyUnicode_FromFormat workaround.

dcantrell / pyparted

Exception when returning non-UTF-8 model names #76