P403n1x87 / austin-python

Python wrapper for Austin, the CPython frame stack sampler.
GNU General Public License v3.0

mojo2austin expecting utf-8, found latin-1 #30

Open dooferlad opened 2 months ago

dooferlad commented 2 months ago

Description

Running mojo2austin on a file I just generated gives an error:

Traceback (most recent call last):
  File "/home/dooferlad/.venv/bin/mojo2austin", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dooferlad/.venv/lib/python3.11/site-packages/austin/format/mojo.py", line 541, in main
    for event in MojoFile(mojo).parse():
  File "/home/dooferlad/.venv/lib/python3.11/site-packages/austin/format/mojo.py", line 507, in parse
    for e in self.parse_event():
  File "/home/dooferlad/.venv/lib/python3.11/site-packages/austin/format/mojo.py", line 492, in parse_event
    for event in t.cast(dict, self.__handlers__)[event_id](self):
  File "/home/dooferlad/.venv/lib/python3.11/site-packages/austin/format/mojo.py", line 469, in parse_string
    value = self.read_string()
            ^^^^^^^^^^^^^^^^^^
  File "/home/dooferlad/.venv/lib/python3.11/site-packages/austin/format/mojo.py", line 331, in read_string
    return self.read_until().decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

Steps to Reproduce

  1. Run a Django project: austin --output /home/dooferlad/supportsite.austin --binary --heap=2048 ./manage.py runserver 8001 --noreload --skip-checks
  2. mojo2austin /home/dooferlad/supportsite.austin /home/dooferlad/supportsite.austin-txt

Versions

My environment is set up with:

LANGUAGE=en_GB.UTF-8
LANG=en_GB.UTF-8

Additional Information

To get mojo2austin and austin2speedscope to work, I made two changes. The first is in austin/format/mojo.py at line 331:

    def read_string(self) -> str:
        """Read a string from the MOJO file."""
        return self.read_until().decode(encoding="latin-1")

The second is in austin/stats.py at line 419:

    def __enter__(self) -> "AustinFileReader":
        """Open the Austin file and read the metadata."""
        self._stream = open(self.file, encoding="latin-1")

I assume that the string in the Mojo file is from the Python application, but I don't actually know. I am not sure if the above change is actually a fix or just masking the real bug!
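On the question of whether the latin-1 change masks the problem: a minimal sketch, using made-up bytes rather than data from the actual profile, of how the two codecs treat the byte from the traceback. Latin-1 maps every byte value to a character, so decoding with it can never raise, which means corrupted data passes through silently.

```python
from typing import Optional

# Hypothetical bytes for illustration: 0xfc sits at position 1, as in the traceback.
raw = b"A\xfcber"


def try_utf8(data: bytes) -> Optional[str]:
    """Return the UTF-8 decoding of data, or None if it is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Strict UTF-8 rejects the input because 0xfc is never a valid start byte.
strict = try_utf8(raw)

# Latin-1 accepts any byte sequence, so this succeeds whether or not the
# data was ever meant to be text.
latin = raw.decode("latin-1")

# errors="replace" also succeeds, but marks the bad byte with U+FFFD so the
# damage stays visible in the output.
replaced = raw.decode("utf-8", errors="replace")
```

If the strings really were written as latin-1, the first change would be a fix; if the bytes come from bad samples, it only hides them.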

P403n1x87 commented 2 months ago

@dooferlad thanks for reporting this. Does this happen with every MOJO file generated by Austin? Some files can be corrupted by bad samples, so it's worth trying to collect them again.

dooferlad commented 2 months ago

It definitely is happening every time for this project. I get a lot of invalid samples, so I suppose this is something I just have to live with?

⌛ Sampling duration : 14.00 s
⏱️  Frame sampling (min/avg/max) : 24/208/20201 μs
🐢 Long sampling rate : 438/12575 (3.48 %) samples took longer than the sampling interval to collect
💀 Error rate : 2867/12575 (22.80 %) invalid samples

dooferlad commented 2 months ago

Of course, I say that, and then I tried the latest GitHub release instead of the latest snap and the conversion worked. I still have a lot of invalid samples though!

dooferlad commented 2 months ago

Yes, it seems like invalid samples are the problem. Re-running multiple times gives me a selection of bytes that can't be decoded as UTF-8 arriving at different positions in the profile. I already have a large heap (4GiB). Is there anything else I can do to reduce errors?
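To check where the undecodable bytes fall from run to run, a small diagnostic along these lines could help. It is only a sketch: `invalid_utf8_offsets` is a hypothetical helper, not part of austin-python, and it only makes sense for the text-format Austin file, since a binary MOJO file legitimately contains non-UTF-8 bytes outside its strings.

```python
def invalid_utf8_offsets(data: bytes) -> list[int]:
    """Return the offset of each byte sequence in data that is not valid UTF-8."""
    offsets: list[int] = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further invalid bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            offsets.append(pos + exc.start)
            pos += exc.end
    return offsets
```

For example, `invalid_utf8_offsets(open(path, "rb").read())` on two captures of the same workload would show whether the bad bytes move around between runs.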

dooferlad commented 2 months ago

FWIW, this does the right thing, I think!

# austin/stats.py line 428
    def __iter__(self) -> Iterator:
        """Iterator over the samples in the Austin file."""

        def _() -> Generator[str, None, None]:
            assert self._stream_iter is not None

            while True:
                try:
                    line = self._stream.readline()
                except UnicodeDecodeError:
                    # Skip data that cannot be decoded (likely an invalid sample).
                    continue
                # A blank line ends the samples; "" means end of file.
                if not line or line == "\n":
                    break
                yield line

            self._read_meta()

        return _()

Would you like me to submit a PR?
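The same idea, skipping lines that fail to decode, can also be sketched against a binary stream, so that each line is decoded on its own and a bad byte costs exactly one line. This is an illustration under assumptions, not the library's code: `iter_sample_lines` is a hypothetical name, and the blank-line terminator is taken from the snippet above.

```python
import io
from typing import BinaryIO, Iterator


def iter_sample_lines(stream: BinaryIO) -> Iterator[str]:
    """Yield decodable sample lines until a blank line or the end of the stream."""
    for raw in stream:
        # A blank line separates the samples from the trailing metadata.
        if raw == b"\n":
            break
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Drop only the line that contains the undecodable bytes.
            continue
        yield line


# Hypothetical content: one corrupted line between two good ones.
demo = io.BytesIO(b"good 1\n\xfcbad 2\nalso good 3\n\nmeta\n")
lines = list(iter_sample_lines(demo))
```

Iterating the stream directly also ends cleanly at end of file, so a truncated capture with no blank line does not need a separate check.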

P403n1x87 commented 1 month ago

@dooferlad yes please, any contribution is very welcome. As for the invalid samples, if you're specifically referring to the stats reported by Austin at the end, there isn't much that can be done about that. That's just the nature of an out-of-process profiler like Austin.