handle weird case of UnicodeEncodeError crash when trying to read image file path

PetervanLunteren commented 11 months ago

I'm not sure why, but I kept getting UnicodeEncodeError crashes when run_detector_batch.py was trying to read im_file in print('Processing image {}'.format(im_file)). See traceback below.

Traceback (most recent call last):
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 1144, in <module>
    main()
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 1110, in main
    results = load_and_run_detector_batch(model_file=args.detector_file,
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 509, in load_and_run_detector_batch
    result = process_image(im_file, detector,
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 311, in process_image
    print('Processing image {}'.format(im_file))
  File "C:\Users\smart\miniforge3\envs\ecoassistcondaenv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf022' in position 115: character maps to <undefined>

I couldn't really handle this error like you're doing with the other exceptions, for example:

            result = {
                'file': im_file,
                'failure': run_detector.FAILURE_IMAGE_OPEN
            }

since im_file is the one giving the exception. Hence I solved it in a rather extreme way by not putting anything in the json file for that particular image. You can decide for yourself if you think that is a good idea. Perhaps you've seen the error before?
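
For readers following along, here is a minimal sketch of what the traceback shows, assuming the path contains U+F022 as reported. The path and the failure string are illustrative, not taken from run_detector_batch.py. It suggests that the string itself is a valid Python str: only encoding it to cp1252 fails, while json.dumps() with its default ensure_ascii=True escapes the character.

```python
import json

# Illustrative path; only the folder name is taken from this thread.
im_file = 'Barab\uf022Obab/PvL_seq_045f2/0001.JPG'

# Encoding to cp1252 (what print() does on a cp1252 stdout) fails...
try:
    im_file.encode('cp1252')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# ...but the string itself is fine, and json.dumps() escapes non-ASCII
# characters by default (ensure_ascii=True), so it can still be stored.
record = json.dumps({'file': im_file, 'failure': 'example failure string'})
print(encode_failed, record)
```

If this holds in the real script, the path could still be written to the results even when printing it fails.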

agentmorris commented 11 months ago

Thanks for reporting this. This seems like a case where there are unusual characters in some filenames... and I've seen plenty of unusual characters in filenames, and haven't hit this issue, so maybe these are like super-duper-unusual? Would it be possible for you to share (by email) the files that caused this issue?

If at all possible, I'd prefer to maintain the invariant that inference can fail for any number of reasons, but every image will be included in the output, so I'm hoping we can catch this a different way.
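
A rough sketch of that invariant, for illustration only. The names run_all and process_fn are hypothetical and are not functions in run_detector_batch.py; the point is just that every input path produces exactly one entry, with a failure field when processing raises.

```python
def run_all(image_paths, process_fn):
    """Return exactly one result dict per input path, even on failure."""
    results = []
    for path in image_paths:
        try:
            result = process_fn(path)
        except Exception as e:
            # Record the failure instead of dropping the image
            result = {'file': path,
                      'failure': 'Failure: {}'.format(type(e).__name__)}
        results.append(result)
    return results
```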

Also, note to selves: is this issue only about the print statement? I.e., does this go away if you use --quiet? That's not a solution, but it would be helpful in debugging the issue.

PetervanLunteren commented 11 months ago

If at all possible, I'd prefer to maintain the invariant that inference can fail for any number of reasons, but every image will be included in the output, so I'm hoping we can catch this a different way.

I can understand that.

I'll further investigate and keep you posted. The problem is that I don't exactly know which file(s) are faulty, since it doesn't print its path. However, I can see the file that MD processed just before the error. Does MD process images in alphabetical order?

agentmorris commented 11 months ago

Yes, run_detector_batch.py enumerates images via find_images(), which sorts images before returning. FWIW that sorted() call is a relatively recent change; until recently, paths were returned in the order that comes out of glob.glob(), which is not guaranteed to be sorted.
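
As a sketch of the ordering point above: glob.glob() makes no promise about order, so wrapping the result in sorted() is what makes the processing order predictable. The helper name list_images_sorted and the extension list are assumptions for illustration; this is not the actual find_images() implementation.

```python
import glob
import os

def list_images_sorted(folder, extensions=('.jpg', '.jpeg', '.png')):
    """Hypothetical helper: recursively list image files in sorted order."""
    # glob.escape() protects folder names that contain glob metacharacters
    pattern = os.path.join(glob.escape(folder), '**', '*')
    paths = [p for p in glob.glob(pattern, recursive=True)
             if os.path.isfile(p) and p.lower().endswith(extensions)]
    # glob.glob() does not guarantee any order, so sort explicitly
    return sorted(paths)
```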

If you can share images with me, I'm happy to take a look; if not, let me know via email if it's possible to just share the list of filenames. If this is in fact a filename issue, I can debug without the image content.

PetervanLunteren commented 11 months ago

Alright. I found the culprit by converting the path to a printable representation using repr(). It was some super-duper weird character in a folder name. We can convert to a printable representation first, and then write the path to the json with a failure like so:

try:
    print('Processing image {}'.format(im_file))
except Exception as e:
    im_file = repr(im_file) # convert to readable characters
    if not quiet:
        print('Image {} contains a special character and can\'t be loaded. Exception: {}'.format(im_file, e))
    result = {
        'file': im_file,
        'failure': run_detector.FAILURE_IMAGE_OPEN # probably need a custom failure here
    }
    return result

That will make sure the image is in the json. However, the path will not be original since it has been converted to a representation. What do you reckon is a good idea? If you want I can share some problem files via email.
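
One possible variation on the idea above, as a sketch rather than a proposed patch: escape the path only for display and keep the original string for the results, so the json still holds the real path. safe_print is a hypothetical helper name, and the path is illustrative.

```python
def safe_print(text):
    """Print text; fall back to an ASCII-escaped form if stdout can't encode it."""
    try:
        print(text)
    except UnicodeEncodeError:
        # ascii() escapes every non-ASCII character, so this always prints
        print(ascii(text))

im_file = 'Barab\uf022Obab/0001.JPG'
safe_print('Processing image {}'.format(im_file))

# The original path is untouched, so it can still go into the results
result = {'file': im_file}
```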

agentmorris commented 11 months ago

Good detective work. Yes, please send problem files by email, and let me know what OS you're working on.

agentmorris commented 11 months ago

Sigh, if only debugging were ever easy. I can't repro, even with the files you sent. I am on Windows 11, but it would be surprising if that were the important difference. A few questions:

Confirm that you're running with run_detector_batch.py (as opposed to calling the Python functions directly)?
What version of Python is the environment in question running? For posterity, I tested with both 3.8.15 and 3.11.
The top-level folder you sent has an unusual character in it... confirm that's the offending character, in which case all files in this folder should fail? I.e., run_detector_batch.py should fail immediately, on the first file in the folder?
Confirm that the language of your Windows installation is set to English?

If none of those questions turn up a difference between your environment and mine, I will track down a Win10 environment, in case that's an important difference.

PetervanLunteren commented 10 months ago

Confirm that you're running with run_detector_batch.py (as opposed to calling the Python functions directly)?

Yes, I call run_detector_batch.py via EcoAssist, which should be the same as via CLI.

What version of Python is the environment in question running? For posterity, I tested with both 3.8.15 and 3.11.

Python 3.8.15

The top-level folder you sent has an unusual character in it... confirm that's the offending character, in which case all files in this folder should fail? I.e., run_detector_batch.py should fail immediately, on the first file in the folder?

Correct. It should fail immediately.

Confirm that the language of your Windows installation is set to English?

I'm sorry. I believe that I told you that I was running Windows 10, but I just checked and I am running Windows 11 Pro. The language is indeed English (United States).

A few more things:

This concerns the special character \uf022. I'm not sure if that remains the same if you upload it to Google Drive and then download it again. Does it look like this on your machine? How do the paths look in the output json?

When I try to run run_detector_batch.py over the problem dir on my M1 mac, it automatically converts the character to \uf022 and handles it correctly. The paths in the json file are "file": "Barab\uf022Obab/PvL_seq_045f2/0001.JPG". Also the postprocessing steps work fine, indicating that it can read these paths without problem.

This is the command and output to run run_detector_batch.py via EcoAssist on my Windows 11.

['C:\\Users\\smart\\miniforge3\\envs\\ecoassistcondaenv\\python.exe', 'C:\\Users\\smart\\EcoAssist_files\\cameratraps\\detection\\run_detector_batch.py', 'C:\\Users\\smart\\EcoAssist_files\\pretrained_models\\md_v5a.0.0.pt', '--output_relative_filenames', '--recursive', 'C:\\Users\\smart\\Desktop\\Barab\uf022Obab', 'C:\\Users\\smart\\Desktop\\Barab\uf022Obab\\image_recognition_file.json']

Fusing layers...
Fusing layers...
Model summary: 574 layers, 139990096 parameters, 0 gradients, 207.9 GFLOPs
Model summary: 574 layers, 139990096 parameters, 0 gradients, 207.9 GFLOPs
123 image files found in the input directory
PyTorch reports 1 available CUDA devices
GPU available: True
Imported YOLOv5 from PYTHONPATH
Using PyTorch version 1.10.1
Sending model to GPU
Loaded model in 3.92 seconds
Loaded model in 3.92 seconds

  0%|          | 0/123 [00:00<?, ?it/s]
  0%|          | 0/123 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 1047, in <module>
    main()
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 1013, in main
    results = load_and_run_detector_batch(model_file=args.detector_file,
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 494, in load_and_run_detector_batch
    result = process_image(im_file, detector,
  File "C:\Users\smart\EcoAssist_files\cameratraps\detection\run_detector_batch.py", line 311, in process_image
    print('Processing image {}'.format(im_file))
  File "C:\Users\smart\miniforge3\envs\ecoassistcondaenv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf022' in position 45: character maps to <undefined>

As you can see, the image folder contains \uf022, which is handled without problems as it starts analysing the dir. The error only pops up when trying to access the image. Apparently accessing the dir is not the problem...

agentmorris commented 10 months ago

Yes, that's the same way the character renders on my system.

The error suggests this isn't an issue with loading or processing the image, rather an issue with printing to the shell. It's plausible that the behavior here would be different depending on the shell environment.

Can you try (a) running with --quiet, and (b) running without --quiet, but at a standard MiniForge or Anaconda prompt, outside of EcoAssist?

Can you also try changing:

print('Processing image {}'.format(im_file))

...to:

s = im_file.encode('cp1252', errors='replace').decode('cp1252')
print('Processing image {}'.format(s))

I suspect this will suppress the error, but it's not really a solution, since you would need to do this everywhere a filename is possibly printed.
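
To make the effect of that round trip concrete, here is a small sketch. console_safe is a hypothetical helper name; the folder name is the one from this thread.

```python
def console_safe(text, encoding='cp1252'):
    """Replace characters that `encoding` can't represent with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)

safe = console_safe('Barab\uf022Obab')
print(safe)
```

On Python 3.7+, calling sys.stdout.reconfigure(errors='replace') once at startup should apply the same substitution to everything printed, assuming sys.stdout is a regular text stream, which might avoid having to touch every print call.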

Also, note to self, I just wanted to make sure that the default locale on my system (where the error doesn't occur) also has cp1252 as the default encoding, so I did this:

python -c "import sys,locale; print(sys.getdefaultencoding() + ' ' + str(locale.getdefaultlocale()))"

...and got this:

utf-8 ('en_US', 'cp1252')
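
A note on that check, with a hedged sketch: sys.getdefaultencoding() is 'utf-8' on Python 3 regardless of locale, so it does not show what print() will use. The value that matters is sys.stdout.encoding, which can differ between an interactive console and a piped stdout. The function name describe_stdio is made up for this example.

```python
import locale
import sys

def describe_stdio():
    """Collect the encodings that matter when print() writes to stdout."""
    return {
        'default_encoding': sys.getdefaultencoding(),
        'preferred_encoding': locale.getpreferredencoding(False),
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        'stdout_errors': getattr(sys.stdout, 'errors', None),
        'stdout_isatty': (sys.stdout.isatty()
                          if hasattr(sys.stdout, 'isatty') else None),
    }

print(describe_stdio())
```

Running this once at a prompt and once from the launching application would show whether the two environments give the child process a different stdout encoding.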

PetervanLunteren commented 10 months ago

I have tried it with a clean slate. Running everything in a MiniForge Prompt.

mkdir c:\git
cd c:\git
git clone https://github.com/agentmorris/MegaDetector
git clone https://github.com/ecologize/yolov5/
cd c:\git\MegaDetector
conda env create --file envs\environment-detector.yml

The weird thing is that now it can't even find the folder, while that wasn't a problem when running it via EcoAssist in the Command Prompt.

Running it with --quiet resulted in:

(cameratraps-detector) c:\git\MegaDetector>python detection\run_detector_batch.py MDV5A "C:\Users\smart\Desktop\Barab\uf022Obab" "C:\Users\smart\Desktop\Barab\uf022Obab\test_output.json" --output_relative_filenames --recursive --quiet
Downloading model MDV5A
Downloading file md_v5a.0.0.pt to C:\Users\smart\AppData\Local\Temp\megadetector_models\md_v5a.0.0.pt...done, 280766885 bytes.
Traceback (most recent call last):
  File "detection\run_detector_batch.py", line 1144, in <module>
    main()
  File "detection\run_detector_batch.py", line 966, in main
    assert os.path.isdir(args.image_file), \
AssertionError: Could not find folder C:\Users\smart\Desktop\Barab\uf022Obab, must supply a folder when --output_relative_filenames is set

Running it without --quiet resulted in:

(cameratraps-detector) c:\git\MegaDetector>python detection\run_detector_batch.py MDV5A "C:\Users\smart\Desktop\Barab\uf022Obab" "C:\Users\smart\Desktop\Barab\uf022Obab\test_output.json" --output_relative_filenames --recursive
Downloading model MDV5A
Bypassing download of already-downloaded file md_v5a.0.0.pt
Traceback (most recent call last):
  File "detection\run_detector_batch.py", line 1144, in <module>
    main()
  File "detection\run_detector_batch.py", line 966, in main
    assert os.path.isdir(args.image_file), \
AssertionError: Could not find folder C:\Users\smart\Desktop\Barab\uf022Obab, must supply a folder when --output_relative_filenames is set

When adding the decode('cp1252') like you said, it resulted in the same error.

(cameratraps-detector) c:\git\MegaDetector>python detection\run_detector_batch.py MDV5A "C:\Users\smart\Desktop\Barab\uf022Obab" "C:\Users\smart\Desktop\Barab\uf022Obab\test_output.json" --output_relative_filenames --recursive
Downloading model MDV5A
Bypassing download of already-downloaded file md_v5a.0.0.pt
Traceback (most recent call last):
  File "detection\run_detector_batch.py", line 1145, in <module>
    main()
  File "detection\run_detector_batch.py", line 967, in main
    assert os.path.isdir(args.image_file), \
AssertionError: Could not find folder C:\Users\smart\Desktop\Barab\uf022Obab, must supply a folder when --output_relative_filenames is set

When I run python -c "import sys,locale; print(sys.getdefaultencoding() + ' ' + str(locale.getdefaultlocale()))", I also get utf-8 ('en_US', 'cp1252').

Also when I run the command in Command Prompt, it can't find the folder. When I remove the special character it works fine both in MiniForge and Command Prompt.

agentmorris commented 10 months ago

I think the "could not find folder" issues are unrelated; it looks like maybe you pasted the folder name into an environment that converted it into \u notation, then pasted into the shell. If you're still up for more experiments, can you instead use tab completion to enter the folder name into the shell for the --quiet experiment, so we know for sure the shell is doing the right thing with that character? This is how this looks in my Win11 Miniforge prompt:

[Screenshot of the command in the Win11 Miniforge prompt]
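
To illustrate the difference between the real character and its pasted \u notation, a short sketch using the folder name from this thread: the first string contains the single character U+F022, the second contains six separate ASCII characters, so the two name different folders.

```python
real = 'Barab\uf022Obab'      # contains the single character U+F022
literal = r'Barab\uf022Obab'  # contains a backslash followed by "uf022"

same = (real == literal)
lengths = (len(real), len(literal))
print(same, lengths)  # prints: False (10, 15)
```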

PetervanLunteren commented 10 months ago

You're absolutely right.

Running the command with and without --quiet worked without any problems in both MiniForge and Command Prompt. This indicates that it has something to do with EcoAssist (sorry!), and most likely with the subprocess.Popen() command.

After some research, I found that "... if you print to a Windows console, the string is internally encoded into the Windows console code page (cp1252). The special character is not represented in that code page. The default console is not really unicode friendly in Windows. There is little to do in a Windows console." link

I guess it'll work if I use MD's Python tools instead of running a subprocess call, so it doesn't have to be printed to the Windows console. For now, I'll just remove the character, and will put it on my to-do list to incorporate MD's Python tools.

Glad we found the issue! And sorry to have bothered you with it...

agentmorris commented 10 months ago

Thanks for debugging this; even if it's specific to EcoAssist, this is really helpful. Actually, this wasn't even specific to EcoAssist: it was specific to launching Python scripts with Popen(), which I do all the time, so I would definitely have hit this issue eventually, even if not with run_detector_batch.py.

Something still seemed off in the explanation from that link. If the link were correct that the console is the problem, running in a standard Miniforge prompt without --quiet should also crash, and it doesn't; instead, the Windows console does a reasonable thing and shows a substitute character. So I did a little more exploring and replicated this issue in a test script that has nothing to do with MegaDetector:

https://github.com/agentmorris/agentmorrispublic/blob/main/character-encoding-test/character-encoding-test.py

I can repro under default conditions, and after trying a few things that I thought would work (in particular, playing with the "encoding" and "errors" parameters to Popen()), I found something that seems to work and may solve your problem without a major refactoring. Before calling Popen(), do this:

environ = os.environ.copy()
environ['PYTHONIOENCODING'] = 'utf-8'

...then add env=environ to your Popen() call. I can't promise you that this doesn't have other side effects, but it solved the problem in the test I linked to above.

Credit to this discussion for this suggestion.
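
For anyone who wants to try this without MegaDetector, here is a self-contained sketch. It is not the linked test script. It forces the child's stdio encoding through PYTHONIOENCODING, using cp1252 as a stand-in for the piped-stdout default seen in this thread, and compares that with utf-8. run_child and CHILD_CODE are names invented for this example.

```python
import os
import subprocess
import sys

# Child script: prints U+F022, written as an escape so the command line is ASCII
CHILD_CODE = "print('Barab\\uf022Obab')"

def run_child(io_encoding):
    """Run the child with PYTHONIOENCODING set; return (returncode, stdout)."""
    environ = os.environ.copy()
    environ['PYTHONIOENCODING'] = io_encoding
    proc = subprocess.Popen([sys.executable, '-c', CHILD_CODE],
                            env=environ,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            encoding='utf-8',
                            errors='replace')
    out, _ = proc.communicate()
    return proc.returncode, out.strip()

# cp1252 stands in for the default piped-stdout encoding seen in this thread
failing = run_child('cp1252')
working = run_child('utf-8')
print(failing, working)
```

With cp1252 the child should exit with a non-zero return code because of the UnicodeEncodeError; with utf-8 it should print the folder name and the parent should decode it correctly.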

PetervanLunteren commented 10 months ago

Dan, you're the man! Even debugging problems outside MD ;)

It worked perfectly when I also added encoding='utf-8' to my Popen() call. This is the working code:

        environ = os.environ.copy()
        environ['PYTHONIOENCODING'] = 'utf-8'
        p = Popen(...
                  env=environ,
                  encoding='utf-8')

Thanks a lot! I appreciate it :)

agentmorris / MegaDetector

handle weird case of UnicodeEncodeError crash when trying to read image file path #118