Issue with decoding .html file

Jessime / youtube_history

A quick analysis of all Youtube videos in a user's history.

MIT License

Issue with decoding .html file #11

Open vgzhn opened 3 years ago

vgzhn commented 3 years ago

Welcome!
Extracting video urls from Takeout.
Traceback (most recent call last):
  File "youtube_history.py", line 369, in <module>
    analysis.run()
  File "youtube_history.py", line 348, in run
    self.download_data()
  File "youtube_history.py", line 155, in download_data
    soup = BeautifulSoup(watch_history.read_text(), 'html.parser')
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\pathlib.py", line 1236, in read_text
    return f.read()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3321: character maps to <undefined>

I've tried locating position 3321 and couldn't find anything obvious to remove. Inserting file = open(filename, errors="ignore") didn't work for me either.

I'm an absolute beginner with Python. Maybe this could be avoided by using the .json Takeout?
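
One way to see what actually sits at the reported position is to look at the raw bytes rather than the decoded text. The following is only a minimal sketch: the helper names `first_undecodable_offset` and `context_around` are hypothetical, the sample bytes are made up, and it assumes the position in the traceback corresponds to a byte offset in the file.

```python
def first_undecodable_offset(data: bytes, encoding: str = "cp1252"):
    """Return the offset of the first byte `encoding` cannot decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte within `data`.
        return exc.start
    return None


def context_around(data: bytes, offset: int, width: int = 10) -> bytes:
    """Return the raw bytes within `width` bytes of `offset`."""
    return data[max(0, offset - width):offset + width + 1]


# Made-up stand-in for the real file contents.
sample = "abc \u201cquoted\u201d".encode("utf-8")
offset = first_undecodable_offset(sample)
print(offset, context_around(sample, offset))
```

For the real file, `sample` could be replaced with something like `Path("watch-history.html").read_bytes()` (the path is an assumption). A run of bytes such as `\xe2\x80\x9d` around the offset would suggest the file is UTF-8 and that nothing needs to be removed from it.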

Jessime commented 3 years ago

Hey, @vgzhn, I'd assume that the .json file would have the same character somewhere. Is there any chance you could share the file/contents? I'm not sure how to debug without playing around with the data.

Luminoxis commented 3 years ago

I had the same error, but changing the encoding in the BeautifulSoup call to UTF-8 seemed to fix it:

soup = BeautifulSoup(watch_history.read_text(encoding='utf-8'), 'html.parser')

After this, the code got about 10,000 results deep before it hit a very similar error (which I unfortunately lost), but for a different position (not 3321). Some googling implied that changing the readline decoding to latin-1 would help:

line = p.stdout.readline().decode('latin-1').strip()

That gave me:

Matplotlib is building the font cache using fc-list. This may take a moment

I restarted the shell, and now it gives:

OSError: invalid face handle

This may be unrelated, since changing the encodings back still leaves the error, which it wasn't producing before. Honestly, I'm not sure what it's doing anymore, and I may just have messed it up somehow.
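
To illustrate the trade-off between the two decodings mentioned above: latin-1 maps every possible byte to a character, so it never raises, but it garbles multi-byte UTF-8 text. Decoding as UTF-8 with `errors="replace"` is one alternative that also never raises. This is a rough sketch only; `decode_line` is a hypothetical helper and the sample bytes are a made-up stand-in for a line read from `p.stdout`.

```python
def decode_line(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode one line of subprocess output without raising on bad bytes."""
    # errors="replace" substitutes U+FFFD for any byte sequence that is invalid
    # in `encoding`, instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace").strip()


# Made-up stand-in for p.stdout.readline().
raw = "caf\u00e9 \u201cquoted\u201d\n".encode("utf-8")

as_utf8 = decode_line(raw)                  # text comes back intact
as_latin1 = raw.decode("latin-1").strip()   # never fails, but yields mojibake

print(as_utf8)
print(repr(as_latin1))
```

If the subprocess output really is UTF-8, the first form keeps titles readable; the latin-1 form only guarantees that decoding does not crash.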

rstebee commented 2 years ago

I'm having the same exact problem and I have no idea what to do

Jessime commented 2 years ago

@rstebee if you can post some or all of the data that's causing trouble, that'll help a lot.

barbatoz0220 commented 2 years ago

Hi @Jessime,

First of all, thank you very much for this awesome work.

I was trying out this project last night and ran into the exact error posted here. After a bit of looking around, I found 2 threads on StackOverflow that helped me find a workaround.

So, in the file youtube_history.py, I went to the line soup = BeautifulSoup(watch_history.read_text(), 'html.parser') and modified it as follows:

with open(watch_history, encoding='utf8') as history:
   soup = BeautifulSoup(history, 'html.parser', from_encoding="utf8")

It seems like watch-history.html was encoded in UTF-8 and, as the error said, the default encoding on Windows machines (cp1252) could not decode the byte 0x9d, which is most likely the last byte of the UTF-8 encoding of the ” (right double quote) character.
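
The explanation above can be reproduced with the standard library alone. This is a sketch under assumptions, not necessarily how the project handles it: the file name and contents are made up, and `read_history` is a hypothetical helper. It writes a small UTF-8 file containing a right double quote, shows that reading it as cp1252 fails, and then reads it with an explicit encoding.

```python
import tempfile
from pathlib import Path


def read_history(path) -> str:
    """Read the file as UTF-8 regardless of the platform's default encoding."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Made-up sample file standing in for the Takeout export.
    sample = Path(tmp) / "watch-history.html"
    sample.write_text("<p>\u201cquoted\u201d</p>", encoding="utf-8")

    try:
        # Forces the encoding that Windows picked by default in the traceback.
        sample.read_text(encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    text = read_history(sample)

print(cp1252_failed, text)
```

Passing `encoding="utf-8"` explicitly makes the result independent of the machine's locale, which is why the same script can work on Linux or macOS and fail on Windows. When the file is opened in text mode like this, the `from_encoding` argument to BeautifulSoup is probably redundant, since the parser already receives decoded text.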

Hope this helps with your problem. I'm also a Python noob so please feel free to propose a better solution 👏 .

Jessime commented 2 years ago

Hey all, this commit should fix things up:

https://github.com/Jessime/youtube_history/commit/615b48fe255d8cee7c006310db46087af08819ad

Thanks for reporting the issues!