Skallwar / suckit

Suck the InTernet
Apache License 2.0

Quoting issue on charset detection #144

Closed: marchellodev closed this issue 2 years ago

marchellodev commented 3 years ago

When trying to scrape ted.com:

2021-07-08 15:14:04.662214650 +03:00: [WARN] Charset 'utf-8' not supported for https://ted.com/attend/ted-on-screen, defaulting to UTF-8
2021-07-08 15:14:05.480213483 +03:00: [WARN] Charset 'utf-8' not supported for https://ted.com/participate/nominate, defaulting to UTF-8
2021-07-08 15:14:06.454700388 +03:00: [WARN] Charset 'utf-8' not supported for https://ted.com/participate/organize-a-local-tedx-event, defaulting to UTF-8
2021-07-08 15:14:07.116944837 +03:00: [WARN] Charset 'utf-8' not supported for https://ted.com/participate/translate, defaulting to UTF-8

Not a big deal, but kind of annoying. Also, I am not really sure how to fix it, since this method returns 'utf-8' (with the quotes included) rather than utf-8: https://github.com/Skallwar/suckit/blob/master/src/scraper.rs#L99

Skallwar commented 3 years ago

That's weird. Maybe we will need to unquote the value.

marchellodev commented 3 years ago

Yeah, removing ' works, but I don't think that's the most elegant solution :)

Skallwar commented 3 years ago

We just need to update the regex so it strips any ' around the charset value, no big deal.
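
For anyone landing here later, this is a rough sketch of the unquoting idea using only the standard library. It is not the code in scraper.rs (the actual regex is not shown in this thread), and `charset_label` is a hypothetical helper name. It assumes the charset appears as `charset=<label>` in a Content-Type value or a meta tag, optionally wrapped in ' or ".

```rust
/// Hypothetical helper (not part of suckit): extract the charset label from
/// a Content-Type value or <meta> snippet, ignoring surrounding quotes.
fn charset_label(source: &str) -> Option<String> {
    // ASCII lowercasing keeps byte offsets identical, so an index found in
    // `lower` is also valid in `source`.
    let lower = source.to_ascii_lowercase();
    let start = lower.find("charset=")? + "charset=".len();

    // Skip an opening quote, if there is one.
    let rest = source[start..].trim_start_matches(|c: char| c == '\'' || c == '"');

    // The label ends at a closing quote, a separator, or whitespace.
    let end = rest
        .find(|c: char| matches!(c, '\'' | '"' | ';' | '>' | '/') || c.is_whitespace())
        .unwrap_or(rest.len());

    let label = &rest[..end];
    if label.is_empty() {
        None
    } else {
        // Normalize case so the label can be compared against known names.
        Some(label.to_ascii_lowercase())
    }
}
```

With this, `charset='utf-8'`, `charset="UTF-8"` and `charset=utf-8` all yield the same label, so the lookup no longer sees the quotes. The same effect could probably be had by letting the existing regex match optional quotes outside the capture group.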