agonzalezro / python-opensubtitles

A python wrapper for the opensubtitles API
ISC License
153 stars 41 forks source link

latin chars discarded #43

Open shei-pi opened 4 years ago

shei-pi commented 4 years ago

The detect function can't always guess the right encoding, so the decompress methods falls back to 'utf-8' and ignoring the errors. Therefore the latin chars get discarded. There should be an optional param to force the decoding to be used when decoding the data.

CosmicHorrorDev commented 4 years ago

It seems a bit silly to always fall back to utf-8 when no character detection libraries are installed. data actually contains the encoding specified when subtitles are uploaded (even though this can be wrong if the author set it wrong) with the key "SubEncoding" so I feel like that should really be the first choice to fall back on.

agonzalezro commented 4 years ago

That's a really good point @LovecraftianHorror, I didn't realise we had that info on the response.

@jprabino could you change your PR #44 for testing this?

Thanks to both!

shei-pi commented 4 years ago

Got it. Thanks @LovecraftianHorror!

shei-pi commented 4 years ago

@LovecraftianHorror can you give an example? I've been trying to test this but couldn't find where the "SubEncoding" key is supposed to be. We get self.data using: self.data = self.xmlrpc.DownloadSubtitles(self.token, ids)

Here's the result of that.

`self.data

'status':'200 OK' 'data':[{'data': 'H4sIAAAAAAAAA5V9zY4c...zzOeZotQAA', 'idsubtitlefile': '1955339152'}] 'seconds':0.06 len():3`

'data' inside self.data has the subtitles in base64, by the time we convert from base64 we should have the encoding already. In any case I checked and couldn't find the "SubEncoding" anywhere.

CosmicHorrorDev commented 4 years ago

Oh sorry I should have looked into the library more to see how downloading actually worked. The "SubEncoding" is returned on the results from SearchSubtitles not on the download information. You might be able to add another parameter for passing in encodings for the subtitle ids with a dictionary to make them optional? I'm not really sure about the best way to do this then.

If you want another reference there's me getting the "SubEncoding" here for a search result and using it here for downloaded subtitles. I recently changed downloading to just directly download the content instead of decoding it which avoids the error entirely. It seems like you should be able to do that here except for when return_decoded_data is True which would force you to decode it. Maybe changing to that would be the best course of option since I doubt many people use return_decoded_data, but it would still be good to have a way to specify the decoding.

Hopefully, that gives some options. I should have looked into the code a bit more before commenting the first time.

tiago-portugal commented 2 years ago

Thanks for the insights. 👍