arvidn / libtorrent

an efficient feature complete C++ bittorrent implementation
http://libtorrent.org
Other
5.26k stars 996 forks source link

Python bindings: deprecate nonstandard inputs to bencode #5991

Open AllSeeingEyeTolledEweSew opened 3 years ago

AllSeeingEyeTolledEweSew commented 3 years ago

bencode() currently accepts any type as input, including object.

bdecode() only outputs dict, list, int or bytes. This means there are lots of values that aren't stable round-trip, that is v != bdecode(bencode(v)).

It's not clear this support is helpful for any use case, and it can hide mistakes.

bencode() should only accept combinations of dict, list, int or bytes. Any others should fire DeprecationWarning, and should eventually raise TypeError at some later time.

AllSeeingEyeTolledEweSew commented 3 years ago

I only just realized that tuples of ints (100, 100, 100) are treated as preformatted inputs to bencode(). I guess round-trip stability is intentionally bypassed here. I think tuples-of-ints is a pretty unfriendly way to present this in python; I'll have to think more about the best approach here.

In the meantime at least fleat and object should be deprecated.

arvidn commented 3 years ago

you don't think it would be reasonable to accept string as well, and encode it as UTF-8 into bytes?

arvidn commented 3 years ago

how does this look? https://github.com/arvidn/libtorrent/pull/6125

arvidn commented 3 years ago

@AllSeeingEyeTolledEweSew that patch, #6125, would you consider that to address this issue?

Also, which one of your PRs do you think makes the most sense to land next?

AllSeeingEyeTolledEweSew commented 3 years ago

I think str should be deprecated. I could probably do the patch. Sorry, I meant to do more patches to fix all the bugs I opened but I got momentum doing my own project.

6188 is definitely my vote for what to do next.

arvidn commented 3 years ago

I still don't understand the rationale for deprecating str. In a way it simplifies usage, since the caller doesn't need to know that strings are supposed to be UTF-8 encoded.

On the other hand, accepting strings makes the API non-symmetric. The string won't round-trip.

What is the reason to deprecate str in your mind?

AllSeeingEyeTolledEweSew commented 3 years ago

I think supporting str is an impedence mismatch with python 3.

Python 3 considers that string encodings should be explicit. Typically using a str input in the wrong place raises an explicit error, rather than implicitly encode the string. A user accustomed to this might expect that if they accidentally use a str somewhere in the input to bencode, that it should also raise an error. If we helpfully implicitly encode, the user could get themselves in trouble later.

If we were still in the python 2 world, I would want to support str inputs the way we do now, to match the environment.

arvidn commented 3 years ago

but bencoding does support strings, and they are supposed to be encoded as utf-8 (by the spec). In a way, one could argue that leaving the string encoding to the user opens up the possibility to generate invalid bencoded structures. In fact this used to be a problem in the early days, where people would encode filenames with arbitrary, local, code pages.

I think it's a stretch to say a str in a bencoded structure is "the wrong place". But I appreciate that it might be a better match to have bencoding and bdecoding the at the same abstraction level.

AllSeeingEyeTolledEweSew commented 3 years ago

I believe that accidental or ill-considered use of str is more common than intentional use of str to mean "encode this as utf-8", as you describe. I don't have data for this assertion, but I do know that considerations like this came up a lot in python 2, and many people smarter than me decided the best course was to make encoding explicit in python 3.

Also, remember that not all strs encode to utf-8 cleanly. Invalid surrogate pairs exist, and this is used as a "feature" for pathname handling. An app author might use a str for a pathname passed to bencode(). But if the str came from the filesystem it will have been treated with the filesystem encoding, and it may not encode to utf-8. There are examples of this for both windows and linux. The proper handling is something like os.fsencode(str_path).decode("utf-8", "replace"). If we allow str inputs and the author only tests with ascii filenames, then the failure will only be found by an unlucky user.

Lastly, as you mention, I think it's best to have bencode() and bdecode() match each other to minimize surprise.

So I still vote to deprecate str.