Open AllSeeingEyeTolledEweSew opened 3 years ago
I only just realized that tuples of ints (100, 100, 100)
are treated as preformatted inputs to bencode()
. I guess round-trip stability is intentionally bypassed here. I think tuples-of-ints is a pretty unfriendly way to present this in python; I'll have to think more about the best approach here.
In the meantime at least fleat
and object
should be deprecated.
you don't think it would be reasonable to accept string as well, and encode it as UTF-8 into bytes?
how does this look? https://github.com/arvidn/libtorrent/pull/6125
@AllSeeingEyeTolledEweSew that patch, #6125, would you consider that to address this issue?
Also, which one of your PRs do you think makes the most sense to land next?
I think str
should be deprecated. I could probably do the patch. Sorry, I meant to do more patches to fix all the bugs I opened but I got momentum doing my own project.
I still don't understand the rationale for deprecating str
. In a way it simplifies usage, since the caller doesn't need to know that strings are supposed to be UTF-8 encoded.
On the other hand, accepting strings makes the API non-symmetric. The string won't round-trip.
What is the reason to deprecate str
in your mind?
I think supporting str
is an impedence mismatch with python 3.
Python 3 considers that string encodings should be explicit. Typically using a str
input in the wrong place raises an explicit error, rather than implicitly encode the string. A user accustomed to this might expect that if they accidentally use a str
somewhere in the input to bencode
, that it should also raise an error. If we helpfully implicitly encode, the user could get themselves in trouble later.
If we were still in the python 2 world, I would want to support str
inputs the way we do now, to match the environment.
but bencoding does support strings, and they are supposed to be encoded as utf-8 (by the spec). In a way, one could argue that leaving the string encoding to the user opens up the possibility to generate invalid bencoded structures. In fact this used to be a problem in the early days, where people would encode filenames with arbitrary, local, code pages.
I think it's a stretch to say a str
in a bencoded structure is "the wrong place". But I appreciate that it might be a better match to have bencoding and bdecoding the at the same abstraction level.
I believe that accidental or ill-considered use of str
is more common than intentional use of str
to mean "encode this as utf-8", as you describe. I don't have data for this assertion, but I do know that considerations like this came up a lot in python 2, and many people smarter than me decided the best course was to make encoding explicit in python 3.
Also, remember that not all str
s encode to utf-8 cleanly. Invalid surrogate pairs exist, and this is used as a "feature" for pathname handling. An app author might use a str
for a pathname passed to bencode()
. But if the str
came from the filesystem it will have been treated with the filesystem encoding, and it may not encode to utf-8. There are examples of this for both windows and linux. The proper handling is something like os.fsencode(str_path).decode("utf-8", "replace")
. If we allow str
inputs and the author only tests with ascii filenames, then the failure will only be found by an unlucky user.
Lastly, as you mention, I think it's best to have bencode()
and bdecode()
match each other to minimize surprise.
So I still vote to deprecate str
.
bencode()
currently accepts any type as input, includingobject
.bdecode()
only outputsdict
,list
,int
orbytes
. This means there are lots of values that aren't stable round-trip, that isv != bdecode(bencode(v))
.It's not clear this support is helpful for any use case, and it can hide mistakes.
bencode()
should only accept combinations ofdict
,list
,int
orbytes
. Any others should fireDeprecationWarning
, and should eventually raiseTypeError
at some later time.