DealProposal's Label field serializes as non-UTF8

ec2 commented 4 years ago

The Label field in DealProposal is a string that is not guaranteed to be UTF8 as required in the CBOR spec. This causes issues for other client implementations (like us). It will take a LOT of effort for us to work around this bug.

I would suggest somehow enforcing UTF-8 or switching the field to be arbitrary bytes instead.

anorth commented 4 years ago

I haven't investigated deeply, but it sounds like the cbor-gen generated code is permitting non-UTF8 bytestrings in this field.

cc @ZenGround0 @rvagg

rvagg commented 4 years ago

I think @willscott might be on the bleeding edge of this one with this work. Also /cc @warpfork @ribasushi as we've collectively been circling around these concerns since Filecoin has been mildly pushing at the definition of "strings" in some situations (thanks mainly to the simple Go mapping of string:[]byte) that introduce some cross-impl difficulties. And /cc @vmx for the Rust -- see linked ChainSafe issue which relates to what you've been working on.

I'll put this on the agenda for our IPLD meeting tomorrow https://github.com/ipld/team-mgmt/#weekly-call although we can't resolve Filecoin concerns there, but @willscott et. al. might have insights that help us suggest a way forward or at least comment further and maybe document some of this at least.

ribasushi commented 4 years ago

Do note that Label is designed to take user-input essentially directly. Adding any type of enforcement in the path populating this field would be... difficult. For comparison the Send params are declared []byte https://github.com/filecoin-project/lotus/blob/v0.10.2/chain/types/message.go#L43

mikeal commented 4 years ago

We talked about this in the IPLD team call.

We recommend moving to a byte value instead of a string. As you can see, encoding invalid UTF8 in string values will break in some languages. This is a language cross-compatibility problem for which there is no good solution other than “don’t encode non-UTF8 data in strings when you’re in languages that allow you to have invalid UTF8 in your strings.”
Given that these are already on-chain, there’s a bit of an open question about how to best manage dealing with the existing data. The solution is going to vary quite a bit by language and toolchain. @vmx may have more to say about the best way to deal with this in Rust where we’re currently seeing it. This is less of an issue in Go but once you migrate to a new value type handling the variation in value types is something @warpfork is integrating into go-ipld-prime schemas. JS shouldn’t have any problems, strings are UTF16 and can have invalid characters, and since it’s not strongly typed there’s no concern handling a value type variation in IPLD.

anorth commented 4 years ago

Ok, representing this as []byte sounds good.

We can't make this change until we do a major-version upgrade whereupon we can change state schema. At that point we can also migrate all deals in the chain state to encode as []byte instead of string. But nothing can change the fact that prior to that migration epoch they will be encoded as strings.

Changing might also be a pain for the markets modules and other software that deals in deals, which will have to handle pre-migration and post-migration values distinctly. We can't do this without buy-in from the teams who will wear that pain on the currently-in-use software. cc @hannahhoward @arajasek @whyrusleeping

austinabell commented 3 years ago

what is the reason for this being removed from v3 milestone @ZenGround0 ?

anorth commented 3 years ago

I'm not sure of Zen's explicit reason, but the focus of v3 (and likely v4 too) is chain efficiency, attempting to remove the bottleneck that is the chain validation throughput (i.e. gas usage). It's a rather pressing problem until we can improve throughput to match miner sealing hardware.

rvagg commented 3 years ago

To add some more weight for getting this changed - it turns out that it's as difficult in JavaScript as it is in Rust to deal with this. We currently have no way to reliably round-trip these blocks in our JavaScript stack since the standard bytes->string conversion methods available all deal with invalid UTF-8 characters by inserting U+FFFD. We can't really even reliably make use of the data if we decode it straight.

We're talking about strategies for doing a reach-around for special cases like this but there's no clear way to deal with it as the standard case (which is why we've hardened up on the SHOULD for Strings as UTF-8 in the IPLD Data Model https://specs.ipld.io/data-model-layer/data-model.html#kinds-reference). But it would be a special case thing where you know ahead of time that you're dealing with a block that has dodgy strings in it so you'd either explicitly ask to have strings dealt with as bytes, or request an alternative decoding scheme that can get you closer to the bytes you care about, such as latin1.

austinabell commented 3 years ago

I'll just add my motivation for this, aside from the fact that it's invalid Cbor based on the spec (https://tools.ietf.org/html/rfc7049#section-2.1). The main reason is that we can't validate that any string that gets decoded from Cbor is valid utf-8 and can introduce bugs in the future.

I get that the transition and the removal of unsafely decoding strings cannot happen soon now that it's baked into the v0 and v2 actors, but would be nice to remove ASAP so it hopefully won't be a permanent problem in the future. If this gets changed, it allows us to validate the whole protocol does not include this. This was the only occurrence of invalid utf8 when it was introduced, and the value is never being used.

These interactions may not be a big deal for the go implementations since the behaviour might be expected or deterministic, but causes issues for other implementations to match. Also, external tooling or interfacing of this data may cause inconsistencies (for example through JSON because it also only supports unicode)

Having said this, I don't have a problem if this doesn't change and understand why it wouldn't, just wanted to say my piece for why I think the change should be made.

ZenGround0 commented 3 years ago

While not critical to do right now I would like to get this scheduled as a fix for v4.

I need some help understanding how extensive a change this is on the lotus side. @magik6k @arajasek could you estimate how much effort will need to go into supporting a change to the on chain deal proposal format? Relatedly are proposal cids used by the markets code extensively?

laudiacay commented 3 years ago

So was the decision just slinging around bytes, or erroring if it's non-UTF8?

ZenGround0 commented 3 years ago

~The decision is still yet to be made but as long as it is feasible I'd like to filter non-UTF8 strings from the deal label field. Probably as part of deal validation.~

~The second part of this is remove non-UTF8 deal labels from state entirely. It's definitely easiest to just let the proposals with this property expire, but we could also do a migration to zero out these labels, or do some sort of reduction down to UTF8.~

Rereading this issue properly and the consensus to encode as bytes is better than what I just said. This has the nice property of all deal labels being unambiguously convertible in a state migration.

--edit--

The fact that the proposal signature never makes it on chain makes changing the serialization to cbor byte string acceptable imo. But this will need a FIP and lots of comms to users of the storage market so that we don't break tooling.

ribasushi commented 3 years ago

Aye, going with []byte also normalizes to a single type all our custom-input fields ( DealProposal.Label and Message.Params )

laudiacay commented 3 years ago

ok cool got a PR ready to go/passing CI for this

laudiacay commented 3 years ago

Done in code, merged into next, FIP hopefully coming soon!!

ZenGround0 commented 2 years ago

FIP 0027 is currently WIP. To read more about the approach we are taking please read FIP 0027 There are four major work items to get this implemented for the upcoming v16 network upgrade

[ ] Create a label type that serializes as a union of cbor byte string and cbor text string PR ready for review
[ ] The same change must be propagated to builtin rust actors. PR ready for final review
[ ] Write a migration that checks labels for utf8 invalidity and converts those labels to byte strings. All proposals with changed serializations will have changed cids and the market pending proposals map must be updated in these cases. This was originally done last year in this closed PR but was reverted for integration reasons. We can repurpose this PR without much rewriting. Current PR here
[ ] Lotus integration of the new label type. A WIP PR is open here

filecoin-project / specs-actors

DealProposal's Label field serializes as non-UTF8 #1248