Encode int as string - Githubissues

lmazuel commented 3 months ago

Problemm: We have a service team that requires string on the wire, but this is actually always an integer. Given the simplicity of converting string to int (and vice versa), we want to add support for declaring that conversion, and get emitters supporting it, to avoid manual code for service team cross-language.

Edit:

After discussion, it became clear that the encode parameter is an encoding, and therefore shouldn't be string, but something describing what the encoding will look like. In the usual case, we would want base-10, which would serialize 12 as "12". As base-10 is technically "decimal", we could use that term, but it might be conflated with the decimal type. The current consensus seems to be number, as a fair semantic of base-10:

model Foo {
  @encode("number")
  x: int
}

Original proposal at creation:

model Foo {
  @encode("string")
  x: int
}

### TypeSpec
- [ ] https://github.com/microsoft/typespec/issues/3856

### Spec
- [ ] https://github.com/Azure/cadl-ranch/issues/634
- [ ] https://github.com/Azure/typespec-azure/issues/1177

### Implementation
- [ ] https://github.com/Azure/typespec-azure/issues/1178
- [ ] https://github.com/Azure/autorest.python/issues/2703
- [ ] https://github.com/Azure/autorest.java/issues/2863
- [ ] https://github.com/Azure/autorest.typescript/issues/2676
- [ ] https://github.com/Azure/autorest.csharp/issues/4919
- [ ] https://github.com/Azure/autorest.go/issues/1402
- [ ] https://github.com/Azure/autorest.cpp/issues/364
- [ ] https://github.com/Azure/autorest.rust/issues/47

ArthurMa1978 commented 3 months ago

Hey @lmazuel , in the past, we encountered similar issue where a property was defined as a string, but its value was in an integer format. We requested the service team to correct the Swagger spec, as this was identified as a definition bug. Why are we now considering this a feature for TypeSpec?

timotheeguerin commented 3 months ago

one use case is representing int64 without loosing precision(for JS) or decimal

weidongxu-microsoft commented 3 months ago

Below discussion is on the condition that emitter is to support a general feature.

Please let dev know whether we have such requirement, besides that Batch bug (where backend should send an int, but instead send a string and cannot fix that due to backward-compatibility requirement).

For the Batch bug, I'd recommend Batch service write contentLength: string in typespec, as they do return it as a JSON string (spec is wrong, and same error on their pre-typespec Swagger). Service could do customization for SDK on the getter of the property.

The full typespec would be

model Foo {
  @encode("string", string)
  x: int64;
}

Question:

Do we only allow the 2nd parameter to be string for this use case? That is, we only allow client conversion from basic scalar type to string type. Or do we also allow e.g. @encode("float", float64) x: int32; We probably want a lint rule to avoid service write arbitrary scalar conversion like @encode("string", url) x: int32;
Do we have a better name at @encode("string", )? E.g. on date-time, we write @encode("rfc7231"), which is very clear, that it is date-time encoded as rfc7231 text.

A few proposals, of different scope of support

Only allow @encode("string", string) x: int64 and @encode("string", string) decimal (include decimal128). Possible reason: certain language may not able to process these types (e.g. JS can only process safeint which is smaller than int64). Lint rule (in typespec-azure-core) to disallow all other options.
Allow wire type be string, and client type be basic scalar. Possible reason: a float32 can always be represented as a string (however there could be multiple ways of such representation, e.g, "1000.0" vs "+1000.00" vs 1,000.0, and backend must accept all of them as equivalent if the property is used as input), but not vice versa. Downside is that there is not really a RFC like rfc3339 on what string representation is acceptable, what is not -- there is no guarantee a language-specific Float.parseFloat can handle all possible float-in-string-format.
Allow wire type be any basic scalar, and client type too. If in runtime, a "string" on wire cannot be converted to "int32" on client, throw exception. Downside, this probably going too far.

PS: by "basic scalar", I include "numeric types", "string", "boolean" from built-in-types

MaryGao commented 3 months ago

In the general use cases, most often the numbers we are talking about would fit well inside a double's range and precision.

With above assumption, JS decided to use number to represent decimal(see decision). So if service team really cares about the precision loss, it would be recommanded to use @encode (typespec issue).

So talking about encode with int epic, do we have similar scenario where we really care about precision or some other reasons? Actually we have lots of definition for "type": "integer", "format": "int64" in swagger, but JS just generates them as number in HLC and it works well. My understanding is data range outside JS could represent may not be common cases.

Back to weidong's proposal, I would prefer the option 1 with only allowing encoding this with necessary types e.g int64 or decimal. I haven't see any values with case @encode("string", string) x: int8 and obviously service bug may not be a good reason. Another point is this may more align with what OpenAPI is trying to support because their format int64 has base type with number and string(link and issue).

timotheeguerin commented 3 months ago

Yeah this encoding would only be allowed to target string(in the same way that rfc7231 will complain if you don't target a string)

Additionally same comment as before I don't think string is the right key. We pretty much always encode as a string. I think "decimal" might be more accurate(as you could imagine an hex or octal encoding as well) but it could be confusing with the decimal type

lmazuel commented 3 months ago

@weidongxu-microsoft you're on point that this is contentLength of Batch ;)

Agree with you all, let's keep things simple and accept only the scenario string/int for now, there is no need to get complicated. And if someone one day comes with another conversion scenario, we can investigate at time the validity of that scenario.

m-nash commented 3 months ago

@timotheeguerin are you suggesting @encode("decimal", string)? Meaning the type on the wire is a string, but the encoding is decimal format meaning a floating point number?

I kind of get this because decimal narrows the domain of the wire type string, much like rfc3339 does.

Is there a list of known name of an encoding as referenced by these docs so we know which values are available to choose from for this use case?

lmazuel commented 3 months ago

I understood @timotheeguerin as:

@encode("decimal", string)
foo: int32

would convert 12 to "12", while

@encode("oct", string)
foo: int32

would convert 12 to "0o14"

IOW, still use the string parameter to describe the format you want it to look like, and usually we use base 10 (or decimal)

m-nash commented 3 months ago

Might be confusing to use decimal as this is usually a floating point number in most languages? But yeah that tracks with what I thought the suggestion was, and I agree with the direction.

timotheeguerin commented 3 months ago

No not decimal as floating point decimal as decimal notation of numbers(how you write numbers normally)

Playground

But yes that was my point that decimal might be confusing

timotheeguerin commented 3 months ago

Could maybe be

decimal-str
decimal-string
decimal-notation

Also on a related note with encoding we are thinking that using string values it not great and it is better to keep the enum values as the only way for new ones

So could have that(name tbd with above discussion as well)

enum NumericEncoding {
  decimal,
  hexdecimal,
  octal,
  binary,
}

MaryGao commented 3 months ago

let's keep things simple and accept only the scenario string/int for now, there is no need to get complicated. And if someone one day comes with another conversion scenario, we can investigate at time the validity of that scenario.

@lmazuel except int not sure we could scope this a little to consider the decimal type also? I believe this would be useful for service team who cares precision.

weidongxu-microsoft commented 3 months ago

Maybe we could limit this to integer (int32, int64, safeint) to string, for now. There could be some nuance of the "decimal" representation, but may not be too bad.

If we allow float to string, there could be more questions, e.g. whether 2e10 a valid "decimal" for float -- looks not?

timotheeguerin commented 3 months ago

Missed the meeting notes here, problem isn't anymore about the decimal naming as much as deciding what to do all together.

THe current proposal is to have

enum NumericEncoding {
  base10,

 // With option to add the following in the future
  // base16,
  // base 8
}

That you can then use

@encode(NumericEncoding.base10)
a: int64;

The current question is

Should we consider that those non safe values for http services be seriaalized as string by default. I personnaly think this is not a great idea on top of being a breaking change
Should we maybe just allow @encode with no argument to mean encode to string with the default string encoding.
- this does sound interesting but it also feels like this is the only use case. All other use of @encode is when we want to change the default encoding because that type already needed to be encoded to a string implicitly(utcDateTime, plainTime, etc.)

timotheeguerin commented 3 months ago

Proposed changed signature for decorator

extern dec encode(
  target: Scalar | ModelProperty,
  encoding: string | EnumMember,
  encodedAs?: Scalar
);
extern dec encode2(target: Scalar | ModelProperty, encodedAs: Scalar);
extern dec encode2(
  target: Scalar | ModelProperty,
  encodingOrEncodedAs: string | EnumMember | Scalar,
  encodedAs?: Scalar
);

And then you can call

@encode(string) val: int64

restrict this overload format to only numeric scalar types and string as the encodeAs

weidongxu-microsoft commented 3 months ago

IMHO, I'd avoid @encode(string) directly, if possible. If we had to, limit it only to a subset of scalar that there is a consensus of what is acceptable as string representative.

I think we should be clear about the conversion rule (it need not to be an RFC, but we should be explicit about what is a valid string representation, what is not.

I am not aware that there is consistency between different languages, when convert string to/from a scalar.

Say convert String to Boolean, when de-serialize the JSON). Java's Boolean.parseBoolean handles true case-insensitively, and everything else is false. https://docs.oracle.com/javase/8/docs/api/java/lang/Boolean.html#parseBoolean-java.lang.String-

,NET's Boolean.TryParse only handle True or False. Exception if other values. https://learn.microsoft.com/en-us/dotnet/api/system.boolean.tryparse?view=net-8.0#system-boolean-tryparse(system-string-system-boolean@)

Golang also accept t f and 1 0, case-insensitive. Exception if other values. https://pkg.go.dev/strconv#ParseBool

timotheeguerin commented 3 months ago

We are limiting that to numeric types and it means the decimal representation of the number as if you were to serialize the number to a json string

weidongxu-microsoft commented 3 months ago

Yeah, that works. Or we may say that JSON numeric wrapped in quote is acceptable.

MaryGao commented 3 months ago

Maybe irrelevant to the detailed design. But I'd like to clarify one thing - what does the typespec type stand for?

model Foo {
  @format("date-time-rfc1123") data1: string;
  @encode(string) val: int64;
}

with @format, we have string as wire type and emitter could leverage format hint to define client type
with @encode, we have int64 as client type and emitter could leverage encode to understand wire type

timotheeguerin commented 3 months ago

@format is to be seen as a "known pattern" ( a known string validation format) thats it it doesn't affect the type. Here data1 is a string and should always be presented as a string. There is just some decorator that say you should validate this string is in this format(which our client do not do so basically @format is not something you should ever be concerned with)

MaryGao commented 3 months ago

Thanks for the explaination and it makes sense considering we have guideline on no client-side validation. So can I understand typespec type stands for client type or the type we want to show to our customers?

timotheeguerin commented 3 months ago

Not sure I see the distinction here client type or the type we want to show to our customers

The typespec type represent the logic type. In some client/server/other target you might not be able to represent it as that logical type so you might choose to keep it as the wire type(e.g. @encode(string) a: int64 in JS)

MaryGao commented 3 months ago

I think the logic type is what I mean here. One more question - with below example we would expect different logic type if emitter could somehow differentiate the utcDateTime or offsetDateTime, but on the wire they are the same format?

model Foo {
  data1: utcDateTime;  // RFC3339 for json (default)
  @encode(DateTimeKnownEncoding.rfc3339) data2: offsetDateTime;
}

timotheeguerin commented 3 months ago

Yeah they both use RFC3339 but data1 should always end with Z and data2 might have the offset (e.g.+8:00)

Azure / typespec-azure

Encode int as string #1179