cognitect / transit-format

A data interchange format.

Support for tuple and atom types? #29

Closed okeuday closed 9 years ago

okeuday commented 9 years ago

In several programming languages, tuples and atoms are used differently from the types transit currently supports. It would be helpful if the transit protocol were extended to support tuples, which would become arrays in programming languages that lack a tuple type. It would also be helpful if the transit protocol were extended to support atoms, which would become binaries in programming languages that lack an atom type. Is there any reason why transit would not change to support tuples and atoms?

GetContented commented 9 years ago

@okeuday by "atoms" do you mean what most LISPs or Scheme or Erlang usually mean by atom? The reason I ask is that in Clojure, atom is a kind of software transactional memory reference type that has transactional semantics around mutability. Quite different to what is meant in Erlang or most other LISPs!

okeuday commented 9 years ago

@JulianLeviston yes, I mean what most LISPs/Scheme or Erlang mean by atom. Clojure's usage looks like atomic data access, not a unique identifier. I submitted this issue because the transit format could handle more types to make communication between other programming languages simpler by widening the scope to include types that do not always have a direct mapping into all the programming languages.

Support for tuples and atoms seems critical for Erlang usage, since in pure Erlang source code messages are often tuples with an atom as the first element. Tuples and atoms also often appear inside a message when handling nested data. Other Erlang types like process pids, ports, refs, etc. are generally specific to Erlang, so I am not expecting transit to be a superset of all programming languages. It just seems advantageous to have atoms and tuples accessible, since they can be binaries and arrays, respectively, in programming languages that lack an atom and/or tuple type.

GetContented commented 9 years ago

Yes, I realise that's what you meant. I was actually trying to get you to think about the semantics of your request across many languages, which it seems you've already done, to a degree.

As Transit is primarily a cross-language serialisation and data communication library, what kind of data is this atomic "unique value"? In LISP, an atom includes numbers and the empty list / null value (IIRC), which have completely different semantics in other languages: numbers, for example, should be numbers in most other languages, not LISP's atoms. How would you store such a thing in a database, for example? Is it data, or is it of semantic value to the compiler (i.e. what meta-level does the information belong to)? From my understanding Transit is primarily concerned with application-level data semantics (i.e. "Widgets" and "Chocolate Bars" to sell in your store app), not code-level data semantics (i.e. "Symbols" versus "Strings", and "Vectors" versus "Linked Lists"), but I could be wrong. Having said this, the extension mechanism in transit could easily extend to whatever you'd like it to, IIRC.

As to tuples, why can't you represent a tuple as an n-item list?

What is a "binary"? Do you mean a boolean, or do you mean a pair of set type?

okeuday commented 9 years ago

@JulianLeviston An atom that is assigned as '1' (using Erlang syntax) should remain an atom if the programming language supports it. If the programming language has no atom concept, it should be raw data that contains '1', since the raw data can provide a unique identifier without the type support. I have looked a little at the transit impls and the current transit-format, which appears to show that atoms are provided under the semantic type name "symbol", with native usage in Ruby and class wrappers elsewhere. It appears that the current Erlang impl does not use transit symbol types as Erlang atoms, which is likely safest, because Erlang atoms are never garbage collected. So it isn't fair to say transit doesn't support atoms; it does, they are just named symbols.

The tuple type is best as an array, since its size is generally static and its elements are generally quickly accessible by integer index. Based on the other transit impls, it seems like programming languages without tuples would have a class wrapper with an array inside. A tuple would need to be added to the transit-format as a new composite type, though it would generally be the same as the array composite type for programming languages that lack a tuple type. The Python transit impl has arrays as tuples, but the Erlang impl would likely use the array module for an array while using a tuple type for tuples (currently, the Erlang impl represents an array as a list, which performs badly when accessing the array data and is also unnatural for array access).

I am not very familiar with the Transit format. This issue was filed after looking at it, because it summarized the shortcomings I perceived which would prevent me from using it. The reason for this judgment is that I would want to justify transit usage based on type support that msgpack does not provide for simpler programming language communication. I am mainly concerned with primitive programming language types (not the uri, uuid, time, etc. types) and this issue was written based on considering usage of transit in Erlang for communication with C/C++, Java, JavaScript, Perl, PHP, Python and Ruby, so that type information would be preserved when possible from Erlang data. The main current concern I have about the transit-format is the lack of a tuple type, since the atom type exists as a symbol.
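The observation that atoms already exist in Transit as symbols can be illustrated with a small sketch. It assumes the JSON encoding in which a symbol is written as a string prefixed with `~$` and a plain string that starts with `~` is escaped by prepending another `~`. The `Symbol` class and the `write_scalar` / `read_scalar` helpers are hypothetical names made up for this example, not part of any Transit library, and only the `~` escape is handled.

```python
class Symbol(str):
    """Hypothetical wrapper marking a string as a symbol (atom-like) value."""


def write_scalar(value):
    """Encode a symbol or plain string as a Transit-style JSON string."""
    if isinstance(value, Symbol):
        return "~$" + value          # symbols carry the "~$" prefix
    if isinstance(value, str) and value.startswith("~"):
        return "~" + value           # escape a literal leading tilde
    return value


def read_scalar(text, keep_symbols=True):
    """Decode a string; fall back to a plain name when symbols are unwanted."""
    if text.startswith("~$"):
        name = text[2:]
        return Symbol(name) if keep_symbols else name
    if text.startswith("~~"):
        return text[1:]              # undo the escape
    return text


encoded = write_scalar(Symbol("ok"))
as_symbol = read_scalar(encoded)                      # language with atoms
as_string = read_scalar(encoded, keep_symbols=False)  # language without atoms
```

A reader in a language with an atom type could map `as_symbol` to a native atom, while a reader without one keeps the bare name, which is the fallback behaviour requested at the top of the thread.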

timewald commented 9 years ago

Transit is designed to be extensible, so even though the spec does not define a tuple type, you can add your own. This is how all the built-in composite types other than map and array are implemented. See the section on extensibility, [1], in the spec.

Tim-

[1] https://github.com/cognitect/transit-format#extensibility
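To make the extension route concrete, here is a minimal sketch of a tuple carried as a tagged composite value. It assumes the JSON encoding in which a tagged composite is a two-element array whose first element is `"~#"` plus the tag. The tag name `tuple` is hypothetical (the spec does not define it), and `encode` / `decode` are hand-rolled helpers for illustration, not the API of any Transit implementation. String escaping and maps are omitted for brevity.

```python
import json

TUPLE_TAG = "~#tuple"  # hypothetical tag; the Transit spec does not define it


def encode(value):
    """Turn Python data into a JSON-ready structure, tagging tuples."""
    if isinstance(value, tuple):
        return [TUPLE_TAG, [encode(v) for v in value]]
    if isinstance(value, list):
        return [encode(v) for v in value]
    return value


def decode(value):
    """Reverse encode(): a two-element array led by the tag becomes a tuple."""
    if isinstance(value, list):
        if len(value) == 2 and value[0] == TUPLE_TAG:
            return tuple(decode(v) for v in value[1])
        return [decode(v) for v in value]
    return value


message = ("ok", [1, 2, (3, 4)])
wire = json.dumps(encode(message))
round_tripped = decode(json.loads(wire))
```

In this sketch the distinction between the outer tuple, the inner list, and the nested tuple survives the round trip, which is the property the issue asks for, at the cost of both ends agreeing on the tag.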

Datomic Team

okeuday commented 9 years ago

@timewald Yes, I understand it is possible to make custom types. However, a tuple is a primitive type in some programming languages, and I think it is important that the programming languages that have a tuple type be able to use transit tuple data as a tuple. At the very least, this impacts Erlang and Python. I am not sure about all the programming languages that have a tuple type, but by adding a tuple type to the transit-format standard, all the implementations will agree, since that is the purpose of making a standard. You should be able to understand that since a tuple type is a primitive type, it needs to be included to be handled properly.

dchelimsky commented 9 years ago

> it needs to be included to be handled properly.

What does "properly" mean? There's nothing special about extension types vs built-in types, so support for tuples would work the same way regardless of whether it's built-in or an extension.

GetContented commented 9 years ago

@okeuday I think you should look more closely at using transit. It feels a little bit like you've never used it. It's possible to implement all kinds of things... for instance, I implemented records - very easily... these are not built into transit, but they are built into Clojure. These are handled "properly". It's super trivial to implement whatever you need.

Tuple isn't a primitive in every language. The composite extension type IS the canonical way to implement any composite type you need, whatever the form. Note that you'll have to decide how this is implemented on either end, so if you're translating between, say, Ruby on one end, and Erlang on the other, you need to encode what Ruby should think the Tuple is. Perhaps you'll create a Tuple class, perhaps you'll use an Array... this is up to you, and your code requirements.
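The point that each end decides what a tagged value becomes can be sketched as a reader driven by a table of handlers. Everything here is an assumption made for illustration: the `tuple` tag, the `read` function, and the `TaggedValue` fallback are hypothetical names, and the `["~#tag", rep]` wire shape is assumed from the JSON encoding of tagged composites. The fallback reflects the idea that a reader without a handler can keep the tag and payload together rather than fail.

```python
from collections import namedtuple

# Fallback for tags with no registered handler: keep tag and payload together.
TaggedValue = namedtuple("TaggedValue", ["tag", "rep"])


def read(value, handlers):
    """Decode nested arrays, dispatching tagged values to per-tag handlers."""
    if isinstance(value, list):
        is_tagged = (
            len(value) == 2
            and isinstance(value[0], str)
            and value[0].startswith("~#")
        )
        if is_tagged:
            tag = value[0][2:]
            rep = read(value[1], handlers)
            handler = handlers.get(tag)
            return handler(rep) if handler else TaggedValue(tag, rep)
        return [read(v, handlers) for v in value]
    return value


wire = ["~#tuple", ["point", 1, 2]]

as_tuple = read(wire, {"tuple": tuple})   # consumer that wants a native tuple
as_list = read(wire, {"tuple": list})     # consumer that is happy with an array
unhandled = read(wire, {})                # consumer that never heard of the tag
```

The same bytes yield a tuple, a list, or an opaque tagged value depending only on the consumer's handler table, so the producer does not need to know which choice each consumer makes.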

Transit is extensible in a similar way that XML is extensible, but it's vastly less verbose, and way more efficient, and translates into real language constructs, whereas XML doesn't - it's just text. You do still need to do some work on the extensible bits, though.

If you want to continue to argue your point, please do provide a concrete example of where not having first-class tuples would cause a problem that could not be mitigated with the extension type.

okeuday commented 9 years ago

@JulianLeviston You agreed that a composite extension type (I read this as "custom type", generally a user-defined type that lies outside the transit format) would require modifications on both ends, due to not being part of the transit standard (so, both Ruby usage and Erlang usage in your example would require custom type usage, due to not having a real tuple type, as mentioned in your example).

A concrete example of why this is problematic, ignoring the basic problem of not having a standard way of handling a tuple type (which limits the scope of what transit can handle in a standard way), is that in Erlang you are unable to use both an array as an array (via the array module) and a tuple as a real tuple type. Currently, the Erlang transit implementation uses a list for the array type, which is inappropriate because a list does not provide efficient index-based access. A transit tuple cannot be an Erlang array, since tuples are inefficient to set in Erlang. So, you could always convert an Erlang tuple to a transit array, but it wouldn't be helpful to have transit arrays as Erlang tuples (both Erlang arrays and tuples offer index-based access, where arrays use tuples internally to make setting array data more efficient than it is with plain tuples). It is better to have tuples and arrays as separate types, since their usage can differ.

This was part of examining transit to see if it can provide correct programming language exchange of data types and I am currently not convinced it is worth pursuing, because these issues with conversions of primitive data types appear to lack documentation and clear rules. For example, the documentation would benefit from having "atom" listed next to "symbol" to make sure both data types exist as the same thing, when possible. My hope is that transit is not clojure-centric, so clojure-specific usage of the word "atom" should not be a problem.

I would have expected transit arrays to be Python lists, instead of Python tuples, to make sure they are mutable, but that is a small point. The main problem that I see is just the lack of support for a tuple type in the transit standard, and it would prevent me from using transit for the source code I am thinking about. I am not sure about a better format for the purpose of attempting to preserve programming language type information, so being unwilling to modify the transit format is unfortunate. My view of what transit should or could be, is a format that provides exchange for all primitive data types, at least, and that view appears to be incorrect.

GetContented commented 9 years ago

@okeuday Seems to me you're complaining about the defaults of an extensible system.

okeuday commented 9 years ago

@JulianLeviston No, I am not. I am concerned about the lack of support for primitive data types, where the tuple type is my main current concern, since I have not yet seen other missing primitive data types (after agreeing that atoms exist under the name symbols). Using a custom type requires changing both sides of transit usage, which for my purposes is impossible, since I would control only one side. So ignoring this concern about primitive data types would be ignoring a use case.

GetContented commented 9 years ago

No one is ignoring them.

ohpauleez commented 9 years ago

@okeuday In your application, you are generating and exposing data (say, over some API of some sort), and unknown/random/uncontrolled third-parties are consuming the data?

Moving away from the discussion of primitive types, think about the semantics of the data you're conveying. Is there something specific about the semantics of the Tuple type you're trying to convey to consumers? If so, it most likely makes sense to use a custom type to capture that semantic intent. In some languages, that semantic intent could map to a built-in type, but in other languages, consumers will have to decide how they want to interpret the semantics - that's encouraged. Just food for thought.

okeuday commented 9 years ago

@ohpauleez Yes, that is correct. The specific semantics of a tuple type are specific to the tuple type as it would be found in a programming language, if it does exist, which should be able to be generalized as: a static sized type containing other types, allowing index-based access. I am not aware of a tuple type in a programming language that would contradict these semantics and I believe it is safe to use these generalizations when thinking about a tuple type.

These semantics are best conveyed as a tuple type, and due to controlling only a single side, it is best to handle it as a real type within the transit format. I have to also consider the question "What makes transit better than just using msgpack directly?" and that should be the larger number of types it is able to handle transparently, since I could easily remove the type information to utilize only msgpack. However, it would be better to preserve the semantics associated with the type information to provide more accurate data exchange between source code, in any of the programming languages transit supports.

It is hard for me to not discuss this as a primitive type problem, since I see that as critical to make sure the semantics of types are preserved across programming languages with usage of transit.

richhickey commented 9 years ago

> a static sized type containing other types, allowing index-based access.

So is a Tuple2 a different type from a Tuple4?

If so, then Tuple is a family of types and thus not supportable as a base fixed-named type of Transit, or the size is a dynamic property and the semantics are not different from the built-in 'array' type. Just pretend it is called 'tuple'.

In any case, you’ve gotten some good advice here on how to either map to array or use Transit’s extensibility to do what you need. There are no plans for a built-in Tuple type.

Regards,

Rich
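The second branch of that argument (size as a dynamic property) can be seen in Python, used here only as one example of a language with a built-in tuple. This is an illustrative sketch, not a statement about how any Transit implementation maps types.

```python
pair = (1, 2)
quad = (1, 2, 3, 4)

# Length is a runtime property of the value, not part of the runtime type.
same_runtime_type = type(pair) is type(quad)

# Round-tripping through a list (the closest analogue of a Transit array)
# loses nothing: order, length, and index-based access all survive.
restored = tuple(list(quad))
```

Static type checkers do distinguish `tuple[int, int]` from a four-element tuple annotation, which corresponds to the first branch: a family of types rather than a single named type.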

okeuday commented 9 years ago

@richhickey No, that just distracts from the discussion. Tuple is a single type, similar to an array, but different. This is discussed in paragraph 2 at https://github.com/cognitect/transit-format/issues/29#issuecomment-126397756 . If there are no plans to care, feel free to close the issue.