ceylon / ceylon-spec

Add hexadecimal, octal and binary literals #382

Closed FroMage closed 11 years ago

FroMage commented 12 years ago

We can live without them, but damn that makes the code look silly. Most specs that deal with binary are defined in terms of either hexa, octal or binary numbers, which means that the code we write in Ceylon would need to have those numbers translated into decimal (first possible source of programmer error), and put the original number (hexa, octal or binary) as comments (which looks clumsy).

Case in point: UTF-8 decoding:

// 0b1100 0000 <= byte < 0b1110 0000
if(byte >= 192 && byte < 224){
}
// byte & 0x0F
Integer part1 = bytes.get() - 240;
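
For comparison, here is a rough sketch of the same kind of check in a language that already has binary and hex literals. It is written in Python, not Ceylon; the function name is hypothetical, and the sketch ignores overlong-encoding checks.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte."""
    if lead < 0b1000_0000:
        return 1  # plain ASCII
    if 0b1100_0000 <= lead < 0b1110_0000:
        return 2  # same range as 192 <= byte < 224
    if 0b1110_0000 <= lead < 0b1111_0000:
        return 3
    if 0b1111_0000 <= lead < 0b1111_1000:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


# The mask can be written as it appears in the spec, with no decimal detour.
low_nibble = 0xF3 & 0x0F
```

With the literals in place, the bit patterns from the spec appear directly in the code rather than in comments next to their decimal translations.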

Now, the syntax in Java is 0x prefix for hexa, 0 prefix for octal and 0b prefix for binary. I find the 0 prefix error-prone (if traditional) so I suggest we go for 0o (little "o" for Octal). I guess I'd also be fine with \x \o and \b.

The other question for those literals is that of postfix quantifiers and exponents. I just don't think they apply for those numbers, but hey, open to suggestions.

WDYT?

gavinking commented 12 years ago

I always assumed that this was a job for single-quoted literals.

FroMage commented 12 years ago

That's a possibility, but we don't know when/if those will be supported, and it looks weird that we go through all the trouble of supporting numeric literal postfixes and exponents and separators while overlooking a pretty basic and traditional feature such as this one.

I don't think '0x23f' is more readable than 0x23f either, so we're not winning by relying on single-quoted literals. What would be a good reason to not support them and rely on single-quoted literals?

gavinking commented 12 years ago

To be honest I have no clue wtf 0x has to do with base 16 numerals or how it came to be adopted by so many programming languages. I don't think that is a tradition we should continue.

(Sent by email in reply to Stéphane Épardaud's comment above, dated Aug 16, 2012.)

tombentley commented 12 years ago

Could we special-case this particular use of single quoted literals in the typechecker, just for use with the following three functions (which would be defined in the language module of course):

Integer i1 = hex('ffffff');
Integer i2 = octal('23634');
Integer i3 = binary('1111');

(Obviously something somewhere optimizes them to normal literals).

gavinking commented 12 years ago

What about a literal format something like:

Integer i1 = 2AFF\F;
Integer i2 = 2363\7;
Integer i3 = 101011\1;

of course, \9 would be the implied default.
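
One way to read this proposal: the symbol after the backslash is the largest digit of the base, so the base is that digit's value plus one. A minimal Python sketch of that reading follows. The function name is hypothetical, and treating digits case-insensitively and stopping at base 36 are assumptions, not something the thread specifies.

```python
import string

# Digit symbols in ascending order: 0-9 then A-Z (an assumption; the
# examples in the thread only go up to F).
_SYMBOLS = string.digits + string.ascii_uppercase


def parse_max_digit_literal(text: str) -> int:
    """Parse a literal whose suffix names the largest digit of its base."""
    digits, sep, marker = text.rpartition("\\")
    if not sep:
        # No suffix: plain decimal, i.e. the implied default \9.
        digits, marker = text, "9"
    marker = marker.upper()
    if len(marker) != 1 or marker not in _SYMBOLS[1:]:
        raise ValueError(f"invalid base marker in {text!r}")
    base = _SYMBOLS.index(marker) + 1
    allowed = _SYMBOLS[:base]
    if not digits or any(c not in allowed for c in digits.upper()):
        raise ValueError(f"invalid digits for base {base} in {text!r}")
    return int(digits, base)
```

Under these assumptions the three examples above parse as base 16, base 8 and base 2 respectively, and an unsuffixed number stays decimal.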

tombentley commented 12 years ago

That looks horrid to my eyes.

FroMage commented 12 years ago

To be honest I have no clue wtf 0x has to do with base 16 numerals

"x" stands for "heXadecimal". That's not really confusing.

I really don't like 2AFF\F because it looks like we're escaping the last F for some reason, and I think the base should come in front so my eyes can parse the thing as a hexa literal from the start and not redo the reading when I reach the end. Also it's not easy for me to associate the last digit in a base to the base itself. I mean, \9 doesn't really tell me this is base-10. It's certainly not obvious, even though it makes sense once someone explains what it could mean.

Also it's not consistent with our other uses of \ escapes which are prefixes (\i for example).

My favourites are still 0x, 0o and 0b, or \x, \o and \b.

FroMage commented 12 years ago

That looks horrid to my eyes.

Glad it's not just me being stuck in the past :)

gavinking commented 12 years ago

I doubt someone who didn't know that syntax would easily guess what it meant.

Hah, but someone who didn't know what 0xffff meant could easily guess?!

C'mon, admit it, the only reason that convention doesn't make you want to vomit is that you're used to it.

Having the \ in the middle of the token makes it look like an operator.

Perhaps, though \ isn't an operator in any language I know.

I mean, I don't really care what the escape/separator character is, it could be \, @, #, $, :, ::, \\, ~, whatever for all I care. It could be postfix or prefix if you prefer.

But what seems to me to be the craziest possible choice for an escape character is a leading zero. Nobody (except a computer programmer) thinks a leading zero in a numeral is significant. The 0x or 0b convention sucks the "x" and "b" into the digit string itself, making it difficult to visually parse, since our eyes aren't accustomed to treating numeric digits or letters as punctuation. To me, the difference between 0o2363 and 002363 is extremely difficult to pick. I mean, it's just not at all visually obvious what is the first numeric digit in 0xffff.

Sure, sure, we've taught ourselves how to interpret this crap, but it's still crap.

"x" stands for "heXadecimal". That's not really confusing.

OMG, it's so non-confusing that in 30 years programming I never figured this out for myself. That's perhaps the most arbitrary thing I've ever heard. Why not "h"?

gavinking commented 12 years ago

Perhaps, though \ isn't an operator in any language I know.

Excuse me, that's not true. I should have said "isn't an infix operator in any language I know".

ikasiuk commented 12 years ago

Excuse me, that's not true. I should have said "isn't an infix operator in any language I know".

Now you know one: BASIC! ;-)

tombentley commented 12 years ago

Hah, but someone who didn't know what 0xffff meant could easily guess?!

I'm not particularly fond of 0x (I'm not advocating it), and I agree it isn't obvious unless you've met it before. The one thing it does have in its favour though is historical precedent: Almost any programmer who sees 0xffff will instantly recognize a hex int literal (even if they don't know what 0x 'means'). Surely catering to existing programmers with a poor but widely recognized syntax makes sense if the alternative is a new equally poor syntax that caters to pretty much no one.

What I would like is something like hex('ffffff');.

FroMage commented 12 years ago

OMG, it's so non-confusing that in 30 years programming I never figured this out for myself. That's perhaps the most arbitrary thing I've ever heard.

Frankly the number of things you never figured out in 30 years of programming is the puzzling and most arbitrary thing ;)

Why not "h"?

My guess is that the 'h' is silent?

The x may be arbitrary but it is the strong sound in hexa. It's also a custom that is absolutely trans-language.

http://en.wikipedia.org/wiki/Hexadecimal#Written_representation lists a lot of representations. Most use x, though some like assembly use h. Surprisingly PostScript and Bash have an interesting solution: 16#ffee and by extension 8#644 and 2#1100, but I don't find that more intuitive or readable than the 0x or \x prefixes.

OK, so you don't like the leading 0, I can understand that. So \xff \o664 and \b1100 would work for you? That would be in line with the \u string escape sequence too.

chochos commented 12 years ago

I like the idea of stating an arbitrary base, but the 12345\5 does look ugly. In any case I prefer the PostScript 5#12345, and also stating the base, not the max symbol used (so 10#12345, not 9#12345)
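
As an illustration of the base-first form, here is a minimal Python sketch of a parser for literals written as a decimal base, a # and the digits. The function name is hypothetical, and the limit of base 36 (digits 0-9 then a-z) is an assumption of the sketch, not something proposed in the thread.

```python
import string

# Digit symbols in ascending order; 36 symbols means bases 2..36.
_SYMBOLS = string.digits + string.ascii_lowercase


def parse_radix_literal(text: str) -> int:
    """Parse 'BASE#DIGITS', where BASE is written in decimal."""
    base_text, sep, digits = text.partition("#")
    if not sep or not (base_text.isascii() and base_text.isdecimal()):
        raise ValueError(f"expected BASE#DIGITS, got {text!r}")
    base = int(base_text)
    if not 2 <= base <= len(_SYMBOLS):
        raise ValueError(f"unsupported base {base}")
    allowed = _SYMBOLS[:base]
    if not digits or any(c not in allowed for c in digits.lower()):
        raise ValueError(f"invalid digits for base {base} in {text!r}")
    return int(digits, base)
```

Read this way, 16#ffee, 8#644 and 2#1100 all parse, while 5#12345 would be rejected because 5 is not a valid base-5 digit; the equivalent of 12345\5 in the max-digit notation would be 6#12345.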

gavinking commented 12 years ago

Now you know one: BASIC! ;-)

Haha, is it possible I never used integer division in my years of BASIC programming? :-)

My guess is that the 'h' is silent?

Hahahaha, true—in France at least :-)

So then this syntax is totally natural and intuitive to native speakers of French. Is that an argument for or against?

Surprisingly PostScript and Bash have an interesting solution: 16#ffee and by extension 8#644 and 2#1100.

This to me makes far more sense and is essentially similar to my proposal, except I think F is a better way to indicate base 16 than 16 because:

The second criticism also applies to 2#1100 and 8#644.

So \xff \o664 and \b1100 would work for you? That would be in line with the \u string escape sequence too.

But the \x and the ff still run together without separating punctuation.

FTR, I have always found the JavaScript convention of #AAF0F0 far more easy on the eyes than 0xAAF0F0. It's a pity that we also want binary literals, or I would just say go with the leading #.

tombentley commented 12 years ago

My problem with something like 8#644 is that I find it quite easy to gloss over the # in among all those digits, and so miss the fact that it's not a decimal integer.

Dare I suggest another alternative: x#644, o#644, b#101010?

gavinking commented 12 years ago

My problem with something like 8#644 is that I find it quite easy to gloss over the # in among all those digits.

Well, perhaps, but it's surely harder to gloss over an infix # than over a prefix 0, right?

And, FTR, I don't see any glossing over the F in F#110066, nor do I think it's possible to mistake 1#1010110011 for anything other than a binary literal. I agree that an octal literal is a little harder to recognize.

Dare I suggest another alternative: x#644, o#644, b#101010?

While it's clear that this works well enough for the cases we've identified, and might feel a little more familiar for many programmers, it wouldn't let people express numbers in any other base. Now, of course I recognize that bases other than 10, 2, and 16 almost never occur, but it still feels strange to choose something deliberately limited to four discrete cases when there are a couple of at least arguably-as-good options that cover any natural number.

FroMage commented 12 years ago

[Warning: this comment was written with humour in mind, read it as such]

Guys, the insanity has to stop right there.

Gavin should just be banned from having a say in this matter. After all, he admitted to being a complete stranger to hexadecimal literals and he's likely never used one in his life.

Certain domains come with jargon: special language used to describe special things. This is why there are a hundred ways to call a rubber tube, depending on its size, use and what flows inside. People familiar with a domain are familiar with jargon.

People familiar with binary shit (streams, bytes, bits, bitsets, bitmasks, bitwise operations, CPUs, registers, character sets, encodings, protocols, network and all that shit) know precisely what the prefixes 0x, 0 and 0b stand for. They know it. If you don't know that, you don't need to know. Seriously.

Those people would probably understand what x# (though that's backwards to HTML/XML where it's #x) or \x mean, but they would start trying Ceylon by typing 0xff and see an error and go blog that we're so lame we don't support hexadecimal literals. They would not (willing to bet here) even try something else, as that's just so common a convention that that's how we expect it to be. Same as if we renamed + to an add keyword/operator they would never find it in a million years.

Now, Gavin, never having used binary shit in his life, decided to reallocate the bitwise operators to sets. I find this personally insane, but I figured that if nobody else found it insane that must be me. Besides, I don't really mind calling methods on Integer that deal with bitwise operations, so I figure I'll let him do that and we'll see what comes out. Never in my life have I ever wanted to do things like xor or a negation on a Set (what does ~set even mean?), but hey, I figure Gavin must have had this need so often that he thinks it's more frequent than the same operators on bits. Whatever, I am ready to wait and see.

But F# or anything else that Gavin suggested here? That's bananas. That's crazy bat shit. That's square wheel. That's what would have happened to the world if Picasso had been an engineer. Nobody in their right mind would find it by themselves, and presented with it, I would have said that's a bloody musical literal, whatever that is.

Not only that, but are we seriously going to support base-52 numbers? How the hell do you write one after the initial 52#?

it's a little perverse to indicate base 16 using a base-10 number

I hope this is a joke, right? Because how would you even guess the base if it's spelled inside its own base? What would a# mean? Even in math bases are expressed in decimal.

Hell, I'm pretty sure that in math people never use non-decimal numbers. Whereas people dealing with binary shit? All the fucking time. Every day. Seriously!

So if you go with something that isn't 0x we will have to parse it correctly and provide a meaningful error so that people can, well, not guess, but find it after trial and error. And a similar warning (or even error to be unambiguous) for 0 prefixes. Seriously. If we don't, people are never going to guess what our syntax is. And we want Ceylon to be familiar and easy, right?

Now, I already said I found the 0 prefix for octal error-prone, and besides octal numbers I've never seen in the wild outside of UNIX File modes, but even then I wouldn't rule it out as less frequent than hexadecimal literals out of hand, even though in our case we can definitely abstract those so that Ceylon users never have to deal with octal numbers. So for that reason I'm ready to break the convention and fix it so that there's a less error-prone way to deal with octal literals, such as 0o or \o or even o#.

Now, to get back to what bases we support, let's admit that non-decimal literals are only for binary shit, and so let's only support literals for binary, octal and hexadecimal. Hell if we want to get rid of octal and we never have to use it in our APIs (though using the interop/Java APIs will become slightly harder), then whatever: let's get rid of octal and support only binary and hexa. For all other bases, let's add a method String.parseInt(Integer base) or whatever.

Now, having said that, we already have an escape sequence and it's \ so \x and \b scream to be used. That's what's consistent. That's what's not going to shock people (though we still have to add meaningful parser errors for 0x, 0 and 0b). Let's just use that. Or 0x and 0b which is what people familiar with binary shit (the target demographics for those literals) are going to be expecting, so let's not fuck with them. We don't have to. We removed the octal ambiguity so those are clear. # is not a good choice, and we might end up using it for field references, we don't know yet.

Please let's stop being crazy here, and please Gavin leave the binary shit to people that deal with binary ;)

chochos commented 12 years ago

Damn I've already been bitten by lack of bitwise ops yesterday while trying to solve a Project Euler problem.

The bit about having to parse 0xff anyway just to give an error is so true...

ikasiuk commented 12 years ago

Stef nailed it. And this thread is absolutely hilarious, I love it :-)

gavinking commented 12 years ago

@FroMage Well, that's a nice enough rant, but I'm a little confused as to how I'm supposed to take it. If it's just a rant, well, funny, well done, and let's get back to the discussion of what is the best format for numeric literals in Ceylon.

But if it's more than humor, and I'm supposed to take seriously the points contained in it, then I suppose I would need to respond. Excuse me if that's not the idea, but here goes:

Gavin should just be banned from having a say in this matter. After all, he admitted to being a complete stranger to hexadecimal literals and he's likely never used one in his life.

In fact I use them all the time, for the same purpose that the overwhelming majority of developers using a high-level language use them: to represent colors using RGB. This is likely to be the most common use of hex literals in Ceylon, since Ceylon is a very high-level language, not really intended for pushing bits around. If you're planning on doing a lot of bit-pushing, you need a language like C, C++, Rust, whatever, which gives you proper direct access to memory, something we don't have in a language like Java, C#, Ceylon, JS, Smalltalk, Ruby, etc. Horses for courses.

Certain domains come with jargon: special language used to describe special things. This is why there are a hundred ways to call a rubber tube, depending on its size, use and what flows inside. People familiar with a domain are familiar with jargon.

This is a great argument, nicely expressed, that you should save up and keep for some other argument about some totally different topic. In this particular discussion it's like a poor lost 3 year old wandering around asking for its mummy. The problem is that, as proven by the wikipedia page you linked to above, there is absolutely no standard universally accepted notation or jargon in this area, and therefore we're forced to use our brains and make a choice for ourselves.

If we're looking for what the real hardcore bit-pushers use, then we're talking assembly and according to wikipedia that would be a postfix H or prefix $ for hexadecimal. (Neither of which appears to me to be a crazy notation, and either of which I would be perfectly happy with if hex were the only additional base you were asking for.)

On the other hand, if we want Ceylon to be like other high-level languages, there are a heap of precedents to choose between, including prefix # (css, modula-2), prefix & (BASIC), 16# (Ada, bash, postscript), 16r (Smalltalk, Algol), #x or #16r (Lisp).

Now, of course there is a strong tradition behind prefix 0x: unix shells (but conspicuously not the most popular one), C/C++/C#/Java, and even ML. If this format were "just or almost as good" as the other competing possibilities, then that would be reason enough to stick with it, given the popularity of these languages, and given that Ceylon is cut mostly from the same tradition.

But if it is indeed not the case that this is a "just or almost as good" syntax, then I think we should choose something better.

it's a little perverse to indicate base 16 using a base-10 number

I hope this is a joke, right? Because how would you even guess the base if it's spelled inside its own base?

Eh? I think we have a very well-known and well-defined ordering of the latin alphabet. I actually remember reciting it out loud in grade 1!

What would a# mean?

It wasn't a joke at all. A# would mean, of course, base 11.

Now, to get back to what bases we support, let's admit that non-decimal literals are only for binary shit, and so let's only support literals for binary, octal and hexadecimal.

See that's the fucking problem here. If you were asking for "just" hex literals, which are arguably somewhat general purpose, this would be something I could pretty easily rationalize. But no, you want to bloat out the numeric literal format with three (3) extra special-purpose thingys which are designed for use in a domain that Ceylon is not even an appropriate language for. I tried to rationalize the awful special-caseyness of this to myself by trying to generalize it to one slightly less special-purposey thing, and you ridicule the idea. Fine. Sometimes you can go too far in trying to abstract something, making it deserving of ridicule.

So fine, so now let me turn this around: if you can justify adding three new special purpose things here (including a special separate feature for such an incredibly, incredibly rare and endangered creature as an octal numeral), then there are about 10 new types of literal that I have way, way more justification for adding. Things that developers would use orders of magnitude more often than a fucking octal literal: dates, times, URIs, module version numbers, regexes, cron patterns, etc.

So Stef, where do we draw the line? I understand that, fancying yourself a hairy-chested bit-pusher, you would find octal literals to add a little extra convenience. Well, I would personally find date and time literals extremely convenient, oh and then cron patterns would fit in very nicely with that. Shall we add them too?

(what does ~set even mean?)

FTR, it doesn't mean anything. It's a syntax error. x~y is set complement (subtraction), an operation which I perform all the fucking time, and which you probably do too, even though you might not conceptualize it like that.

gavinking commented 12 years ago

Now, I already said I found the 0 prefix for octal error-prone, and besides octal numbers I've never seen in the wild outside of UNIX File modes [snip] So for that reason I'm ready to break the convention and fix it so that there's a less error-prone way to deal with octal literals, such as 0o or \o or even o#.

Alright, stop trying to have it both ways: do you accept that there is justification for breaking the C tradition, or don't you? If you do accept it, which is the impression I get from this passage, then what's apparent is this: that while I don't deny for a second the claim that I'm totally batshitcrazy, apparently my batshitcraziness isn't actually relevant to the discussion at hand, and we are all looking for a reasonable format for non-decimal numeric literals, that doesn't necessarily follow the tradition of C. If that's the case, then chest-thumpy rants, while great for showing off all that bit-pusher chest hair, don't actually move the discussion forward.

Therefore I ask you to look at it from my perspective: while you may not share my aversion for bullet-pointy lists of special-case language features, you surely recognize that the language definition would get completely out of hand if I added a new special-purpose syntax every time everyone wanted a minor convenience for some special usecase?

quintesse commented 12 years ago

The only thing I find strange is choosing something like F# for the base 16 numbers. I think I would just prefer having the base defined in decimal, just as if you would call a conversion method and pass the base as an argument. Because if you really want a totally flexible system where you could write down a number in base 22 I'd rather read 22#100 than m#100 and having to figure out what the m stands for. If on the other hand we think we'll only ever use binary and hexadecimal (and possibly octal) we could just go for Stef's suggestion with \b and \x which wouldn't add any new syntax to the language.

NB: I don't mind breaking C-tradition here, even though I think it's the first thing people will try when they want to try to write a hex number.

quintesse commented 12 years ago

With respect to bit operations, I think you underestimate how many binary protocols are written in Java (or other high level languages), Gavin, this is not something you can just wish away. And I think it's much more important to have those operations perform as efficiently as possible (within the limits of the language) than set operations that will most of the time be much more expensive.

So actually I think the set operations are pretty cool and I hope you'll be right and people will come to love them, but I also think we need performant bit operations (so not only as method calls on a Byte class for example).

gavinking commented 12 years ago

The only thing I find strange is choosing something like F# for the base 16 numbers.

Fine, I thought it was nice to get the same number of characters as h# or H# or x# with a "non-arbitrary" notation. 16# is slightly more verbose. Not that I especially care...

Because if you really want a totally flexible system where you could write down a number in base 22 I'd rather read 22#100 than m#100 and having to figure out what the m stands for. If on the other hand we think we'll only ever use binary and hexadecimal (and possibly octal) we could just go for Stef's suggestion with \b and \x which wouldn't add any new syntax to the language.

This is certainly a reasonable argument.

NB: I don't mind breaking C-tradition here, even though I think it's the first thing people will try when they want to try to write a hex number.

Sure, but I highly doubt that "write a hex number" is the first thing most people will try to do in Ceylon.

With respect to bit operations, I think you underestimate how many binary protocols are written in Java (or other high level languages), Gavin, this is not something you can just wish away.

Sure, but why would we rewrite this stuff in Ceylon?

And I think it's much more important to have those operations perform as efficiently as possible (within the limits of the language) ... we need performant bit operations (so not only as method calls on a Byte class for example).

Certainly, but just because something is represented as a method call at the language level doesn't mean the compiler can't optimize it.

quintesse commented 12 years ago

Sure, but why would we rewrite this stuff in Ceylon?

It's not always rewriting, the moment we have a Socket people will start writing their own binary protocols. We could always go to Java for that but personally I'd want to prevent that as much as possible. Besides people thought Java could never be used for any of that stuff either and look in what kind of situations it's being used nowadays. I'm guessing only embedded real time systems using microcontrollers can't use Java. I would hope that one day with Ceylon we can do the same.

gavinking commented 12 years ago

So, after some reflection, what I think we should do here is just stick with my original plan for this stuff. If it's important enough to have support for hex/binary literals in Ceylon 1.0, then we should simply plan to add some level of support for single-quoted literals in 1.0. Specifically, I think we should support the following syntax:

value blue = hex '0000FF';

Where hex() is a toplevel function:

Integer hex(literal '^[0-9a-fA-F]+$' Quoted hexString) { ... }

The argument literal would be validated against the regex at compile time.
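
A rough Python sketch of the idea follows: a function whose argument must match a declared pattern. Python has no compile-time hook for this, so the check runs at call time as a stand-in for the compile-time validation described above. The decorator name `literal` borrows the keyword from the proposal; `hex_value` is a hypothetical name chosen to avoid shadowing Python's built-in `hex`.

```python
import functools
import re
from typing import Callable


def literal(pattern: str) -> Callable[[Callable[[str], int]], Callable[[str], int]]:
    """Reject arguments that do not match the declared pattern."""
    compiled = re.compile(pattern)

    def decorate(parse: Callable[[str], int]) -> Callable[[str], int]:
        @functools.wraps(parse)
        def checked(text: str) -> int:
            # Stand-in for the compile-time check in the proposal.
            if compiled.fullmatch(text) is None:
                raise ValueError(f"{text!r} does not match {pattern!r}")
            return parse(text)

        return checked

    return decorate


@literal(r"^[0-9a-fA-F]+$")
def hex_value(hex_string: str) -> int:
    return int(hex_string, 16)


blue = hex_value("0000FF")
```

The pattern travels with the function declaration, which is the property the proposal relies on: a tool that sees the declaration can check every literal argument without running the function body.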

Of course, anyone who wants something less verbose can just use an import alias to write h'0000FF' or x'0000FF' or whatever. Which is actually the built-in literal format in some languages.

FroMage commented 12 years ago

In fact I use them all the time, for the same purpose that the overwhelming majority of developers using a high-level language use them: to represent colors using RGB. This is likely to be the most common use of hex literals in Ceylon, since Ceylon is a very high-level language, not really intended for pushing bits around. If you're planning on doing a lot of bit-pushing, you need a language like C, C++, Rust, whatever, which gives you proper direct access to memory, something we don't have in a language like Java, C#, Ceylon, JS, Smalltalk, Ruby, etc. Horses for courses.

I hope you're not selling down Ceylon as a sort of high-level language which is only good for mental masturbation? I'm pretty sure Java and C# are used in plenty of places where they can do low-level stuff like binary operations, and they do it well and as fast as lower-level C stuff. This is why Java has added things like memory-mapped IO and non-blocking IO with select over the years: because these low-level things can be abstracted slightly higher and Java people hate to rely on native calls to do the dirty jobs.

I sure hope Ceylon will excel in the same area, and frankly I don't see why not. At least, I see no good reason why we should declare it unfit for that.

Now, of course there is a strong tradition behind prefix 0x: unix shells (but conspicuously not the most popular one), C/C++/C#/Java, and even ML. If this format were "just or almost as good" as the other competing possibilities, then that would be reason enough to stick with it, given the popularity of these languages, and given that Ceylon is cut mostly from the same tradition.

But if it is indeed not the case that this is a "just or almost as good" syntax, then I think we should choose something better.

Strawman: I'm not saying it should be 0x or nothing.

What I am saying is that our parser will have to recognize this to be friendly, to help people discover our own syntax. Do you agree with that?

What would a# mean? It wasn't a joke at all. A# would mean, of course, base 11.

WRONG! ;) I asked about a#, which, according to my own convention of using Unicode-character-ordering for bases greater than 36 (0-9A-Z) is a base-37 number (0-9A-Za). What precisely is the convention for a base-345 number? Is ชิ้# the right notation?

So fine, so now let me turn this around: if you can justify adding three new special purpose things here (including a special separate feature for such an incredibly, incredibly rare and endangered creature as an octal numeral), then there are about 10 new types of literal that I have way, way more justification for adding. Things that developers would use orders of magnitude more often than a fucking octal literal: dates, times, URIs, module version numbers, regexes, cron patterns, etc.

Wrong argument, and yes we will support those literals because they are also a problem worth solving, and the single-quoted literal thing is a great idea. But not supporting hex and binary numbers is a regression compared to C and Java. Perhaps we're OK with that, but I haven't seen a good reason thrown around yet why.

(what does ~set even mean?) FTR, it doesn't mean anything. It's a syntax error. x~y is set complement (subtraction), an operation which I perform all the fucking time, and which you probably do too, even though you might not conceptualize it like that.

OK, that's confusing to me. Intuitively I would have used - for subtraction. ~number in bitwise operations returns all 0 bits turned to 1 and vice-versa.
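
The two readings can be put side by side in a short Python sketch. Python happens to use ~ for bitwise complement and - for set difference; the variable names and the 8-bit mask are illustrative assumptions only.

```python
# Bitwise complement: every 0 bit becomes 1 and vice versa.
MASK_8 = 0xFF  # restrict the result to one byte
value = 0b0000_1111
flipped = ~value & MASK_8  # 0b1111_0000

# Python integers are unbounded, so without the mask ~value is -(value + 1).
unmasked = ~value

# Set complement (subtraction): the elements of x that are not in y.
x = {1, 2, 3, 4}
y = {2, 4}
difference = x - y
```

The unary form operates on a single integer, while the set operation needs two operands, which is why a lone ~set has no obvious meaning.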

Look, I already admitted we can abstract away the single use I know of octal literals in Ceylon so I'm pretty sure we don't need them. Hexa and binary we should support. I don't care at all if it's 0x 0h or \x or \h because I find those sane notations. The first because it's traditional in the tradition that we care about (C, Java), the second because it fits with our other quotation notations \i and \u.

That's where I draw the line: a trivial feature to implement and not a can of worms. For the worms, let's use the single-quoted literals.

FroMage commented 12 years ago

If that's the case, then chest-thumpy rants, while great for showing off all that bit-pusher chest hair, don't actually move the discussion forward.

So what, you're the only one allowed to diatribe here? ;)

Therefore I ask you to look at it from my perspective: while you may not share my aversion for bullet-pointy lists of special-case language features, you surely recognize that the language definition would get completely out of hand if I added a new special-purpose syntax every time everyone wanted a minor convenience for some special usecase?

Come on. Binary and hex literals can hardly be called either of those things. You know it.

FroMage commented 12 years ago

So, after some reflection, what I think we should do here is just stick with my original plan for this stuff. If it's important enough to have support for hex/binary literals in Ceylon 1.0, then we should simply plan to add some level of support for single-quoted literals in 1.0.

I suppose I could agree to that, though I find the syntax confusing, for lack of parentheses around the function call… And we could optimise it properly. I find it ironic that you would push the regex syntax (which I love) while you told me many times how much you hated it.

I still don't see why you think that numeric exponents are more important than hexa and binary literals, although I'm pretty sure that you decided they were to you, but I'm ready to let this one pass because I don't care enough.

What I'm convinced about though, is that if we don't support the traditional notation we must parse it and give appropriate error messages and quick-fixes.

FroMage commented 12 years ago

Oh, and I don't see why we can't introduce single-quoted literals right now in M4. What's the reason?

gavinking commented 12 years ago

I suppose I could agree to that, though I find the syntax confusing, for lack of parentheses around the function call…

If everyone is happy with the verbosity of hex('6633AA') then that's also fine by me.

I find it ironic that you would push the regex syntax (which I love) while you told me many times how much you hated it.

Well, I do hate the regex syntax, and if we had something better I would propose we use that. (But it's not unreasonable to use a conceptual regex here—i.e. a scanner.)

Even worse, note that whatever mini-language we use here becomes implicitly part of the language definition itself. Which is more than a little scary, especially when you consider that this language isn't tied to the Java platform and its particular flavor of regex.

So perhaps we should say that the format needs to be defined in some very limited BNF with just character literals, the _ wildcard, .. ranges, |, ?, *, +, and grouping parens i.e. a slightly-Ceylonified version of ANTLR BNF. For example:

Integer hex(literal '(`0`..`9`|`a`..`f`|`A`..`F`)+' Quoted hexString) { ... }

Perhaps we would also need quantification like {3} and {0..2}. Anyway, it would surely be simple enough to translate this mini-language to a Java regex.
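To make the "translate it to a Java regex" idea concrete, here is a minimal sketch under stated assumptions. The names MiniBnf, toRegex and hex are hypothetical, and only the pieces listed above are handled (backtick character literals, .. ranges, the _ wildcard, |, ?, *, + and grouping parens). It is not the actual typechecker implementation, and it uses no ANTLR:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: translates the BNF-like mini-language discussed above
// into a java.util.regex pattern string. Quantifiers like {3} are not handled.
class MiniBnf {
    // A backslash is only a safe literal escape before non-alphanumeric characters.
    static String esc(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    // Reads a character literal of the form `x` starting at index i.
    static char charLiteral(String s, int i) {
        if (i + 2 >= s.length() || s.charAt(i) != '`' || s.charAt(i + 2) != '`') {
            throw new IllegalArgumentException("bad character literal at " + i);
        }
        return s.charAt(i + 1);
    }

    static String toRegex(String bnf) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bnf.length()) {
            char c = bnf.charAt(i);
            if (c == '`') {
                char from = charLiteral(bnf, i);
                i += 3;
                if (bnf.startsWith("..", i)) {
                    // `a`..`f` becomes the character class [a-f]
                    char to = charLiteral(bnf, i + 2);
                    i += 5;
                    out.append('[').append(esc(from)).append('-').append(esc(to)).append(']');
                } else {
                    out.append(esc(from));
                }
            } else if (c == '_') {
                // wildcard (note: '.' skips line terminators by default in Java)
                out.append('.');
                i++;
            } else if ("|?*+()".indexOf(c) >= 0) {
                // these operators mean the same thing in both notations
                out.append(c);
                i++;
            } else if (Character.isWhitespace(c)) {
                // unquoted whitespace is insignificant in the mini-language
                i++;
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
        }
        return out.toString();
    }

    static final Pattern HEX = Pattern.compile(toRegex("(`0`..`9`|`a`..`f`|`A`..`F`)+"));

    // Validates the quoted text against the declared format, then parses it.
    static long hex(String literal) {
        if (!HEX.matcher(literal).matches()) {
            throw new IllegalArgumentException("not a hex literal: " + literal);
        }
        return Long.parseLong(literal, 16);
    }

    public static void main(String[] args) {
        System.out.println(toRegex("(`0`..`9`|`a`..`f`|`A`..`F`)+")); // ([0-9]|[a-f]|[A-F])+
        System.out.println(hex("6633AA"));                            // 6697898
    }
}
```

Because every literal character is quoted in the mini-language, the translator can decide on its own which characters need escaping in the generated regex.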

Oh, and I don't see why we can't introduce single-quoted literals right now in M4. What's the reason?

No reason, except that someone has to do the work and we have to argue over details.

gavinking commented 12 years ago

I find it ironic that you would push the regex syntax (which I love) while you told me many times how much you hated it.

On a tangent, it's worth revisiting exactly why regex is such an awful language. The problem is the basic decision to mix together characters and "meta"-characters without any kind of quoting. So:

  1. When you see a string of metacharacters and punctuation characters, it's very difficult to immediately distinguish which are metacharacters and which aren't. Quick: is ! a metacharacter? is ,? is :?
  2. To match a character which is a regex metacharacter (never an uncommon thing, since regexes very often need to match punctuation), you need to escape the character with \. Ugly. Then if you're embedding the pattern in some other language where \ is an escape character, you wind up with \\. But only for metacharacters, not for other punctuation characters!
  3. Since, contrary to every other language on earth, unquoted whitespace is significant, it's impossible to format the pattern nicely by spacing out the individual "bits", resulting in a totally unreadable string of mush.

And that's just for starters: we haven't even mentioned the problems associated with not being able to have subrules!
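For anyone who wants to see points 2 and 3 in action, here is a small sketch using java.util.regex. The class and helper names (RegexPitfalls, full) are made up for illustration. Note that Java does offer the Pattern.COMMENTS flag, which makes unquoted whitespace insignificant, so point 3 describes the default behaviour rather than a hard limit:

```java
import java.util.regex.Pattern;

class RegexPitfalls {
    // True when the whole input matches the regex under the given flags.
    static boolean full(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // '.' is a metacharacter: unescaped, it matches any character.
        System.out.println(full("a.c", 0, "abc"));   // true
        // Escaping needs \. in the regex, which becomes "\\." in Java source.
        System.out.println(full("a\\.c", 0, "abc")); // false
        System.out.println(full("a\\.c", 0, "a.c")); // true
        // ',' is plain punctuation, so no escape is needed.
        System.out.println(full("a,c", 0, "a,c"));   // true
        // Whitespace is significant by default...
        System.out.println(full("a b", 0, "ab"));    // false
        // ...unless Pattern.COMMENTS is set, which ignores unquoted whitespace.
        System.out.println(full("a b", Pattern.COMMENTS, "ab")); // true
    }
}
```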

gavinking commented 12 years ago

So perhaps we should say that the format needs to be defined in some very limited BNF with just character literals, the _ wildcard, .. ranges, |, ?, *, +, and grouping parens i.e. a slightly-Ceylonified version of ANTLR BNF. For example:

I guess you also want string literals, to match multiple characters at once, for example, "http:".

ikasiuk commented 12 years ago

I suppose I could agree to that, though I find the syntax confusing, for lack of parentheses around the function call…

I actually find hex'a6c1' better than hex('a6c1'). I guess there can be only one argument anyway and so it makes sense that the quotes take the place of the parentheses. Also, we already allow a similar syntax for annotations.

gavinking commented 12 years ago

I guess you also want string literals

Finally, the following more advanced features might eventually turn out to be useful:

So the above pattern could be written as:

(digit|`a`..`f`|`A`..`F`)+

It's still going to be a total PITA to specify this much syntax and semantics as part of the language definition (it's a whole new chapter) but it's a hell of a lot better than having to specify the whole semantics of Perl 5 regexes, or even worse just lamely link to wherever Java defines the semantics of its regex lib (I'm assuming there is a proper definition somewhere).

tombentley commented 12 years ago

I'm assuming there is a proper definition somewhere

FTR, I'm pretty sure there isn't, and moreover there are some very esoteric corners even in the Java regex impl.

As the spec currently stands, you have to specify a 'parser' when declaring a literal. The first thing that parser is going to want to do is break up the string into the various bits... so shouldn't we provide this BNF thingy as a module?

FroMage commented 12 years ago

http://perldoc.perl.org/perlre.html and http://www.pcre.org/ are the specs. Can we put the regex rewrite in another issue? We could start by supporting PCRE like Java to get hex literals now.

tombentley commented 12 years ago

But those aren't really the specs because the Java implementation is a rewrite with some documented differences in the JavaDoc, plus (I dare say) some undocumented differences.

FroMage commented 12 years ago

Sure, I didn't mean to say I knew the Java regex specs. But I do know where the Perl and PCRE ones are ;)

gavinking commented 12 years ago

I'm assuming there is a proper definition somewhere

FTR, I'm pretty sure there isn't

That's sort-of what I was afraid of.

http://perldoc.perl.org/perlre.html and http://www.pcre.org/ are the specs.

Yeah, but:

  1. I don't feel like these are things we can just simply re-use by reference as part of the very definition of the Ceylon language. Perhaps that's too anal—we certainly reuse other things by reference, including, for example, the memory model and integer/floating point semantics of the platform we're executing on—but to me it feels like the syntax for specifying the format of single-quoted literals is something that really does need to be defined in the language spec.
  2. We would still need a truly conforming implementation of one of these specs in Java. I doubt that the Java regex package is truly conforming.

The problem is that the Perl 5 regular expression language is really a pretty complex language in and of itself, much more complex than what we need here.

Can we put the regex rewrite in another issue? We could start by supporting PCRE like Java to get hex literals now.

If the only thing we need is hex literals, I can just hardcode the format into the typechecker. Beyond that, it's a really straightforward job to write a little ANTLR grammar that translates some little BNF-like mini-language to a Java regex. This is a job that would take me like 2-4 hours, I suppose.

The thing is, that if we're going to start using single quoted literals for stuff now, I want to figure out in detail what is the future for single quoted literals, and how they're really going to work, so that we don't do anything we'll need to change later. For example, yesterday I pretty much managed to convince myself that we don't need the Quoted type, and that a single-quoted literal should just simply be a String literal, but without escapes like \n, \{XXXX}, etc.

FroMage commented 12 years ago

Sure, but again, this belongs in another issue, where I'll contribute to the discussion. The implementation might take you half a day but the discussion will take a lot longer.

We could start by hardcoding the format for hex/binary.

chochos commented 12 years ago

As a sidenote, bitwise operations are not the exclusive domain of bit-pushers; they're used in stuff as simple as comparing flags when using stuff such as java.nio. I think there was an interface with the Set operators but it was removed, which is sad because Integer could implement | and &, even ~ and of course ^ as bitwise operators (I don't remember right now if ^ is a Set operator though).

gavinking commented 12 years ago

As a sidenote, bitwise operations are not the exclusive domain of bit-pushers; they're used in stuff as simple as comparing flags when using stuff such as java.nio.

IMO, this is a terrible and extremely error-prone API to expose to users. There are much more extensible and typesafe approaches to this kind of problem, that are much more appropriate in a high-level language.

I think there was an interface with the Set operators but it was removed, which is sad because Integer could implement | and &, even ~ and of course ^ as bitwise operators (I don't remember right now if ^ is a Set operator though).

The issue with this was the variance of the type parameter of the Slots interface. At the time I was not able to resolve the typing issues here, but perhaps I would be able to now.

chochos commented 12 years ago

OK that was just an example off the top of my head; there are several applications for bitwise operations. I use them in my ISO8583 library as well (which I'd like to port to Ceylon at some point), and they may be needed for several protocol implementations, data encoding, etc. It's not as low-level as one might think.

FroMage commented 12 years ago

Guys: until this matter is resolved, I will send hex->decimal requests to @gavinking for manual processing.

First batch: 41 E2 89 A2 CE 91 2E

'kthanx
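In case manual processing is slow, the batch can also be converted mechanically. This is only a sketch with hypothetical names (HexBatch, toDecimal, decodeUtf8), and treating the bytes as UTF-8 is an assumption based on the UTF-8 decoding use case mentioned at the top of this issue:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HexBatch {
    // Parses whitespace-separated hex bytes into their decimal values.
    static int[] toDecimal(String batch) {
        String[] parts = batch.trim().split("\\s+");
        int[] values = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Integer.parseInt(parts[i], 16);
        }
        return values;
    }

    // Assumption: the byte values form a UTF-8 encoded string.
    static String decodeUtf8(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] values = toDecimal("41 E2 89 A2 CE 91 2E");
        System.out.println(Arrays.toString(values)); // [65, 226, 137, 162, 206, 145, 46]
        // Four code points: U+0041, U+2262, U+0391, U+002E
        System.out.println(decodeUtf8(values));
    }
}
```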

FroMage commented 12 years ago

Oh, just stumbled upon this one in Java: 0x1l. Now good luck with that. That's pretty terrible.

FroMage commented 12 years ago

I added temporary support for hex/binary literals: hex('ff') and bin('1001'). The current parser doesn't support annotations on literals so we're stuck with a method call for now.

FroMage commented 12 years ago

So we now have a Binary interface (implemented by Integer) which has all the bitwise operators as methods and the Java backend optimises those method calls on Integer so they're fast as hell.

1.rightLogicalShift(1) and foo.and(hex('ff')) look considerably lame but at least they're fast.

Also I just noticed that without operators we can't support stuff like mask |= 1 << 3. Ooops, sorry I meant mask |= 1.leftLogicalShift(3)… :(
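For comparison, here is roughly what those method calls correspond to on the Java side. This is a sketch, not the actual Binary interface or backend code: the static helpers below only borrow the method names from this comment and map them onto Java's operators.

```java
class BitOps {
    // Java equivalents of the method names used in the comment above.
    static long rightLogicalShift(long value, int bits) { return value >>> bits; }
    static long leftLogicalShift(long value, int bits)  { return value << bits; }
    static long and(long a, long b)                     { return a & b; }
    static long or(long a, long b)                      { return a | b; }

    public static void main(String[] args) {
        System.out.println(rightLogicalShift(1, 1)); // 0
        System.out.println(and(0x1234, 0xff));       // 52 (0x34)

        // With operators, a compound assignment is a single expression...
        long mask = 0;
        mask |= 1L << 3;
        // ...while with methods only, the assignment has to be spelled out.
        long mask2 = 0;
        mask2 = or(mask2, leftLogicalShift(1, 3));
        System.out.println(mask == mask2);           // true, both are 8
    }
}
```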

FroMage commented 12 years ago

Just noticed that while Java only allows 4 bases for literals, it accepts up to 36 for parsing operations: http://docs.oracle.com/javase/7/docs/api/constant-values.html#java.lang.Character.MAX_RADIX

So it accepts them from 2-36, which means numeric + alphabetical case-insensitive. That's a sane limitation and gets us out of the tarpit of base-345 numbers.

I suppose we could provide a syntax for numeric literals in that range. 16#ff, 36#az, but I'm not too keen on the # since it is used as the method reference symbol in Java 8 and we might end up using it for attribute references when we get to agreeing on that syntax.
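A quick sketch of how the 2-36 range behaves with Java's own parsing, and how a radix#digits notation could be handled. RadixDemo, parse and parseRadixLiteral are hypothetical names, and the # notation is just the one floated above, not anything Ceylon supports:

```java
class RadixDemo {
    // Parses digits in any radix Java supports (Character.MIN_RADIX..MAX_RADIX, i.e. 2..36).
    static long parse(String digits, int radix) {
        if (radix < Character.MIN_RADIX || radix > Character.MAX_RADIX) {
            throw new IllegalArgumentException("radix out of range: " + radix);
        }
        return Long.parseLong(digits, radix);
    }

    // Parses the hypothetical radix#digits notation, e.g. 16#ff.
    static long parseRadixLiteral(String literal) {
        int hash = literal.indexOf('#');
        if (hash <= 0) {
            throw new IllegalArgumentException("expected radix#digits: " + literal);
        }
        int radix = Integer.parseInt(literal.substring(0, hash));
        return parse(literal.substring(hash + 1), radix);
    }

    public static void main(String[] args) {
        System.out.println(parseRadixLiteral("16#ff"));  // 255
        System.out.println(parseRadixLiteral("36#az"));  // 395
        System.out.println(parseRadixLiteral("36#AZ"));  // 395, digits are case-insensitive
        System.out.println(parseRadixLiteral("2#1001")); // 9
    }
}
```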