chrivers / transwarp

Transwarp compiler - a python3 implementation of a Simple Type Format parser and renderer
GNU General Public License v3.0

Finalize syntax for version 1.0.0 #1

Open chrivers opened 8 years ago

chrivers commented 8 years ago

(continued from https://github.com/artemis-nerds/protocol-docs/issues/50)

All section names (enum, object, parser, etc) are going to be completely free-form! This means that we can just use the ones we think are the most descriptive.

By this, do you mean that sections don't have a type?

No, rather that I think we will only need 1 section syntax. For example, you can very clearly see the similarities between "struct" and "object". If we make the arguments optional (or find another way to represent the data), there is literally no difference.

Then, the only difference is the name. This is used so the templates can say "give me the enum named foo" or "give me all object definitions". The following would be valid:

flumf BingiBongi
    This

    That
        stuff: really<there, are, no, restrictions>

        well: except_for<the_grammar_rules>

(ok, I might be tired, but I think you get the idea).

Since this is only markup, we can decide on a convention we like for the artemis protocol spec. Other projects might decide on other conventions, and so on. It's neither our duty nor place to make any such restrictions.

To clarify: This is NOT how the system works right now, but it's an idea I've been toying with from the beginning. I'm going to try it out.
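A minimal sketch of what "one section syntax" could look like to a parser, assuming a header is exactly two words at column zero and fields are indented `name: type` lines. The function name `parse_sections` and the returned dict layout are hypothetical and not transwarp's actual API; the nested `This`/`That` sub-blocks from the example above are left out to keep it short.

```python
def parse_sections(text):
    """Group indented `name: type` lines under free-form `kind Name` headers."""
    sections = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no meaning in this sketch
        if not line[0].isspace():
            # Header line: the first word is the free-form kind, the second the name.
            kind, name = line.split()
            current = {"kind": kind, "name": name, "fields": {}}
            sections.append(current)
        elif ":" in line and current is not None:
            field, _, type_expr = line.strip().partition(":")
            current["fields"][field.strip()] = type_expr.strip()
    return sections


source = """\
flumf BingiBongi
    stuff: really<there, are, no, restrictions>
    well: except_for<the_grammar_rules>
"""
sections = parse_sections(source)
print(sections[0]["kind"], sections[0]["name"])
```

The parser never inspects the kind, so a template can later ask for "all sections of kind `flumf`" without the grammar knowing what that word means.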

The compiler doesn't care which stf file a definition comes from, so grouping it into different files is just to make it easier for us.

Yeah, I thought that would be the case, it's just that all the files seem to be named by the types of sections they contain, and then there's this random enum.

True - it sticks out a bit. If we had free-form section names, we could name it something more appropriate. Like:

canonicalnames FrameType
    valueFloat        = 0x0351a5ac
    shipSystemSync    = 0x077e9f3c
    clientConsoles    = 0x19c6e2d4
    gmButton          = 0x26faacb9
...

Just curious, how does the parser decide which stf files to open? Does it just open all of the ones in a folder?

Yes, they're all collected into one big data structure, which is (at the moment) a tree of all defined sections.

Conceptually, this would be:

+ root
|
+ enums
| | AlertStatus
| | AudioCommand
| \ ...
|
+ structs
| \ Ship
|
+ packets
| + ClientPacket
| |  ...
| \ ServerPacket
|    ...
|
\ objects
  + Anomaly
  \ Base
  ...

This extends all the way into the fields, but I think this is enough ascii-art to get the idea across :)
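As a rough illustration of that merge step, here is a sketch that folds definitions from several files into one tree keyed by section kind and name. The file names, the `(kind, name, body)` tuples and `build_tree` are all assumptions made for the example, not the compiler's real data model.

```python
from collections import defaultdict


def build_tree(files):
    """Merge (kind, name, body) definitions from several files into one tree."""
    root = defaultdict(dict)
    for filename, definitions in files.items():
        for kind, name, body in definitions:
            if name in root[kind]:
                raise ValueError(f"{kind} {name} defined twice (second in {filename})")
            root[kind][name] = body
    return root


# Hypothetical file names; the source file leaves no trace in the tree.
files = {
    "enums.stf": [("enum", "AlertStatus", {}), ("enum", "AudioCommand", {})],
    "structs.stf": [("struct", "Ship", {})],
    "objects.stf": [("object", "Anomaly", {}), ("object", "Base", {})],
}
tree = build_tree(files)
print(sorted(tree["enum"]))
```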

I put it in parser.stf, because that's where it is used

But AFAIK it's not actually used by the STF files at all, but instead read by the parser, which is why I think it should be a different type (all other enums are used by the program) - but I suppose this doesn't matter anymore with the free-form sections.

Sorry, "used" was a poor term here. It's where a human would read it, when writing that section ;-)

It's (textually) where it is referenced, so it made maintenance easier. But as I said, unless we plan on giving the files namespaces themselves (perhaps that's not a bad idea!), then it doesn't matter to the output.

but there's really no need for the stf itself to be. It's really just a markup language.

Yeah - I suppose my stance on this project has been that we should make the structure language specifically to describe the Artemis formats, as it allows us to simplify some things in the documentation. But of course, there are pros and cons to each side.

I think there's very little (if any at all) to gain from taking Artemis-specific shortcuts. I'm not even sure what it would be?

I agree that if it constrains us from reaching a goal, we could revise the position. But I don't think that will be the case, since it works, today :)

Can you send me an email, and we can coordinate further?

Sent :)

Received!

You know, that bugs me too. I think the best option would be to find a short, memorable name. That's what I tried to do with the other Artemis-related projects I made. Suggestions welcome, it's still in beta :)

If you're wanting to publicise this and encourage people to use it for other projects, IMO a descriptive name for the format would probably be best (or at least a name that somewhat gives away what the format is for) - e.g. for a Star Trek layman, Isolinear Chips probably doesn't mean a whole lot (although in this case that's probably fine as it's not really meant as a major public project).

Yeah, I allowed myself a little nerding out on the naming there, since the target audience is still the Artemis community.

Then again, except as an example, there are no ties from isolinear chips to the compiler, so I don't imagine people who are not looking for the artemis spec will bump into it, in the future.

So this would be:

object Base: BITMASK_SIZE = 2

# Name (bit 1.1, string)
#
# The name assigned to this base. In standard, non-custom
# scenarios, base names will be unique, but there is no
# guarantee that the same will be true in custom scenarios.
name: string

# Shields (bit 1.2, float)
#
# The current strength of the base's shields.
front_shields: f32

Oh gosh, now we have equals signs and colons in the same sections... I get that it's to differentiate between constants and types, but it also seems very easy to mix these up while writing - perhaps there's a better way to separate these? (e.g. surround a section of constants with %, although this might be a bit ugly for enums)

Well, one simple rule could keep this controlled: All constants must come before all types. That would catch the majority of oops-my-typing kind of errors.
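That rule is cheap to check mechanically. A sketch, assuming the checker is handed only the body lines of one section, that `name = value` marks a constant and `name: type` marks a typed field; `check_constants_first` is a made-up helper name.

```python
def check_constants_first(lines):
    """Return the 1-based numbers of constant lines that follow a typed field."""
    seen_field = False
    errors = []
    for number, line in enumerate(lines, start=1):
        text = line.split("#", 1)[0].strip()  # ignore comments
        if not text:
            continue
        # Decide by whichever separator appears first on the line.
        colon, equals = text.find(":"), text.find("=")
        is_constant = equals != -1 and (colon == -1 or equals < colon)
        if is_constant and seen_field:
            errors.append(number)
        elif not is_constant and colon != -1:
            seen_field = True
    return errors


good = ["BITMASK_SIZE = 2", "name: string", "front_shields: f32"]
bad = ["name: string", "BITMASK_SIZE = 2"]
print(check_constants_first(good), check_constants_first(bad))
```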

We really need some way to decorate the sections with non-body information, but here's another possibility:

typename SectionName(param1=value1, param2=value2)
    foo: u32

This could work, too. I'm afraid it could get unwieldy though, and we lose the generality of it. It's going to be quite hard to parse this form in more than one line, and it could lead to excruciatingly long lines.

We also can't forego the names (like we do on types), since we want optional values. For example:

struct Ship210
    _max_version = 210

struct Ship240
    _min_version = 240

This would be a fairly clean way to add versioning information to sections.
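On the template side, picking the right definition could then be a simple filter. This sketch assumes both bounds are inclusive and that a missing bound means "unbounded"; neither is settled above, and `select_for_version` is a hypothetical helper.

```python
def select_for_version(sections, version):
    """Pick the sections whose optional _min_version/_max_version allow `version`."""
    chosen = []
    for name, constants in sections.items():
        low = constants.get("_min_version", float("-inf"))
        high = constants.get("_max_version", float("inf"))
        if low <= version <= high:
            chosen.append(name)
    return chosen


sections = {
    "Ship210": {"_max_version": 210},
    "Ship240": {"_min_version": 240},
}
print(select_for_version(sections, 200), select_for_version(sections, 250))
```

A version between the two ranges (say 220) selects nothing, which is probably something a validator should flag.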

there's no way to tell if, for instance, "f32" is the name of a struct, a built-in primitive, or something completely different?

Couldn't you just disallow using those names? It seems to me this is a bit like using a function in a language that's defined in the standard API, as opposed to one you've defined - sections could just be thought of as types defined in the program. I guess this depends on how you parse the format.

But that's the point - right now there are no defined types!

The type parser literally does not care what you write, as long as it is within the syntax. This allows the templates (and thus, the end-user project) to come up with a type description they like.

To clarify - we certainly could add a list of standard type names (u8, u16, u32, f32, string, etc.. ) and then ban those, but it doesn't really solve the problem.

For example, "ConsoleStatus" could refer either to an enum, or to ServerPacket::ConsoleStatus.

We have to find a nice unambiguous way to point to places in the namespace.
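To make the ambiguity concrete, here is a sketch of a lookup that returns every place a bare name could point to. The flat `tree` layout and the `resolve` helper are assumptions for illustration; only the two `ConsoleStatus` entries and `ServerPacket::EngGridUpdate` come from this discussion.

```python
def resolve(tree, reference):
    """Return every (kind, path) in the tree that the reference could mean."""
    matches = []
    for kind, entries in tree.items():
        for path in entries:
            # A reference matches a full path or the last `::` component of one.
            if path == reference or path.split("::")[-1] == reference:
                matches.append((kind, path))
    return matches


tree = {
    "enum": ["ConsoleStatus"],
    "packet": ["ServerPacket::ConsoleStatus", "ServerPacket::EngGridUpdate"],
}
print(resolve(tree, "ConsoleStatus"))
print(resolve(tree, "ServerPacket::ConsoleStatus"))
```

The bare name yields two candidates, while the qualified path yields one, so a compiler could at least report the ambiguity instead of guessing.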

I agree that the current solution isn't optimal, but stripping away just "struct" seems odd, and quite arbitrary. It also in a very real way makes the templates more complicated to write. Either that, or we need to have opinions about what constitute "standard types", but I don't like that.

It turns out the artemis protocol is so bloody inconsistent that this isn't true!

Oh wow, okay... It definitely wasn't like that back when I was working on my old parser, but clearly I haven't kept up-to-date with the documentation. After I read this I was considering pitching the idea of having enums specify a 'base type' and then places where the enum is used that don't use that type would provide their own, but I think that'd get far too annoying to work on and probably complicate the compiler considerably.

Agreed! The current solution is not perfect, but it's the least-bothersome one I could find on short notice :)

If you don't mind me asking, do you intend to implement a templating system/some kind of generator as well? Certainly, more implementations are better than fewer, but isn't it a little bit of reinventing the wheel we just made? :)

Potentially, I'm not yet sure - I may end up making a packet parser that reads these structure files and 'interprets' them on-the-fly (i.e no code gen necessary). Also, since we want other people working on Artemis-related projects to be able to use the format, I thought it'd be a good test to make sure people other than you can write a parser that makes sense ;P

That's certainly an admirable goal - perhaps we should split the grammar and parsing portions into a separate project once we agree on a version 1.0 syntax.

Regarding the compiler, I can only say I was surprised by how long it took to go from working compiler (which didn't take long at all), to polished ready-to-run tool. I'm thrilled to see where we can take this next, and I hope we can all work to improve the syntax and the tools for everybody :)

chrivers commented 8 years ago

Ping @mrfishie @rjwut @IvanSanches @NoseyNick :)

NoseyNick commented 8 years ago

Just chiming in with my 2 cents:

  1. The objects don't NEED to tell you how many bytes are in the bitfield, you CAN (and my Perl parser DOES) calculate it from how many optional items are in it. An object has 10 attributes, that is >8 but <16 so needs 16bits = 2bytes for the bitfield. That said, I can see why some languages/parsers may benefit from knowing that in advance.
  2. This feels a bit like the CRCs for PacketTypes, worth mentioning HOW they are calculated, but I'm not 100% certain we should FORCE people to (re)implement JamCRC vs just having a pre-calculated list. I'm willing to be persuaded in either direction though
  3. (I said 2c but I always give 150%!) Another example of inconsistent ENUMs, ConsoleType - Int (u32) in SetConsole, Byte (u8) in ConsoleStatus, but an Int (u32) WITH a +1 offset in GameMaster Messages (GRRR)

chrivers commented 8 years ago

Hey nick, welcome to the new discussion :)

1) ..nope ;-)

The EngineeringConsole object contains exactly 24 objects (3 bytes), but has a bit mask of 4 bytes. There's just no way to know this.

You're right that we can calculate the minimum size, but that's not enough here.

Of course, we could add an "unknown" field with an "unknown" type just to force the bitfield to be 4 bytes, but that feels.. dirty :D
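The arithmetic behind this point, as a small sketch: the field count gives only a lower bound on the mask size, so a declared size can be validated but not derived. `min_mask_bytes` and `check_mask` are hypothetical helper names.

```python
def min_mask_bytes(field_count):
    """Smallest number of whole bytes that can hold one bit per field."""
    return (field_count + 7) // 8


def check_mask(field_count, declared_bytes):
    """Accept any declared size that is at least the computed minimum."""
    if declared_bytes < min_mask_bytes(field_count):
        raise ValueError("declared mask is too small for the fields")
    return declared_bytes


print(min_mask_bytes(10))  # Nick's example: 10 attributes
print(min_mask_bytes(24))  # EngineeringConsole: 24 fields
print(check_mask(24, 4))   # ...but the mask on the wire is 4 bytes
```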

2) I agree that forcing people to implement JamCRC to use the docs would be needlessly complicated.

The current style is both an attempt at using the canonical names for something, as well as an attempt at balancing the use of symbolic names and endless "random" (looking) hex digits. I'm open to suggestions, certainly :)
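For anyone who does want to compute the values rather than copy them, JamCRC is the ordinary CRC-32 without the final bit inversion, so it can be derived from `zlib.crc32`. Exactly which bytes are hashed to get the table values above (such as 0x0351a5ac) is not stated in this thread, so this sketch only uses the standard CRC check input `123456789`.

```python
import zlib


def jamcrc(data: bytes) -> int:
    """JAMCRC is CRC-32 with the final XOR (bit inversion) left out."""
    return zlib.crc32(data) ^ 0xFFFFFFFF


# Standard check input used by CRC catalogues, not an Artemis packet name.
print(hex(jamcrc(b"123456789")))
```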

3) :D

Yes! It's driving me up the wall too. The "0 or +1" style is actually used in at least 2 places, and I think it would be much cleaner just to give it a markup. For example, console_type is currently:

console_type: option<enum<u32, ConsoleType>>

if we made this:

console_type: nullable<enum<u32, ConsoleType>>

Then we have clearly marked which fields this goes for, as well as the size.

I agree that it would be { 1. easier 2. more logical 3. less rage-inducing } if we could bind the encoding (u8/u32) to the enum type, but the protocol simply doesn't allow that. Tough noogies.
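One way a generated reader might treat such a `nullable<enum<u32, ...>>` field, as a sketch only: it assumes a little-endian u32 where 0 means "no value" and any other number is the enum index plus one. The `CONSOLE_TYPES` list and its ordering are invented for illustration and are not taken from the protocol docs.

```python
import struct

# Hypothetical names and ordering, for illustration only.
CONSOLE_TYPES = ["MainScreen", "Helm", "Weapons"]


def read_nullable_enum(buf, offset, names):
    """Read a little-endian u32 where 0 means 'no value' and n means names[n - 1]."""
    (raw,) = struct.unpack_from("<I", buf, offset)
    if raw == 0:
        return None
    return names[raw - 1]


print(read_nullable_enum(struct.pack("<I", 0), 0, CONSOLE_TYPES))
print(read_nullable_enum(struct.pack("<I", 2), 0, CONSOLE_TYPES))
```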

NoseyNick commented 8 years ago

RE Combining Structs and Objects: Agreed, they are VERY similar, and in my perl I subclass/superclass the two. The big difference being the bitfields for optional later-bits. Would it be possible/practical/silly/good/bad/ugly to have something like:

struct ThingObj
   _bitfield : sizedarray<u8, 3>
  # the bitfield in this obj is 3 bytes long, for an object with 17-to-24 attributes

  foo : optional<0, u32>
  # attribute "foo" is an optional u32.
  # Its presence/absence is indicated by bit 0 in the _bitfield

  bar : optional<1, u8>
  baz : optional<2, string>
  ...
  wibble : optional<10, f32>
  # attribute "wibble" is an optional f32, see bit 10 in the _bitfield

If this is a bit too artemis-specific, maybe a more generic

 optional<condition, type>

where "condition" is a bit more of an expression, maybe "_bitfield&0x01", but maybe expressions are too language-specific :-/ Come to think of it, perhaps "parser" is also a struct with optional bits, except...

struct ServerParser
  FrameType: enum<u32, FrameType>
  shipSystemSync : optional<FrameType=shipSystemSync, ServerPacket::EngGridUpdate>
  clientConsoles : optional<FrameType=clientConsoles, ServerPacket::ConsoleStatus>
  ...

[edited a few times to fix markdown syntax, and u32 enum, sorry]
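Read as code, the `ServerParser` idea above is a dispatch on the frame type. A sketch using the two FrameType values listed earlier in this issue; the little-endian u32 read and the `dispatch` helper are assumptions.

```python
import struct

# FrameType value -> packet definition, taken from the examples in this thread.
FRAME_TYPES = {
    0x077E9F3C: "ServerPacket::EngGridUpdate",  # shipSystemSync
    0x19C6E2D4: "ServerPacket::ConsoleStatus",  # clientConsoles
}


def dispatch(frame):
    """Read the u32 frame type and return (packet name, remaining payload)."""
    (frame_type,) = struct.unpack_from("<I", frame, 0)
    try:
        return FRAME_TYPES[frame_type], frame[4:]
    except KeyError:
        raise ValueError(f"unknown frame type 0x{frame_type:08x}") from None


print(dispatch(struct.pack("<I", 0x19C6E2D4) + b"\x01"))
```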

NoseyNick commented 8 years ago

The EngineeringConsole object contains exactly 24 objects (3 bytes), but has a bit mask of 4 bytes. There's just no way to know this.

Hmmm, I wonder why MINE has 32 attributes, matching the 4 bytes:

    0x03 => ['EngCons',
        BeamsHeat=>'f<', TorpsHeat=>'f<', SenseHeat=>'f<', ManuvHeat=>'f<',
        ImpulHeat=>'f<', DriveHeat=>'f<', FShldHeat=>'f<', AShldHeat=>'f<',
        BeamsEner=>'f<', TorpsEner=>'f<', SenseEner=>'f<', ManuvEner=>'f<',
        ImpulEner=>'f<', DriveEner=>'f<', FShldEner=>'f<', AShldEner=>'f<',
        BeamsCool=>'C',  TorpsCool=>'C',  SenseCool=>'C',  ManuvCool=>'C',
        ImpulCool=>'C',  DriveCool=>'C',  FShldCool=>'C',  AShldCool=>'C',
        BeamsUnkn=>'V',  TorpsUnkn=>'V',  SenseUnkn=>'V',  ManuvUnkn=>'V',
        ImpulUnkn=>'V',  DriveUnkn=>'V',  FShldUnkn=>'V',  AShldUnkn=>'V',
    ],

Am I more up-to-date than the docs, or am I behind the docs, or were you referring to some other object, or... something else? [ never mind, I should read my own code before I paste. My last 8 are dummies, perhaps to make it 4 bytes. Checking my entire corpus, I find exactly ZERO instances of those bits being set / fields being sent :-( ]

chrivers commented 8 years ago

@NoseyNick I agree, we definitely have to merge the syntax in some way :)

Actually, your example is not bad - and it's entirely valid within the current syntax! I don't personally like the optional<bit, type> syntax, I think it's a bit too verbose (and error prone), but it's entirely valid!

One of the big advantages right now is that it's possible to verify a new layout of bit masks just by shuffling lines around, and regenerating the templates. You would lose that advantage if you had to manually update all lines in the object :)
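A sketch of why implicit ordering is convenient: if bit positions come from declaration order, reordering the list is the whole change. The field names are borrowed from Nick's example; the little-endian mask, the least-significant-bit-first numbering and `read_object` are assumptions for illustration.

```python
import struct

# Field order doubles as bit order: the first entry is bit 0, the next bit 1, ...
FIELDS = [("foo", "<I"), ("bar", "<B"), ("wibble", "<f")]


def read_object(buf, mask_bytes=1):
    """Read a little-endian bit mask, then only the fields whose bit is set."""
    mask = int.from_bytes(buf[:mask_bytes], "little")
    offset = mask_bytes
    values = {}
    for bit, (name, fmt) in enumerate(FIELDS):
        if mask & (1 << bit):
            (values[name],) = struct.unpack_from(fmt, buf, offset)
            offset += struct.calcsize(fmt)
    return values


# Mask 0b101: foo and wibble are present, bar is skipped.
data = bytes([0b101]) + struct.pack("<I", 7) + struct.pack("<f", 1.5)
print(read_object(data))
```

With explicit `optional<bit, type>` markup, the same experiment would mean renumbering every line by hand.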

We could definitely express the parser that way! Just remember, it's important we don't call it a "struct", since it will then be nearly impossible to know that it's not the same "kind of thing". For example, it would then end up in the "structs" table in the documentation. Maybe something other than "parser" if people are not fond of that word, but not something we already use :)

Regarding the 4 bytes mask - that was one of the things I fixed in isolinear, that I didn't get around to making an HTML PR for yet. So if you read it in isolinear, it's because I fixed it there :)

chrivers commented 8 years ago

This is going a bit off-topic, at least in places.

For questions purely about the Artemis protocol, please open an issue on isolinear:

https://github.com/chrivers/isolinear-chips/issues/new

For everything related to syntax, parsing and using stf, that belongs here no problem :)

chrivers commented 8 years ago

btw, I'm currently working on turning the existing index.html into a template, to show how it could be done, and to serve as an example:

https://github.com/chrivers/protocol-docs/tree/transwarp

Not a huge amount of progress yet, but a little getting-started guide, and a few converting items so far :)

mrfishie commented 8 years ago

We definitely need to merge the syntax, but I'm not really a fan of optional</> (what do we call these, by the way? functions?) - it will get very messy fast.

For syntax merging, perhaps some kind of 'or' syntax (YACC/BNF style) would be useful? Stealing YACC grammar(ish), here's an example:

ServerParser: FrameType::shipSystemSync(u32) ServerPacket::EngGridUpdate
            | FrameType::clientConsoles(u32) ServerPacket::ConsoleStatus
  ...

Of course we'd need a better way to represent this syntactically (the syntax above isn't compatible with the current syntax and also probably isn't compatible with documentation), but this would definitely increase the flexibility of the language. Unfortunately it would also probably increase the complexity of resulting code (you'd need to effectively create an LR parser).

chrivers commented 8 years ago

@mrfishie those are just parameters.

Type parameters have no pre-specified meaning, they simply describe a tree structure. Very much like an alternative syntax for S-expressions, actually.

I'm definitely a fan of BNF, but I think that's a little overkill in terms of complexity. Right now, we have a solution that might have rough edges, but it actually does work :)

Perhaps we are going about this the wrong way. Let's consider what kinds of information we want to store, and what goals we have. Once we agree on these, I think it will be very easy to finalize.

Data:

Goals:

What do you think about this? I feel we can get there with modifications to the current design.

chrivers commented 8 years ago

@mrfishie a clarification on types.

It really is just a compact tree structure, and nothing else. There's no special meaning attached to it. Perhaps there should be, so the compiler can prepare the data more for the templates?

For example, someone could write machine<spring, specialnumber<1337>, lever<color<red>>>.. that's an extreme example, but this could then be used by the templates for whatever purpose.
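To show how little machinery that takes, here is a sketch of a recursive parser for the angle-bracket notation, plus a renderer that prints the same tree as an S-expression. `parse_type` and `to_sexpr` are hypothetical names and this is not transwarp's actual parser; names and parameters are kept as plain strings with no meaning attached.

```python
def parse_type(text):
    """Parse `name<arg, arg<...>>` into a (name, [children]) tuple."""
    pos = 0

    def parse_node():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "<>,":
            pos += 1
        name = text[start:pos].strip()
        children = []
        if pos < len(text) and text[pos] == "<":
            pos += 1  # consume '<'
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unmatched '<'")
                closing = text[pos]
                pos += 1
                if closing == ">":
                    break
                if closing != ",":
                    raise ValueError(f"unexpected {closing!r} at position {pos - 1}")
        return (name, children)

    node = parse_node()
    if text[pos:].strip():
        raise ValueError(f"unexpected text at position {pos}")
    return node


def to_sexpr(node):
    """Render the same tree as an S-expression."""
    name, children = node
    if not children:
        return name
    return "(" + " ".join([name] + [to_sexpr(c) for c in children]) + ")"


tree = parse_type("machine<spring, specialnumber<1337>, lever<color<red>>>")
print(to_sexpr(tree))
```

A stray closing bracket, as in `a<b>>`, is rejected with a `ValueError` instead of being silently accepted.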

mrfishie commented 8 years ago

Very much like an alternative syntax for S-expressions, actually.

S-expressions!!

Maybe we should just use those.. I think (magic (machine spring 1337) (lever (color red))) is far more readable than magic<machine<spring, 1337>, lever<color<red>>>, don't you?

/s

(Side note: your extreme example has an unmatched right angle bracket)

I'm definitely a fan of BNF, but I think that's a little overkill in terms of complexity. Right now, we have a solution that might have rough edges, but it actually does work :)

Yes, it is definitely overkill. I was just trying to get some discussion going on potential branching methods.

Perhaps there should be, so the compiler can prepare the data more for the templates?

Like some kind of set of inbuilt types that are processed by the compiler?

chrivers commented 8 years ago

Very much like an alternative syntax for S-expressions, actually.

S-expressions!!

Maybe we should just use those.. I think (magic (machine spring 1337) (lever (color red))) is far more readable than magic<machine<spring, 1337>, lever<color<red>>>, don't you?

/s

:D

(Side note: your extreme example has an unmatched right angle bracket)

Oops! Thx, I fixed that now :)

I'm definitely a fan of BNF, but I think that's a little overkill in terms of complexity. Right now, we have a solution that might have rough edges, but it actually does work :)

Yes, it is definitely overkill. I was just trying to get some discussion going on potential branching methods.

Definitely!

As I see it, we can basically go in 2 directions. Either we go all out, and make a full BNF description, and a real import system, type rules, etc.

Or, we find a more down-to-earth approach, and live with a more soft type system and validator. It will then be more up to the user to employ good practices, to keep the system in check.

Personally, I'm almost always a proponent of the first solution, but I honestly think it might be overkill here.

It is enticing, though. If we could define the meta-structure of object types in the language, and then declare object of that type, we could enforce order on the code.

For example, and this is purely a thought experiment:

## Declare primitive types, that can be referenced as-is without error.
$primitive i8, i16, i32, i64;
$primitive u8, u16, u32, u64;
# and so on...

## Declare a block type. Here we declare that enum always has a name, and never takes arguments
$declare enum $ident()
  ## Only one type of content: ident -> int literal mapping
  $ident: $int

## Here we declare objects. They have a name, and must take a maskbytes=$int argument
$declare object $ident(maskbytes=$int)
  ## Objects are ident -> type mappings
  $ident: $type

## Parsers take a read=$type argument
$declare parser $ident(read=$type)
  ## ..either int -> type:
  $ident: $type
  ## ..or a referenced constant (like FrameType::foo) -> type:
  $const: $type

Now, this is very, very rough of course. But that's the direction we could go in. It would be a very interesting tool, but it's certainly a completely different scope than the original idea :)

Perhaps there should be, so the compiler can prepare the data more for the templates?

Like some kind of set of inbuilt types that are processed by the compiler?

I'm not a fan of built-in types, but they could be marked as being recognized, in the source file - like with the primitive declarations above. That way, the compiler could decide at compile-time if a reference is known or not. That would be a nice feature.

Otherwise, every spelling error just becomes a reference to a new "kind" of primitive, which of course doesn't exist (or even worse, does, but by coincidence)
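A sketch of that compile-time check, assuming the source file has declared its own primitives and that every identifier inside a type expression must be either a declared primitive or a defined section. `undefined_references` is a made-up helper, and treating `enum` as a declared name is an assumption of this example.

```python
import re


def undefined_references(primitives, sections, fields):
    """List (field, name) pairs whose type mentions an undeclared name."""
    known = set(primitives) | set(sections)
    problems = []
    for field, type_expr in fields.items():
        for name in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", type_expr):
            if name not in known:
                problems.append((field, name))
    return problems


primitives = {"u8", "u16", "u32", "f32", "string", "enum"}
sections = {"ConsoleType"}
fields = {
    "console_type": "enum<u32, ConsoleType>",
    "shields": "f23",  # deliberate typo for f32
}
print(undefined_references(primitives, sections, fields))
```

The typo is reported as an unknown name instead of quietly becoming a new kind of primitive.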