artemis-nerds / protocol-docs

Unofficial documentation for the Artemis network and file protocols, written by the Artemis community
https://artemis-nerds.github.io/protocol-docs/
MIT License

Convert protocol description to structured data format, generate everything from that #50

Open chrivers opened 8 years ago

mrfishie commented 8 years ago

Here's the XML documentation format I started working on a while ago. I think it should be pretty self-explanatory, but I can provide some more info on what's going on if required.

chrivers commented 8 years ago

Definitely a good idea! Now, it's nothing personal but I don't think I can handle xml for one more nanosecond :D

Humans are going to be writing and maintaining this spec, so the format should be easy for humans to work with. XML completely fails that criterion.

I'm thinking either a modern lightweight format like {JSON, YAML, TOML}, or a simple homebrew that I can quickly whip up a parser for. It's not rocket surgery, after all :)

I'll see if I can make an example later.

mrfishie commented 8 years ago

I wouldn't say that XML completely fails the criterion (sometimes it can be useful, especially when paired with a good editor... or maybe that's just me), but yes, I agree.

Going with the 'easy to work with' criterion, I'm not sure JSON would be the best to use - it gets ugly fast, and I find it can be difficult to read (an alternative would be CSON, though I'm not sure how well supported it is). I've previously found YAML to be a bit too loose syntactically for my liking (but again, maybe that's just me and I just prefer strong nested structures like in XML), but IMO it's definitely one to consider. I've never actually used TOML before (just had a look at it now); it looks interesting and could potentially work well if we do it right.

As for a custom format, that definitely could work. Since part of the goal of this project is to allow software to automatically generate protocol parsers from the docs, however, we might want to avoid this as it will add overhead to implementing a parser for the documentation (whereas parsers for JSON/YAML/TOML would probably be available for most popular languages).

I'll try to fiddle around with potential structures that I think could work with a few of these options sometime in the near future so we can compare how they'd look.

chrivers commented 8 years ago

I agree that CSON is slightly better than JSON, and I'm also worried about the getting-ugly-fast potential of JSON. I'm much more worried about XML, however ;) Frankly, I think a custom format is really pretty easy. We have quite simple needs, but some of them (like protocol versioning, and mask-bytes) are hard to express in existing formats. Here's my idea for a really easily readable format:

enum MainScreenView
  Forward                        = 0x00
  Port                           = 0x01
  Starboard                      = 0x02
  Aft                            = 0x03
  Tactical                       = 0x04
  LongRange                      = 0x05
  Status                         = 0x06

enum ObjectType
  EndOfObjectUpdatePacket        = 0x00
  PlayerShip                     = 0x01
  WeaponsConsole                 = 0x02
  EngineeringConsole             = 0x03
  PlayerShipUpgrades             = 0x04
  NPCShipEnemyOrCivilian         = 0x05
  Base                           = 0x06
  Mine                           = 0x07
  Anomaly                        = 0x08
  Nebula                         = 0x0a
  Torpedo                        = 0x0b
  BlackHole                      = 0x0c
  Asteroid                       = 0x0d
  GenericMesh                    = 0x0e
  Creature                       = 0x0f
  Drone                          = 0x10

This is without comments, so far, but that would easily be added. Here's a preliminary example of the client packets:

protocol ClientPacket
    packet AudioCommand
        # The ID for the audio message. This is given by the
        # IncomingAudioPacket.
        audio_id: i32

        # The desired action to perform.
        audio_command: enum8<AudioCommand>

    packet CaptainSelect
        # The object ID for the new target, or 1 if the target has been cleared.
        target_id: i32
...

And so on. The types I imagine are the following:

u8, u16, u32, u64 (unsigned integers)
i8, i16, i32, i64 (signed integers)
f32 (32-bit floats)
String
enum8<Type>, enum32<Type> (8- or 32-bit enum of specified type)

Almost the entire packet list can be specified like this. We also need to specify certain structs (for example, ships). It is a similar process:

struct Ship
    # Whether the ship has warp or jump drive
    drive_type: DriveType

    # ID from vesselData.xml
    ship_type: u32

    !version: min(2.3)
    accent_color: u32

    __unknown__: u32

    # The name of the ship
    name: String

A couple of points:

1) This type system in general avoids a lot of the ambiguity about "int" and "enum". Sizes are always specified explicitly.

2) It's really simple to write a parser for this. I could write a small Python script that generates the documentation from the database, if we agree to go in this direction.

3) Here we see the "!version" specifier. It marks the next field with meta-information about which versions it is valid for. Since this is just a data structure, different output generators can do different things with the version info. For example, the documentation would use it to mark fields visually, while a protocol generator might use it to do conditional parsing of structures, or generate multiple versions, one for each protocol version.

4) The name "unknown" is reserved, and will be used only for unknown fields. We can then do any kind of custom filtering we like (protocol code might just skip bytes, instead of storing them, etc)

I haven't talked about bitfields, static/dynamic arrays, and delta objects, but I have a plan for those too.
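To give a feel for how little machinery this needs, here's a rough Python sketch (names hypothetical, nothing final) that parses the enum blocks above with just two regexes:

```python
import re

# Sketch only: two regexes cover the proposed enum syntax.
ENUM_HEADER = re.compile(r"^enum (\w+)$")
ENUM_VALUE = re.compile(r"^\s+(\w+)\s*=\s*(0x[0-9a-fA-F]+)$")

def parse_enums(text):
    """Return {enum_name: {member_name: int_value}} from the proposed format."""
    enums, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if m := ENUM_HEADER.match(line):
            current = enums.setdefault(m.group(1), {})
        elif m := ENUM_VALUE.match(line):
            current[m.group(1)] = int(m.group(2), 16)
    return enums

spec = """
enum MainScreenView
  Forward    = 0x00
  Port       = 0x01
  Starboard  = 0x02
"""

print(parse_enums(spec))
# {'MainScreenView': {'Forward': 0, 'Port': 1, 'Starboard': 2}}
```

The packet/struct blocks would just add a couple more patterns to the same list.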

What do you guys think of this syntax? Anything I missed?

mrfishie commented 8 years ago

This is looking pretty good, but I'm not really sure I like the idea of including version data in the documentation - it seems like it could make some packets get very messy, and (depending on how it's done) might not be powerful enough for some changes across versions (e.g. the packet ID of something changing between two versions).

Depending on the use for version metadata, I would propose we simply use Git tags to keep track of versions of the documentation for each Artemis version. Of course, this means that packet parsers that behave differently across versions (i.e. they change the version of the protocol they use based on the version used by the other end) would be more difficult to create, but I'm not sure that's necessarily important since, as far as I know, Artemis itself isn't backwards compatible protocol-wise.

One more thing: you didn't include packet types/subtypes in the packet lists. I'm guessing those could look something like this?

    packet CaptainSelect 0x4c821d3c 0x11
        # The object ID for the new target, or 1 if the target has been cleared.
        target_id: i32

I feel like this syntax/structure is simple enough that someone else is bound to have made some kind of markup language which would fit it, in which case I would feel more comfortable using that, as it's one less thing we, and implementors, have to create and maintain.

Another thing: as far as I know, the enumerations are always used with the same type - this means that the enumeration type can be stored along with the enumeration definition, rather than where it is used.

mrfishie commented 8 years ago

Just did some experimenting with YAML, and came up with this as a potential syntax option (mostly based on your examples) - as you can see, it's surprisingly similar to your syntax in many ways (although there are some awkward bits, like defining types and subtypes).

YAML also has a thing called 'anchors', which effectively allows a value to be named and then re-used later on - we could potentially use this for structures or enumerations, like I've shown below.

---
enums:
  MainScreenView: &enums.MainScreenView
    - type: i32
    - Forward:                  0x00
      Port:                     0x01
      Starboard:                0x02
      etc
  ObjectType: &enums.ObjectType
    - type: i32
    - EndOfObjectUpdatePacket:  0x00
      PlayerShip:               0x01
      WeaponsConsole:           0x02
      etc

structs:
  Ship: &structs.Ship
    drive_type: *enums.DriveType
    ship_type: u32
    accent_color: u32
    unknown: u32
    name: string

client_packets:
  AudioCommand:
    - type: 0x6aadc57f
      # The ID for the audio message. This is given by the IncomingAudio packet.
    - audio_id: i32
      # The desired action to perform.
      audio_command: *enums.AudioCommand
  CaptainSelect:
    - type: 0x4c821d3c
      subtype: 0x11
      # The object ID for the new target, or 1 if the target has been cleared.
    - target_id: i32

I'm going to try out some other markup languages to see if I can find something that works well (because, as mentioned previously, I would prefer something where someone else has already done/is doing the work to maintain a markup language), but if not then I definitely like the layout and syntax of your idea.

mrfishie commented 8 years ago

Side note: I didn't include the version metadata in the code sample above (for the reasons I outlined above), but I believe YAML may have a feature that can do that - I'll see if I can figure something out.

mrfishie commented 8 years ago

Here's an example of what versioning could look like, using YAML's tag feature:

structs:
  Ship: &structs.Ship
    drive_type: *enums.DriveType
    ship_type: u32
    !min=3.2 accent_color: u32
    unknown: u32
    name: string

(the !min=3.2 is the version tag)

The syntax of what goes in the tag would be completely up to us, for example we could use min=3.2, >=3.2, etc.
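A quick sketch of how a parser could be taught to keep such a tag around (PyYAML as the example again; the VersionedField name is just a placeholder):

```python
import yaml  # PyYAML

# Hypothetical sketch: register a constructor for every tag starting with
# "!min=", so the version annotation survives parsing as an attribute on
# the field name.
class VersionedField(str):
    min_version = None  # e.g. "3.2"

def construct_versioned(loader, tag_suffix, node):
    field = VersionedField(loader.construct_scalar(node))
    field.min_version = tag_suffix
    return field

yaml.SafeLoader.add_multi_constructor("!min=", construct_versioned)

doc = """
Ship:
  drive_type: DriveType
  ship_type: u32
  !min=3.2 accent_color: u32
  name: string
"""

ship = yaml.safe_load(doc)
for name in ship["Ship"]:
    if isinstance(name, VersionedField):
        print(f"{name} requires version >= {name.min_version}")
```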

mrfishie commented 8 years ago

I'm going to try out some other markup languages to see if I can find something that works well

I haven't had a lot of luck with TOML, and I don't think JSON/CSON will provide enough features to result in a compact structure.

Here's another crazy idea I had though: S-Expressions. While this would require a custom parser (although parsing s-expressions isn't really hard), and probably isn't particularly practical, I thought it might be interesting to play around with:

(artemis
    (enums
        (MainScreenView i32
            (Forward 0x00)
            (Port 0x01)
            (Starboard 0x02))
        (ObjectType i32
            (EndOfObjectUpdatePacket 0x00)
            (PlayerShip 0x01)
            (WeaponsConsole 0x02)))
    (structs
        (Ship
            (drive_type (enums DriveType))
            (ship_type u32)
            (accent_color u32 (min_version 3.2))
            (unknown u32)
            (name string)))
    (client
        (AudioCommand 0x6aadc57f
            (audio_id i32)
            (audio_command (enum AudioCommand)))
        (CaptainSelect 0x4c821d3c 0x11
            (target_id i32))))
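To back up the "parsing s-expressions isn't really hard" claim, a complete reader fits in a few lines of Python (sketch only):

```python
# Sketch: parse one s-expression into nested Python lists of tokens.
def parse_sexpr(text):
    # Pad parentheses with spaces so a plain split() tokenizes everything.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        items, pos = [], pos + 1
        while tokens[pos] != ")":
            item, pos = read(pos)
            items.append(item)
        return items, pos + 1  # skip the closing ")"

    tree, _ = read(0)
    return tree

tree = parse_sexpr("(Ship (ship_type u32) (name string))")
print(tree)  # ['Ship', ['ship_type', 'u32'], ['name', 'string']]
```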

But I digress. It looks like currently the only two real potential options are @chrivers' custom syntax and YAML - out of these, syntax-wise, I prefer the custom syntax, however it does come with the issues I've discussed above. @rjwut, what do you think?

rjwut commented 8 years ago

I think that if we can make an existing data format work without too much difficulty, we should, so as to not have to maintain a custom parser. Of what's been shown so far, I like the YAML example the best. While I'm a fan of JSON and generally not a fan of syntactically-significant whitespace, I can see why YAML might be a good fit for this project. I was experimenting with a JSON implementation just to see what it would look like, but I agree that the anchor and tagging features of YAML would be useful here. I'm not sure that I like using comments for descriptions; I get that it makes sense in that they are intended for humans and therefore not useful to code generators, but the documentation generator will need them, so they're actually data, not just comments. Would a YAML parser throw the comments away, or are they accessible in the resulting data structure?

chrivers commented 8 years ago

Lots of good comments here. To be honest, I think a few points have been overlooked :) It's late here, but I'll send a proper reply tomorrow.

mrfishie commented 8 years ago

Would a YAML parser throw the comments away, or are they accessible in the resulting data structure?

That's a good point - I doubt the comments will remain accessible. One potential alternative is to use tags for the property type, and the value would be the comment, like this:

client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    - !i32 audio_id: The ID for the audio message. This is given by the IncomingAudio packet.
      !enums.audioCommand audio_command: The desired action to perform.

However, this prevents us from using anchors, and I'm not sure if multiple tags (for versioning and property type) are supported. Another issue I realised is that, in most cases, YAML parsers won't put fields into an ordered structure (e.g. in JavaScript, the field list would be an object, which has no defined order). This could be fixed by using a list, but it does seem a bit error-prone:

client_packets:
  AudioCommand:
    - type: 0x6aadc57f
      # The ID for the audio message. This is given by the IncomingAudio packet.
    - audio_id: i32
      # The desired action to perform.
    - audio_command: *enums.AudioCommand

Using this style, maybe we could move comments in to the list items, like this?

client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    - _: The ID for the audio message. This is given by the IncomingAudio packet. 
      audio_id: i32
    - _: The desired action to perform.
      audio_command: *enums.AudioCommand

YAML requires all items in a map to have a key, hence why I'm using _ here. What do you guys think? Any alternative ways to include comments that I've missed?
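For what it's worth, here's a Python sketch of loading the list-style layout above (with the alias replaced by a plain name, since its anchor isn't defined in this fragment) - a YAML sequence is ordered by definition, so field order is preserved even in languages with unordered maps:

```python
import yaml  # PyYAML

# Sketch: each field is its own single-entry list item, with "_" carrying
# the human-readable description alongside it.
doc = """
AudioCommand:
  - type: 0x6aadc57f
  - _: The ID for the audio message.
    audio_id: i32
  - _: The desired action to perform.
    audio_command: enums.AudioCommand
"""

fields = yaml.safe_load(doc)["AudioCommand"]
for item in fields[1:]:
    desc = item.pop("_")          # pull out the description...
    (name, ftype), = item.items()  # ...leaving exactly one field entry
    print(f"{name}: {ftype}  # {desc}")
```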

mrfishie commented 8 years ago

Since we now know the canonical names for each packet type, I'm guessing we want to replace the type property with the new canonical names, instead of using the integer types. Thoughts?

chrivers commented 8 years ago

Hey guys, I've been a bit busy. I'll try to do a proper writeup of my thoughts on this, asap. The current suggestions are not terrible, but there are some crucial pieces of information missing. Give me a day or two! :)

chrivers commented 8 years ago

Hey guys

Sorry for the last reply being kind of vague and ominous - I was a bit pressed for time :) Here are my thoughts on the progress so far:

I can understand the hesitation to use a custom format - "not invented here syndrome" is definitely a valid point of concern

However, I also think you might be too afraid of inventing a small wheel when existing wheels don't quite fit. Ok, terrible analogy. I have written parsers for countless things, and the example I gave was specifically designed to be parseable by a small handful of regexes. In fact, I would make it a design requirement that we keep a simple list of regexes in a "grammar" file. This would ensure that the complexity never gets out of hand, and that other parsers can easily be written.

So, I still think the custom format would be appropriate, which leads me to:

We are trying to hammer a round protocol into a square YAML

YAML represents a data structure, and like JSON (et al.) it is almost exclusively concerned with values, not structure. The proposed way of using YAML is deceptively alluring, but has several problems. First, comments are thrown away by (almost?) all parsers. Second, anchors are usually not visible after parsing. Third, we NEED size specifiers on enums and bools. Fourth, using tags, while certainly possible, would probably have to be so prevalent as to be as complicated as just writing our own small regex-based parser. Fifth, we definitely need to support version data, and not on a different branch. Which leads me to:

YAML (and JSON, etc) are data formats, not grammars

Let's consider for a moment what we want to achieve here. I want the protocol to be in an introspectable format, because generating, checking and maintaining protocol code in several different languages (possibly even for different protocol versions) is boring, difficult and error-prone.

By having a common source of protocol truth, we can generate code for all the languages and projects we want, and not worry about whether some implementation lacks a certain field, or needs updating when we learn what unknown_field_17 did. It also means that the docs and code can always stay in sync. And of course, anyone is still welcome to implement by hand, but this change makes it possible to gradually switch to generated code, for the boring bulk of the code, if nothing else.

To be able to generate the protocol de/serializers, we need to know the exact size layout, we need to know some (rather simple) parsing rules, and we need to know the mapping between bytes and values.

In my mind, we would have a common spec parser, and then a small generator for each language that we want to generate something for, along with a set of templates for other boilerplate code that isn't strictly related to the protocol.

Half parsing, half structure

One part of the challenge here is to describe the exact format of various pieces of structured data (packets). The other is how to make decisions on how to parse them. This includes packet type determination (by ID + possible sub-ID), arrays (fixed length or token-delimited?) and bitstreams (how do we parse flags?).

If we want to simplify the project, we could skip the parsing specification for now, even though that's probably pretty simple to do. When we make a parsing description, the canonical packet names from #52 would be an ideal addition to the grammar. From there, it would be a series of "match and branch" tables, something like:

parser ServerPacket read u32:
  0x0351a5ac: valueFloat
  0x077e9f3c: shipSystemSync
  0x19c6e2d4: clientConsoles
  ...

parser valueFloat read u32:
  0x00: HelmSetImpulse
  ...

packet HelmSetImpulse:
  throttle: f32

Yes, there are a few corner cases we would have to figure out (like the damned inconsistent array formats), but that is not insurmountable.
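To make the "match and branch" idea concrete, a generator could emit something as dumb as lookup tables - a hypothetical Python sketch, with framing details omitted:

```python
import struct

# Hypothetical sketch: each "parser" from the spec becomes a plain dict.
SERVER_PACKET = {
    0x0351a5ac: "valueFloat",
    0x077e9f3c: "shipSystemSync",
    0x19c6e2d4: "clientConsoles",
}

VALUE_FLOAT = {
    0x00: "HelmSetImpulse",
}

def classify(payload):
    """Follow the match tables to name a packet (framing details omitted)."""
    (packet_type,) = struct.unpack_from("<I", payload, 0)
    branch = SERVER_PACKET[packet_type]
    if branch == "valueFloat":
        # valueFloat branches again on a u32 subtype.
        (subtype,) = struct.unpack_from("<I", payload, 4)
        return VALUE_FLOAT[subtype]
    return branch

packet = struct.pack("<IIf", 0x0351a5ac, 0x00, 0.5)
print(classify(packet))  # HelmSetImpulse
```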

Personally, I'm less and less convinced that YAML will be a good fit for this - and I don't think JSON is any better (worse, probably). I'm still very much open to arguments one way or the other.

We could start very small, and just describe enums in this (or some) format, and generate the index.html from a template that is 95% index.html, and 5% generated content. Then we would have a starting point.

At the same time, we could generate enums for the languages we use. Already at that point, that would have significant value.

I look forward to hearing your comments on this :)

rjwut commented 8 years ago

Honestly, the more I think about it, the more I feel like I wouldn't make use of code generation for IAN anyway, mainly because of certain intelligent enhancements (mostly in the form of static helper methods) that I have put in my code which generated code would lack. Some examples from IAN, just with the enums:

You get the idea. There are a bunch more in the packet classes.

Given that I'm unlikely to use this to generate code, the only thing it does for me is generate documentation, at which point we've only shifted the work from one document using a well-understood HTML syntax to another with a proprietary syntax, with the added burden of creating and maintaining a parser.

Perhaps we're going about this the wrong way. What if we could establish some conventions and make some modifications to the existing HTML documentation so that the data we want can be parsed from it with some XPath expressions?

chrivers commented 8 years ago

I completely agree that missing all those pieces of implementation would be a dealbreaker. However, that was never the plan. In my rust implementation, I have the same situation.

My suggestion would be to have template source files that contain all the additional code. Depending on the language, this could be done in a number of ways. For example, I'm fairly sure you could make Java classes that inherit the basic grunt work of the protocol parsing, and add all the logic you need? Then there's no conflict, and the generation part really is not difficult.

What do you think?

mrfishie commented 8 years ago

Alright, I think I can now see where our opinions on what exactly the structured format should do differ. I also have a few other thoughts to add, but I'm somewhat busy right now, so I'll try to do a 'proper writeup' as soon as possible (mostly in response to @chrivers' monolithic comment :stuck_out_tongue:).

mrfishie commented 8 years ago

Apologies for taking so long, but here are my thoughts - first off I'll explain how I think our opinions are differing (and what exactly my opinion on what this format should do is), and then I've got a few smaller comments/questions that are related.

I can pretty much see where our disagreements are coming from, from the title "YAML (and JSON, etc) are data formats, not grammars" in https://github.com/artemis-nerds/protocol-docs/issues/50#issuecomment-247951662. @chrivers's concept language for the documentation format is a DSL/grammar (akin to language parsers), which I guess makes sense. But I feel like a grammar language is far too complicated and unnecessary for what we need - the Artemis packet format is consistent enough that we can make assumptions in order to simplify what our documentation actually specifies.

For example, this line from the example code above:

parser ServerPacket read u32:
    ...

Two things here seem odd to me - firstly, specifying that the program should read a u32 for the packet type. This is a core part of the protocol, and I think we can be pretty certain the size of the packet type is not going to change. Secondly, pretty much just the way this line is written - to me this looks like imperative code, whereas IMO this should be declarative (this further ties into why I think we should be writing this as a data format, not a grammar - but I'll discuss that shortly). Part of this is also the way a "parser" is being defined with a name - why are we abstracting the concept of the direction like this?

Put simply, I don't think it should be up to our document to define how things are done. We should define the structure of those things, and provide some primitives in order to allow us to do that (e.g. specifying that something is an array or a bitmap), but it is getting too far if we are writing code to specify how to do those operations. This is why I think a data format would best suit this project.

A few other points:

  1. A data format is much easier for a regular human to read (you don't have to 'interpret code' in your head to understand what's going on)
  2. A data format allows us to abstract away the actual inner workings of the parser (i.e. the documentation file doesn't care how the parser works)
  3. A data format is likely easier to parse in a wide array of languages, as it does not require parsing to work in one particular way (relates to point 2)
  4. By using a well-supported data format (a) we don't need to maintain our own implementation, and (b) other people using languages we haven't considered can still easily start using our format (it has a low barrier to entry, which we want).

Related to point 4, a few other things from https://github.com/artemis-nerds/protocol-docs/issues/50#issuecomment-247951662 (these are probably verging on nit-picking, so don't take them too seriously)...

I have written parsers for countless things

You have, yes. However potential future users of this documentation may not have, and this restricts the growth of a potential community.

we could keep a simple list of regexes in a "grammar" file

So now we have a grammar file for our grammar file? Also, different languages support different regex features - does this mean we'll need a grammar file for our grammar file for our grammar file, in order to list which regex features we require?

Now for a few other general comments on things...

comments are thrown away in (almost?) all parsers

Correct; I did come up with a potential solution to that (for YAML, at least) - see above.

Second, anchors are usually not visible after parsing.

The original purpose in proposing anchors is that, if we design the format well, they wouldn't need to be visible after parsing - effectively also allowing "anonymous enumerations" and possibly other things such as arrays of property lists, etc.

Third, we NEED size specifiers on enums and bools.

Size specifiers on enums relate back to what I was talking about with assumptions. From some past experience working on a JS library for Artemis that used a similar format to what I'm proposing, enumerations are always used with the same size (e.g. MainScreenView is always a 32-bit int). As a result, we can move the size specifiers up to where the enumeration is defined, instead of where it is used. IMO it makes more sense to put it with the enumeration (since that is also where the values are defined), and it also reduces duplication.

Fourth, using tags, while certainly possible, would probably have to be so prevalent, as to be as complicated as just writing our own small regex-based parser.

I'm not sure I see your point here. Depending on the parser, I believe that tags are either simply attached to the value, or a callback defined by the code using the parser is run that allows it to modify the value to be injected into the final result (in which case you could just attach the type to the value very easily). It's not very complicated.

Fifth, we definitely need to support version data, and not on a different branch.

I would like to hear your reasons behind this. Keep in mind that Git does have a feature called 'tags' that allow you to mark the repository at a certain commit (often used to mark versions) - there's no need for different branches.

The reasoning behind my proposal to use Git tags instead of embedded version data is simply that embedded version data means we have to consider every possible way the protocol is likely to change and account for it - this could include enum changes, packet types changing names, fields changing types, etc. IMO it would end up getting far too messy and difficult to modify if we try to keep all of this information in the one file, and it would likely add a lot of complexity to the parser, no matter how we do it.

  0x0351a5ac: valueFloat
  0x077e9f3c: shipSystemSync
  0x19c6e2d4: clientConsoles

Continuing with assumptions, why do we need both the integer and string types here? Since we know how to convert from string -> int types, wouldn't it make sense just to use string types throughout?

To wrap things up, I guess my opinion is that the parser generator (or simply a parser that consumes the documentation files) doesn't need to be completely dumb - we can definitely make assumptions in order to simplify the documentation. As a result, I think using a grammar-type language is overkill - we only want to define the structure of packets, not how they should be parsed.

Also, we need to clear up exactly how much this documentation format should be doing. In my opinion it should just be the low-level stuff: enough to generate an API where the user can do something like:

server.send('CaptainSelect', {
    target_id: 500
});

The user is then free to build whatever they want on top of that - this includes static helper methods and other APIs. I guess this is similar to what @chrivers explained in https://github.com/artemis-nerds/protocol-docs/issues/50#issuecomment-248117618.

One more thing I wanted to mention: there's no requirement that this documentation file be used to generate code - it should be just as easy for a parser program to load the file and then use it to parse packets at runtime (i.e. no compilation step needed). While I don't think this actually changes anything, I thought I'd mention it anyway to ensure it's accounted for.
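A tiny Python sketch of that no-compilation-step idea (the SPEC dict stands in for the loaded documentation file; names are illustrative):

```python
import struct

# Hypothetical sketch: a runtime library that consumes the field lists
# directly, so something like send('CaptainSelect', {...}) needs no
# code-generation step. FORMATS maps the spec's type names to Python
# struct codes (little-endian here).
FORMATS = {"u8": "B", "i8": "b", "u16": "H", "i16": "h",
           "u32": "I", "i32": "i", "u64": "Q", "i64": "q", "f32": "f"}

SPEC = {  # stand-in for the loaded documentation file
    "CaptainSelect": [("target_id", "i32")],
}

def encode(packet_name, values):
    """Pack the payload for a named packet straight from the spec."""
    fields = SPEC[packet_name]
    fmt = "<" + "".join(FORMATS[ftype] for _, ftype in fields)
    return struct.pack(fmt, *(values[name] for name, _ in fields))

payload = encode("CaptainSelect", {"target_id": 500})
print(payload.hex())  # f4010000
```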

Hopefully I've helped to clarify a few things on what I think the documentation format should look like. Thoughts?

IvanSanchez commented 8 years ago

Continuing with assumptions, why do we need both the integer and string types here? Since we know how to convert from string -> int types, wouldn't it make sense just to use string types throughout?

No, because it's common practice to run a network sniffer (wireshark et al), or otherwise print out the raw values of unknown (or doubtful) packets. This means that seeing the integers on screen is not unheard of, and it's nicer to have those in a readily readable form.

mrfishie commented 8 years ago

seeing the integers on screen is not unheard of, and it's nicer to have those in a readily readable form.

The generated documentation would, I assume, still have the integer types (they could be calculated by the program that generates the documentation HTML file/s); I'm just talking about whether we need to store the integer types, as they can be calculated from the string types.

chrivers commented 8 years ago

Apologies for taking so long, but here are my thoughts

I thought that was my line ;-)

first off I'll explain how I think our opinions are differing (and what exactly my opinion on what this format should do is), and then I've got a few smaller comments/questions that are related. I can pretty much see where our disagreements are coming from, from the title "YAML (and JSON, etc) are data formats, not grammars" in #50 (comment). @chrivers's concept language for the documentation format is a DSL/grammar (akin to language parsers), which I guess makes sense. But I feel like a grammar language is far too complicated and unnecessary for what we need - the Artemis packet format is consistent enough that we can make assumptions in order to simplify what our documentation actually specifies.

Ah! Yes, you are entirely correct, of course. I was definitely going for a minimalistic DSL, to have a way of describing the protocol data in a sane way.

I think it's definitely the right way to go, and I think the worries over complexity are overblown - however, I can clearly see that there's not a huge impetus to go in this direction.. :-)

For example, this line from the example code above:

parser ServerPacket read u32: ... Two things here seem odd to me - firstly, specifying that the program should read a u32 for the packet type. This is a core part of the protocol, and I think that we can be pretty certain the size of the packet type is not going to change. Secondly, pretty much just the way this line is written - to me this looks like imperative code, whereas IMO this should be declarative (this will further link into why I think we should be writing this as a data format, not a grammar - but I'll discuss this shortly). Part of this is also the way a "parser" is being defined with a name - why are we abstracting the concept of the direction like this?

Ah, perhaps the "read" was a poor choice of wording. The idea was that each "parser" (perhaps also a poorly named entity) would be a simple "match 1 value, take 1 action" type thing. Allow me to construct a slightly more fleshed-out example:

parser ServerPacket(u32):
    0x3de66711: startGame
    0x6d04b3da: plainTextGreeting
    0x80803df9: objectBitStream
 ...

object startGame:
    difficulty: u32
    game_type: GameType

enum GameType:
    Siege       = 0x00
    SingleFront = 0x01
    DoubleFront = 0x02
    DeepStrike  = 0x03
    Peacetime   = 0x04
    BorderWar   = 0x05

The idea was that any "parser" entity takes a value, compares it to some other values, and takes an action. The right hand side names aren't strings, they're other entities. If the next entity is a parser too, we repeat the process. If it's an "object", we read that, according to its list of fields. It's really a quite simple system, I think.
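To make the "match 1 value, take 1 action" idea concrete, here is a hedged Python sketch (not the real spec or any real implementation - the table contents come from the example above, the function and variable names are mine):

```python
import struct

# A parser entity: read one value, look it up in a table, and hand
# off to the next entity (another parser, or an object reader).
SERVER_PACKET = {
    0x3de66711: "startGame",
    0x6d04b3da: "plainTextGreeting",
    0x80803df9: "objectBitStream",
}

def parse_server_packet(data: bytes):
    # Read the u32 packet type (little-endian, as on the Artemis wire)
    (packet_type,) = struct.unpack_from("<I", data, 0)
    entity = SERVER_PACKET.get(packet_type)
    if entity is None:
        raise ValueError("unknown packet type 0x%08x" % packet_type)
    # A real parser would now dispatch to the named entity's own reader
    return entity, data[4:]
```

If the looked-up entity is itself a parser, the same step repeats on the remaining bytes; if it's an object, its field list is read instead.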

It also allows us to succinctly and precisely describe parsing of all subtypes, which is something that is not super clear right now. For example, many (but not all!) subtype IDs changed between u32 and u8 between versions 2.1.1 and 2.4.

Put simply, I don't think it should be up to our document to define how things are done. We should define the structure of those things, and provide some primitives in order to allow us to do that (e.g. specifying that something is an array or a bitmap), but it is getting too far if we are writing code to specify how to do those operations. This is why I think a data format would best suit this project.

I completely agree - the spec should not be code! The read "keyword" was a bad choice of wording. It's just the type of the value that the list is matched against. How this translates to parser code in real life, the spec makes no assumptions about.

However, the fact that you need to compare a u32 to a list of known values is both valuable information, and something that any implementer absolutely has to do.

A few other points:

A data format is much easier for a regular human to read (you don't have to 'interpret code' in your head in order to understand whats going on)

Well... maybe. The spec isn't code - I completely agree with that. However, a fairly complicated YAML encoding is also not super easy for humans to read. Hopefully, the documentation is the easiest possible spec for humans :)

A data format allows us to abstract away the actual inner workings of the parser (i.e the documentation file doesn't care how the parser works)

Well, I certainly disagree here. We can standardize the syntax (YAML, for example), but we still need to agree exactly how it is used, especially since YAML is not a completely natural fit for this kind of thing.

Realistically, I don't think this will be a problem in practice, with either YAML or a custom format.

A data format is likely easier to parse in a wide array of languages as it does not require parsing to work in one way (relates to point 2)

For the syntax, yes. But you need to know what to do with the data.

By using a well-supported data format (a) we don't need to maintain our own implementation, and (b) other people using other languages that we haven't considered can still easily start using our format (it has a low barrier of entry, which we want).

I agree with a), certainly. Although it's really not a huge implementation, or much work.

Since we are discussing the theoretical aspects, I would argue that b) would be just as easy to get started with, especially if we (or I) contribute example generators for a few good use cases.

Related to point 4, a few other things from #50 (comment) (these are probably verging on nit-picking, so don't take them too seriously)...

I have written parsers for countless things

You have, yes. However, potential future users of this documentation may not have, and this restricts the growth of a potential community.

Yes, I have :)

I meant it more in a "therefore, I could get us up and running quickly" way. Right now we have no parsable format, so having one that perhaps has 1 downside, really isn't worse :)

we could keep a simple list of regexes in a "grammar" file

So now we have a grammar file for our grammar file? Also, different languages support different regex features - does this mean we'll need a grammar file for our grammar file for our grammar file, in order to list which regex features we require?

Wait, wait. You just said it was bad if people would find this hard to parse, and now it's bad that we are helping them? You can't have it both ways, surely? ;-)

Now for a few other general comments on things...

comments are thrown away in (almost?) all parsers

Correct; I did come up with a potential solution to that (for YAML, at least) - see above.

Well, it.. ehm.. I'm trying to be diplomatic here... ;-)

The problem is, the file looks fine, but the data structure is bordering on insane. And it parks us in quirks-land. If the comment is blank, then a) it has to be there for it to work, and b) YAML parsers don't have consistent behaviour with "weird" keys. Sorry to say, but it's not my favorite.

Maybe we could do something with tags instead?

Second, anchors are usually not visible after parsing.

The original purpose in proposing anchors is that, if we design the format well, they wouldn't need to be visible after parsing - effectively also allowing "anonymous enumerations" and possibly other things such as arrays of property lists, etc.

I'm writing in a statically-typed language (Rust), and I need the enumerations to be there after parsing, otherwise I can't use this format... I know things are a little easier in soft-statically typed languages (Java, C#), or dynamically typed languages (Python, JS, etc).

Third, we NEED size specifiers on enums and bools.

Size specifiers on enums relate back to what I was talking about with assumptions. From some past experience working on a JS library for Artemis that used a similar format to what I'm proposing, enumerations are always used with the same size (e.g. MainScreenView is always a 32-bit int). As a result, we can move the size specifiers up to where the enumeration is defined, instead of where it is used. IMO it makes more sense putting it with the enumeration (since that is also where the values are defined), and it also reduces duplicate code.

Ah, that's a good point. I saw a whole lot of changes between 2.1.1 and 2.4, but it actually seems that most enums are either 32- or 8-bit, pretty consistently.

However, when writing a parser, it is much easier to have the needed information in one place - that is, with the packet descriptions. Otherwise, one would have to jump all over the document to find out the field sizes - what's the gain in that?

Although, if we generate the docs from this, we can probably save it just on the enum, and then write it everywhere we need, in the docs.
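The compromise being discussed here can be sketched in a few lines (the table contents and names are illustrative, not the spec): the wire size lives once, with the enum definition, and anything that needs a field's size looks it up there.

```python
# Enum sizes stored with the definitions, not at each point of use
ENUMS = {
    "GameType": {"size": 4, "values": {0x00: "Siege", 0x01: "SingleFront"}},
}
PRIMITIVE_SIZES = {"u8": 1, "u16": 2, "u32": 4}

def field_size(type_name: str) -> int:
    # Primitives carry their size in the name; enums are looked up once
    if type_name in PRIMITIVE_SIZES:
        return PRIMITIVE_SIZES[type_name]
    return ENUMS[type_name]["size"]
```

A doc generator could then print the resolved size next to every field, giving readers the "all in one place" view without duplicating it in the source.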

Fifth, we definitely need to support version data, and not on a different branch.

I would like to hear your reasons behind this. Keep in mind that Git does have a feature called 'tags' that allows you to mark the repository at a certain commit (often used to mark versions) - there's no need for different branches.

Well, git tags certainly would not work here. That would imply that we never learn anything new about (for instance) the 2.1.1 protocol, or that we would have to do some serious rebasing whenever we wanted to update the 2.1 version. That doesn't make sense to me.

Branches are also no good, since we would then have ~95% shared code, but suddenly writing a single library that speaks more than one version (without straight up having 2 complete libraries) becomes the much more difficult task of figuring out all the differences between 2 YAML data structures. That's quite a lot more difficult than tagging the few differences we do know about.

The reasoning behind my proposal to use tags instead of embedded version data is simply that it means we have to consider every possible way the protocol is likely to be changed and account for it - this could include enum changes, packet types could change names, fields could change types, etc. IMO it would end up getting far too messy and difficult to modify if we try to keep all of this information in the one file with embedded version data, and it would likely add a lot of complexity to the parser, no matter how we do it.

It's hard to say - I think separate branches would be quite a lot more complicated, especially for supporting more than one version.

Otherwise, we could just give up documenting old versions, and refer people to the historic git versions for reference, but I don't like that either.

0x0351a5ac: valueFloat 0x077e9f3c: shipSystemSync 0x19c6e2d4: clientConsoles

Continuing with assumptions, why do we need both the integer and string types here? Since we know how to convert from string -> int types, wouldn't it make sense just to use string types throughout?

As @IvanSanchez pointed out, it's very nice to have the hex values for network sniffing.

Also, the "strings" here are meant to be references, as noted earlier.

To wrap things up, I guess that my opinion is that the parser generator (or simply a parser that consumes the documentation files) doesn't need to be completely dumb - we can definitely make assumptions on things in order to simplify the documentation. As a result, I think using a grammar-type language is overkill - we only want to define the structure of packets, not define how they should be parsed.

I think that would be a nice feature, but if that's the deciding factor, we can certainly cut away that part :)

Also, we need to clear up exactly how much this documentation format should be doing. In my opinion it should just be the low-level stuff: enough to generate an API where the user can do something like:

server.send('CaptainSelect', { target_id: 500 });

The user is then free to build whatever they want on top of that - this includes static helper methods and other APIs. I guess this is similar to what @chrivers explained in #50 (comment).
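A minimal sketch of such a generated low-level layer, in Python: a generic encoder driven entirely by the packet descriptions, on top of which users build their own helpers. The table contents here are illustrative (the real CaptainSelect layout is not assumed), and the names are mine.

```python
import struct

# Packet descriptions as plain data: field name -> primitive type
PACKETS = {
    "CaptainSelect": [("target_id", "u32")],
}
FORMATS = {"u8": "<B", "u16": "<H", "u32": "<I"}  # little-endian

def encode(packet_name: str, fields: dict) -> bytes:
    # Serialize each described field in order; no per-packet code needed
    out = b""
    for name, type_name in PACKETS[packet_name]:
        out += struct.pack(FORMATS[type_name], fields[name])
    return out
```

Everything above this layer (convenience methods, event APIs) would be the library author's choice, not the spec's.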

I'm not quite following? The spec should describe the protocol serialization format. So far, there's only very little about semantics (a small paragraph about common packet exchanges at the beginning of games, but that's it).

One more thing I wanted to mention: there's no requirement that this documentation file is used in order to generate code - it should be just as easy for a parser program to be able to load the file and then use it to parse packets (i.e no compilation step needed). While I don't think this would actually change anything, I thought I'd just mention it anyway to ensure it's accounted for.

Hopefully I've helped to clarify a few things on what I think the documentation format should look like. Thoughts?

Thank you for taking the time to answer my mammoth post with a sibling! :)

I'm still firmly of the opinion that it's much easier to do this with a custom format than everybody seems to think it is. However, if no one else wants this, I don't see it happening.

There is another potential way forward. If we improved the HTML docs in certain ways, and made them follow a strict style, then we could create a small program that checks the parseability of the HTML docs, without generating anything - like the "patchcheck" tool that the Linux kernel uses to check for programming style, etc.

That way, we could simply clean up what we have, and it could be used without much change - but at the same time, we could enforce the validity with a simple HTML parser that checks that certain simple conventions are kept in place (such as all fields having a data type and a size, for example).

Thoughts?

mrfishie commented 8 years ago

I thought that was my line ;-)

Sssshhh :stuck_out_tongue:

The idea was that each "parser" (perhaps also a poorly named entity) would be a simple "match 1 value, take 1 action" type thing.

Yeah, I understand what you meant, but I still feel like having a separate 'entity' in this manner is overcomplicating things a bit - why not just provide a list of packets by name?

The right hand side names aren't strings, they're other entities. If the next entity is a parser too, we repeat the process. If it's an "object", we read that, according to its list of fields. It's really a quite simple system, I think.

Again here I see what you mean, but I feel like representing it in this way is overcomplicating it compared to how we can represent it with a data format.

Re-using your example:

client_packets:
  StartGame:
    - type: startGame
    - difficulty: u32
    - game_type: *enums.GameType  # of course we don't need to do it like this, I'll come back to that

I just feel like this kind of format is easier to understand/read and write (especially for those who potentially aren't familiar with it). It isn't as flexible as your approach but I really don't think we need that flexibility.

For example, many (but not all(!)) subtype IDs changed between u32 and u8 in version 2.1.1 and 2.4.

Good point, but I'm not sure this is solved by your example either (at least in its current form), as parsers expect the same type for the field they are looking at. This is definitely something to consider, however.

However, a fairly complicated YAML encoding is also not super easy for humans to read.

If we do it properly, I think it should be easy enough - of course the same thing goes for a custom format, but IMO what we've currently got in respect to YAML is easier to read than the current custom format (of course I'd be biased, however).

For the syntax, yes. But you need to know what to do with the data.

You are correct; however, with a custom format a user would potentially need to implement both a syntax parser and something to actually run through the data and do things with it (of course these could be part of the same code, but they're effectively still different processes).

Wait, wait. You just said it was bad if people would find this hard to parse, and now it's bad that we are helping them? You can't have it both ways, surely? ;-)

I think you misinterpreted my point - I was just trying to point out the irony/complexity in requiring a grammar file for a grammar file (and then maybe a grammar file for a grammar file for a grammar file). It was mostly tongue-in-cheek though, so not really important.

If the comment is blank, then a) it has to be there for it to work, and b) yaml parsers don't have consistent behaviour with "weird" keys. Sorry to say, but it's not my favorite.

Yeah, those are valid points - I mostly proposed that just as a potential initial solution. Side note: I don't think the key/value pair for the comment would have to be there for it to work... I suppose this depends on the language/parser, but surely the _ (or whatever) key could be optional?

Maybe we could do something with tags instead?

Unfortunately, I think tags are pretty limited as to what characters they can contain - they're parsed as identifiers, so I don't believe they can have spaces.

I'm writing in a statically-typed language (Rust), and I need the enumerations to be there after parsing, otherwise I can't use this format.. I know things are a little easier in soft-statically typed languages (Java, C#), or dynamically typed language (Python, JS, etc).

That's a good point - but keep in mind we don't have to use anchors, something like the following would work fine:

client_packets:
  StartGame:
    - type: startGame
    - difficulty: u32
    - game_type: GameType

This would definitely need further investigation, however.

However, when writing a parser, it is much easier to have the needed information in one place - that is, with the packet descriptions. Otherwise, one would have to jump all over the document to find out the field sizes - what's the gain in that?

I don't really think this would be a problem - couldn't the type just be looked up from the list of enumerations that have been defined? (this would be especially easy, as the document would be parsed into an AST before the packets are all looked at to generate code, or whatever we're doing with the file)

Well, git tags certainly would not work here. That would imply that we never learned anything new about the (for instance) 2.1.1 protocol, or that we would have to do some serious rebasing when we wanted to update the 2.1 version. That doesn't make sense for me. Branches are also no good, since we will then have ~95% shared code, but now suddenly writing a single library that speaks more than one version (without straight up having 2 complete libraries), becomes the much more difficult task of figuring out all the differences between 2 YAML data structures. That's quite a lot more difficult, than tagging the few differences we do know about.

Hmm, interesting points. I definitely agree that branches are a no-go, and I can see what you mean with tags, but I just don't think that in-text version information (at least in the form you previously presented) is the best way to go - it seems like it will get very messy, very fast.

Otherwise, we could just give up documenting old versions, and refer people to the historic git versions for reference, but I don't like that either.

This is pretty much what tagging would be.

As @IvanSanchez pointed out, it's very nice to have the hex values for network sniffing.

Keep in mind that the document isn't intended to be used in this way - that would be the purpose of the generated documentation HTML (unless its a program requiring that information, of course).

Also, the "strings" here are meant to be references, as noted earlier.

Yes, but as I mentioned previously, IMO the whole thing with references and separate entities in that manner is overkill and makes the documentation harder to read/edit when required.

I'm not quite following? The spec should describe the protocol serialization format. So far, there's only very little about semantics (a small paragraph about common packet exchanges at the beginning of games, but that's it).

Yeah, don't worry about that... re-reading that bit I wrote, the point I was trying to make didn't really make a lot of sense.

Thank you for taking the time to answer my mammoth post with a sibling! :)

Any time :wink:

There is another potential way forward. If we changed the HTML docs to be improved in certain ways, and to follow a strict style, then we could create a small program that checks the parseability of the HTML docs, without generating anything. Like the "patchcheck" tool that the Linux kernel uses to check for programming style, etc.

I like this idea, although I think a documentation file in a different format would potentially be better (if we can agree on how to do it, I guess :stuck_out_tongue:). This is, in a way, even vaguely bordering on my original XML idea - perhaps we could write an XSLT to transform the XML to HTML and display the site that way? (just an idea, although I know you're not likely to agree with it)

And, I suppose, in the words of @chrivers,

Thoughts?

mrfishie commented 8 years ago

So since it may take a while to figure out exactly what this documentation format should look like (and it would be a pretty big change from what we've currently got), I suppose the place to start would be re-arranging the markup to a standard format as @rjwut suggested - that way we get a parsable structure quickly, as well as the ability to easily convert this into a different format if/when we figure that out.

I'm willing to do some work on this and submit a pull request, does anyone else want to try it?

chrivers commented 8 years ago

Hey - I'll make a, haha, "proper writeup" soon ;-)

In the meantime, I'm a bit pressed for time. However, I thought it was easier to show you guys what I thought, instead of arguing about it, so after about 2-3 hours of coding, I've made great progress on a parser (almost complete) and a rust code generator (half done).

Give me just a short while to clean some things up, implement a few examples, and then I'll present it. If you still don't like it, then at least I've tried :)

So far, the entire parser is a whopping 88 lines of python. The code that generates rust modules for ClientPacket, ServerPacket, enums and bitfields is 51 lines of python.

Sure, there are features missing, but it's really quite manageable, and very easy to read and maintain the input files :)

chrivers commented 8 years ago

Ok, so it turns out I ended up implementing the complete solution. I now have a (still quite small) parser for the custom format. This is then connected to Mako, which is a standard templating system, with good documentation.

The templates then inspect and loop over the data structure as they see fit, which means absolutely any type of code or docs can be generated.
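The "template loops over plain data" point can be shown in miniature. This is a dependency-free sketch of the idea (transwarp itself uses Mako for the template side); the spec dict and function name are illustrative, not from the real project.

```python
# The parsed spec is just a data structure...
spec = {"GameType": {"Siege": 0x00, "SingleFront": 0x01}}

# ...and a generator is just a loop over it. Here: Rust-style enums,
# but the same data could feed HTML docs, Java, Perl, etc.
def render_rust_enums(spec: dict) -> str:
    lines = []
    for name, values in spec.items():
        lines.append("pub enum %s {" % name)
        for variant, value in values.items():
            lines.append("    %s = %#04x," % (variant, value))
        lines.append("}")
    return "\n".join(lines)
```

The generator carries all the target-language knowledge; the spec stays declarative.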

I'll clean up a few loose ends, and show you the result. I hope you'll like it! At this point, the tool is so useful that I can't go without it :)

Of course, this also means I've already converted the entire protocol spec to the new format. It's really very easy to read, and I hope we can all benefit from it. I'd be happy to help write generators for existing use cases like docs and Java, if anyone is interested :)

Give me a day or two for real-life obligations, then I'll be back

chrivers commented 8 years ago

Okay, so I promised I would get back in "a day or two".. that was five days ago :)

I haven't been sitting idly by, though. As I mentioned, I ended up implementing the custom parser and generator, to try out the idea.

After lots of work on it, I can unequivocally say: it WORKS, and it's AWESOME :D (disclaimer: perceived awesomeness not guaranteed. side effects may occur. always consult a physician before jumping on the hype train)

I'm currently using a 100% generated protocol serializer/deserializer, including support for arrays, enums, structs, dynamic array sizes, etc.

I want to give you guys an overview of what I've been working on here:

Isolinear Chips

Isolinear Chips is a complete specification of the Artemis 2.4.x network protocol, in .stf (Simple Type Format). It is a 100% feature-complete specification of the protocol, including some bug fixes that are not in the docs yet! :)

Even if no one else wants to use this, the utility for me is so great, that I'm definitely going to continue development of it. I very much hope that I can convince you that we should render the HTML docs based on this data source. We can make it a completely smooth transition, starting at 0% dynamic content, and slowly extending it.

Transwarp

The Simple Type Format is not artemis-specific in any way! It's a generic format for describing network protocols and data structures.

To do anything with .stf files, a compiler is needed to parse the input data, and run it through a template file. I've implemented Transwarp, a python-based compiler for .stf files. I'm in the process of cleaning it up just a bit, so it's not on github just yet. I'll update this ticket as soon as it is available!

Tricorder

When using transwarp to generate all the protocol parsing code, it's extremely easy to try another field layout. Simply change a few lines in the specification, regenerate, and recompile. The risk of introducing errors is basically zero, since the program and the spec always stay in sync.

The next challenge is testing the protocol code against a corpus of real-life data packets. I've had some long, and very fruitful discussions with @noseynick, about collecting and handling game network data to test against.

The tricorder utility is going to make it very easy to work with packet dumps. Features include binary parsing, hex output, frame splitting, deframing (deadbeef-header removal), and packet type searches.

Using this tool, one might search the raw corpus for all instances of a type of message, and collect them in a single file. This allows us to generate any set of test data we want, based on a common collection of captured streams.
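As a hedged illustration of the frame-splitting step such a tool automates: walk a capture, check the magic, and use a length field to find frame boundaries. The real Artemis frame header carries more fields than the two assumed here, so this is only a sketch of the technique, and the function name is mine.

```python
import struct

MAGIC = 0xdeadbeef  # the "deadbeef" header mentioned above

def split_frames(stream: bytes):
    # Assumes each frame begins: u32 magic, u32 total frame length
    frames = []
    offset = 0
    while offset + 8 <= len(stream):
        magic, length = struct.unpack_from("<II", stream, offset)
        if magic != MAGIC:
            raise ValueError("bad magic at offset %d" % offset)
        frames.append(stream[offset:offset + length])
        offset += length
    return frames
```

Once split, frames can be filtered by packet type and collected into per-type test files, as described above.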

Holodeck

Tricorder is the tool to work with network dumps, but it does not include any corpus data. In the holodeck project, @noseynick and I aim to collect a corpus of game data, to test parsing, do research, and test out new ideas for packet layouts. This project may or may not end up on github, due to the potential sheer size of the data.

Disclaimer: I did not introduce these project names to @noseynick before mentioning them here, so they might not be final :)

Next steps..

I'm currently working on getting transwarp cleaned up, writing some examples for it, and putting it on github. This should give everybody a good chance of seeing how it works in real life.

I'm very much looking forward to hearing your comments on the format, and the outlined ideas. I've aimed for good internal consistency in the format, but feedback is always welcome.

If everybody is more or less on board with the format, then I hope we can discuss a transition plan for the docs, so we can gain the advantages that are to be had here :)

NoseyNick commented 8 years ago

The name Data would seem more appropriate, but also far less useful as a search term, so Holodeck will do fine :-) I'll see if I can make good use of your Isolinear Chips in my Perl. Thanks!

chrivers commented 8 years ago

@NoseyNick Yeah, I agree :D But "data" is probably going to have a million or so search results.. oh well ;)

@everybody Here's a big update :)

I've implemented transwarp as a complete compiler project. This is not just a one-off script file. It has a user-friendly command line interface, documentation (more coming), and should be relatively straightforward to use. Take a look here: transwarp project (https://github.com/chrivers/transwarp)

Now, in terms of good examples, I'm not 100% there yet. However, the complete artemis protocol spec is available in the isolinear chips project (https://github.com/chrivers/isolinear-chips)

Also, my complete rust templates are available in the duranium-templates project (https://github.com/chrivers/duranium-templates)

Using the transwarp compiler on this dataset, you can see for yourself how a complete parser can be generated. Granted, it's in rust, but I'm sure you can imagine that this can really be used to generate absolutely anything.

I think the next task will be creating a documentation template, that also runs on isolinear chips.

@rjwut what are your thoughts on slowly transitioning the documentation to a more generated format? I'll help, of course.

I look forward to hearing your thoughts on this.

The future is now ;-)

mrfishie commented 8 years ago

This is really great work, and I definitely don't want it to be in vain, but I think we still need to have some discussion and refine everything to make it even moar awesome before we begin the transitioning process.

A small nitpick - I think the name of the parser.stf file is a bit misleading - we're not defining a parser, but providing the structure a parser would need. Along with this, I think the name parser before each block is a bit misleading - maybe type or something similar would be better?

Also, shouldn't FrameType be in enums.stf, as it's an enum? I'm not even sure an enum is the best way to define this data, as it looks like it isn't used in the stf files but instead by the parser generator (whereas other enums are referenced in the stf files). IMO we should just allow the parser generator to perform the CRC conversion instead of providing the data in the enum, but alternatively perhaps some kind of 'root type/parser' block, or even better, some way to provide the numbers in the Server/ClientParser blocks (maybe with the textual name as a comment, as the names aren't really relevant for generating parsers but are used for documentation).

A small side-note that may be worth considering for future development:

The Simple Type Format is not artemis-specific in any way! It's a generic format for describing network protocols and data structures.

I really don't think we should be making this a generic format - it complicates the format and adds a lot of unnecessary info to this data; this relates back to some points I've previously made, and overall I guess my point here is that we can assume this format will only be used with the Artemis format. I haven't yet wrapped my head around the whole thing, so this point might not be relevant, however.

Also, if I have the time I may look at setting up a way to run automatic missions to collect even moar packet data for Holodeck - would you guys be interested in this?

mrfishie commented 8 years ago

Also another small nitpick: Simple Type Format (the name) seems really generic - it doesn't actually really describe the format at all (other than saying that it's simple, which I'd say is arguable :P). If you really want it to be generic, how about something like Binary Schema Format? (Schema may not be the best word here, maybe Binary Structure Format or similar might be better). If this is gonna be Artemis-specific, something like Artemis Packet Format might be better.

mrfishie commented 8 years ago

Oh, a few more things I forgot about in respect to Isolinear Chips:

  1. In objects.stf, what do the numbers beside the object names refer to?
  2. In parser.stf, you refer to packets as structs (e.g. struct<ServerPacket::Heartbeat>) in the same way as you refer to normal structs defined in structs.stf (e.g. struct<Ship>) - however, the packets aren't defined with the struct block, but instead as sub-blocks in the packet block. While I understand that they are still effectively structures, afaik in all cases other than this you use something<Name> to refer to a block with the type of something and the name of Name, whereas in this case you don't, so it seems to be breaking conventions. IMO, something like packet<ServerPacket::Heartbeat> would be better, however it seems to me that, since structures are used commonly enough, we could allow the omission of struct< and > and just use the structure names in those cases - I don't believe this would result in ambiguous syntax, and it'd make things cleaner and likely easier to read without all of those angle brackets everywhere.
  3. Enum definitions use = whereas structures use : - please just use one :P (this would also allow parts of the parser to be re-used for enums and structs)
  4. Put the types of enums in the enum definitions, not where they're used - since enums always seem to use the same type wherever they're used, it makes much more sense to put the enum type along with the enum definition (the types belong to the enums, not to the packets that use them - just like you don't define the return type of a function when you use it in typed languages [although you might define the type of the variable you're putting the return value into, you're not defining the return type]). I don't think this would be that hard to add, as you already have to look up the values in the enumeration when it's used - the type just needs to be stored along with this.

Alright, I think that's all for now. I'm planning on trying writing a small STF parser in JS to test things out for asbs-lib, so I'll let you know if I come across any difficulties.
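To give a feel for how little machinery the block style shown earlier in this thread seems to need, here is a toy parser in Python. It is written against the examples above, not the real .stf grammar, so the structure it assumes (unindented "kind Name:" headers, indented "key = value" or "key: value" fields) is a guess for illustration only.

```python
def parse_blocks(text: str) -> dict:
    # Returns {kind: {name: {field: value}}}, e.g. blocks["enum"]["GameType"]
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and comments
        if not line[0].isspace():
            # Header line, e.g. "enum GameType:"
            kind, name = line.rstrip(":").split(None, 1)
            current = {}
            blocks.setdefault(kind, {})[name] = current
        else:
            # Field line, e.g. "Siege = 0x00" or "difficulty: u32"
            sep = "=" if "=" in stripped else ":"
            key, value = (part.strip() for part in stripped.split(sep, 1))
            current[key] = value
    return blocks
```

A JS version along the same lines should be comparably small, which matches the line counts chrivers reported for the Python parser.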

chrivers commented 8 years ago

This is really great work, and I definitely don't want it to be in vain, but I think we still need to have some discussion and refine everything to make it even moar awesome before we begin the transitioning process.

Oh, yes, definitely :)

I didn't intend for this to be the end-all-no-discussion version. But I thought that, since this issue is already of... astronomical length (heh), it would be easier to make it, then show it :)

A small nitpick - I think the name of the parser.stf file is a bit misleading - we're not defining a parser, but providing the structure a parser would need. Along with this, I think the name parser before each block is a bit misleading - maybe type or something similar would be better?

Good suggestion! I'm actually working on a (planned) slight change to the format already. All section names (enum, object, parser, etc) are going to be completely free-form! This means that we can just use the ones we think are the most descriptive. All sections can have sub-sections, and so on. This is almost where we are right now, but I took a few shortcuts to get this out the door.

Also, shouldn't FrameType be in enums.stf, as it's an enum? I'm not even sure an enum is the best way to define this data, as it looks like it isn't used in the stf files but instead by the parser generator (whereas other enums are referenced in the stf files).

Good point! It's actually very much deliberate. The compiler doesn't care which stf file a definition comes from, so grouping it into different files is just to make it easier for us. We can reorganize at any time, without affecting the output at all.

With FrameType specifically, I put it in parser.stf, because that's where it is used. It was (and still is, when searching, editing, etc) much more useful to have it close to where it is referenced.

IMO we should just allow the parser generator to perform the CRC conversion instead of providing the data in the enum, but alternatively perhaps some kind of 'root type/parser' block, or even better, some way to provide the numbers in the Server/ClientParser blocks (maybe with the textual name as a comment, as the names aren't really relevant for generating parsers but are used for documentation).

Good idea! The consts part is being addressed in the next version :)

A small side-note that may be worth considering for future development:

The Simple Type Format is not artemis-specific in any way! It's a generic format for describing network protocols and data structures. I really don't think we should be making this a generic format - it complicates the format and adds a lot of unnecessary info to this data; this relates back to some points I've previously made, and overall I guess my point here is that we can assume this format will only be used with the Artemis format. I haven't yet wrapped my head around the whole thing so this point might not be relevant, however.

It's already generic.. :)

All I mean by this is that I didn't make any artificial limitations, or gross assumptions about what it's used for. The templates are very (100%) specific to the project they are used in, but there's really no need for the stf itself to be. It's really just a markup language.

Also, if I have the time I may look at setting up a way to run automatic missions to collect even moar packet data for Holodeck - would you guys be interested in this?

Oh yes! Please! :)

Can you send me an email, and we can coordinate further? @NoseyNick and myself are still working on the challenge of having a reasonable capture format, generating test corpuses, etc.

The ascii-based data dump format we use is (or rather, will be) documented in https://github.com/chrivers/tricorder. Right now, that repo does not have the newest stuff implemented.

Perhaps you can capture in pcap format first? Then I'm sure we will find a way to convert the data later.

Also another small nitpick: Simple Type Format (the name) seems really generic - it doesn't actually really describe the format at all (other than saying that it's simple, which I'd say is arguable :P). If you really want it to be generic, how about something like Binary Schema Format? (Schema may not be the best word here, maybe Binary Structure Format or similar might be better). If this is gonna be Artemis-specific, something like Artemis Packet Format might be better.

You know, that bugs me too. I think the best option would be to find a short, memorable name. That's what I tried to do with the other Artemis-related projects I made. Suggestions welcome, it's still in beta :)

chrivers commented 8 years ago

Oh, a few more things I forgot about in respect to Isolinear Chips:

  1. In objects.stf, what do the numbers beside the object names refer to?

Yeah, it's not very logical. It's the size of the bitmask, in bytes.

In the next version, all structures can have associated constants, as well as the body. So this would be:

object Base:
    BITMASK_SIZE = 2

    # Name (bit 1.1, string)
    #
    # The name assigned to this base. In standard, non-custom
    # scenarios, base names will be unique, but there is no
    # guarantee that the same will be true in custom scenarios.
    name: string

    # Shields (bit 1.2, float)
    #
    # The current strength of the base's shields.
    front_shields: f32

(also, see below)
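To make the bitmask idea concrete, here is a minimal sketch of how a parser generated from (or interpreting) such a definition might consume an object update. The field names, bit layout, and formats below are illustrative assumptions, not the real Artemis Base bitmask:

```python
import struct

# Illustrative field table in bit order (names and layout are
# assumptions for this sketch, not the real Base definition).
FIELDS = [
    ("front_shields", "<f"),  # bit 1.1, little-endian f32
    ("rear_shields",  "<f"),  # bit 1.2, little-endian f32
]

def parse_update(data, bitmask_size=2):
    """Read a little-endian bitmask of BITMASK_SIZE bytes, then parse
    only the fields whose bits are set, in field-table order."""
    mask = int.from_bytes(data[:bitmask_size], "little")
    offset = bitmask_size
    out = {}
    for bit, (name, fmt) in enumerate(FIELDS):
        if mask & (1 << bit):
            (value,) = struct.unpack_from(fmt, data, offset)
            offset += struct.calcsize(fmt)
            out[name] = value
    return out
```

For example, a payload whose mask has only bit 2 set carries only the second field's bytes, and the parser skips the first field entirely.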

  2. In parser.stf, you refer to packets as structs (e.g. struct<ServerPacket::Heartbeat>) in the same way as you refer to normal structs defined in structs.stf (e.g. struct<Name>) - however, the packets aren't defined with the struct block, but instead as sub-blocks in the packet block. While I understand that they are still effectively structures, afaik in all cases other than this you use something<Name> to refer to a block with the type of something and the name of Name, whereas in this case you don't, so it seems to be breaking conventions. IMO, something like packet<ServerPacket::Heartbeat> would be better, however it seems to me that, since structures are used commonly enough, we could allow the omission of struct< and > and just use the structure names in those cases - I don't believe this would result in ambiguous syntax, and it'd make things cleaner and likely easier to read without all of those angle brackets everywhere.

Good point! Again, there's bound to be a few rough edges here and there, and this is probably one of them.

However, it is vital that the type is marked. Otherwise, the parser complexity explodes completely. Right now, the parsing rules for types are quite simple:

  1. if it's an identifier, we're done
  2. if it's of the form "name<args>", then it's a composite type. Parse the name, and parse the args as a list of types (recursively).

Now, without the struct type markup, there's no way to tell if, for instance, "f32" is the name of a struct, a built-in primitive, or something completely different?

It's possible to fix, but it would require a lot of namespace code in the compiler, for no real perceived gain.
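Those two rules fit in a few lines. Here is a hypothetical sketch of such a type-expression parser (not the actual transwarp compiler code), returning a (name, args) tree:

```python
def parse_type(s):
    """Parse an STF-style type expression per the two rules above:
    a bare identifier, or name<arg, arg, ...> with recursive args.
    Returns (name, [parsed args])."""
    s = s.strip()
    if "<" not in s:
        return (s, [])               # rule 1: plain identifier, done
    name, _, rest = s.partition("<")
    assert rest.endswith(">"), "unterminated composite type"
    inner = rest[:-1]
    # split on top-level commas only; commas inside nested <> stay put
    args, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(inner[start:i])
            start = i + 1
    args.append(inner[start:])
    return (name.strip(), [parse_type(a) for a in args])  # rule 2
```

So parse_type("f32") yields a bare identifier, while parse_type("enum<u8, BeamFrequency>") yields a composite with two recursively parsed arguments.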

  3. Enum definitions use = whereas structures use : - please just use one :P (this would also allow parts of the parser to be re-used for enums and structs)

Ah, it's actually pretty deliberate! But I didn't get to that part of the documentation :)

Notice how enums and flags (which also use "=") have values, whereas structs, packets, etc have fields with types. Anything after "=" is parsed (as an int, for example) while ":" denotes a type is coming up.

That should be clearer, I agree.
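The "=" vs ":" rule can be stated as a one-function line classifier. This is a hypothetical sketch of the convention, not the actual compiler:

```python
def classify(line):
    """Classify one STF body line by the rule above: '=' introduces a
    parsed literal value (enum members, flags, constants), while ':'
    introduces a type for a field."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line:
        return None
    eq, colon = line.find("="), line.find(":")
    if eq != -1 and (colon == -1 or eq < colon):
        name, _, value = line.partition("=")
        return ("value", name.strip(), int(value.strip(), 0))
    if colon != -1:
        name, _, typ = line.partition(":")
        return ("field", name.strip(), typ.strip())
    return ("other", line, None)
```

Under this rule, "BITMASK_SIZE = 2" parses as a value and "front_shields: f32" as a typed field, so the two separators are doing real work rather than being interchangeable.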

  4. Put the types of enums in the enum definitions, not where they're used - since enums always seem to use the same type wherever they're used, it makes much more sense to put the enum type along with the enum definition (the types belong to the enums, not to the packets that use them - just like you don't define the return type of a function when you use it in typed languages [although you might define the type of the variable you're putting the return value into, you're not defining the return type]). I don't think this would be that hard to add, as you already have to look up the values in the enumeration when it's used - the type just needs to be stored along with this.

I completely agree with this sentiment... but! It turns out the artemis protocol is so bloody inconsistent that this isn't true! I have 2 examples right now, but I think there's been a couple more I've seen in previous versions:

1. BeamFrequency:
PlayerShip object:
    beam_frequency: enum<u8, BeamFrequency>
ClientPacket::SetBeamFreq:
    beam_frequency: enum<u32, BeamFrequency>

2. MainScreenView:
PlayerShip object:
    main_screen_view: enum<u8, MainScreenView>
ClientPacket::SetMainScreen:
    main_screen_view: enum<u32, MainScreenView>

Now maybe, maybe one could make a few assumptions and other unsavory things to normalize this, but who's to say when the next crazy thing will happen?
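This inconsistency is exactly what a per-use width handles cleanly: the decoder takes the storage width from the use site, not from the enum. A minimal sketch (the BeamFrequency value table here is an assumption for illustration only):

```python
import struct

# Hypothetical value table -- the actual BeamFrequency values are
# an assumption here, used only for illustration.
BEAM_FREQUENCY = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}

WIDTH_FMT = {"u8": "<B", "u32": "<I"}  # little-endian widths

def read_enum(data, offset, width, table):
    """Decode an enum whose storage width is declared at the use site,
    mirroring enum<u8, BeamFrequency> vs enum<u32, BeamFrequency>."""
    fmt = WIDTH_FMT[width]
    (raw,) = struct.unpack_from(fmt, data, offset)
    return table[raw], offset + struct.calcsize(fmt)

# The same logical value travels at two different wire widths:
obj_bytes = struct.pack("<B", 2)   # object update: one byte
pkt_bytes = struct.pack("<I", 2)   # client packet: four bytes
```

Both buffers decode to the same enum member; only the number of bytes consumed differs.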

Alright, I think that's all for now. I'm planning on trying writing a small STF parser in JS to test things out for asbs-lib, so I'll let you know if I come across any difficulties.

Sure, let me know if you need any help :)

If you don't mind me asking, do you intend to implement a templating system/some kind of generator as well? Certainly, more implementations are better than fewer, but isn't it a little bit of reinventing the wheel we just made? :)

mrfishie commented 8 years ago

I didn't intend for this to be the end-all-no-discussion version. But I thought that, since this issue is already of.. astronomical length (heh), it would be easier to make it, then show it :)

Perhaps we should start making issues on the Isolinear Chips (and related) repo/s for discussing things now - that might help to organise thoughts better and avoid massive responses like this one.

All section names (enum, object, parser, etc) are going to be completely free-form! This means that we can just use the ones we think are the most descriptive.

By this, do you mean that sections don't have a type?

The compiler doesn't care which stf file a definition comes from, so grouping it into different files is just to make it easier for us.

Yeah, I thought that would be the case, it's just that all the files seem to be named by the types of sections they contain, and then there's this random enum.

Just curious, how does the parser decide which stf files to open? Does it just open all of the ones in a folder?

I put it in parser.stf, because that's where it is used

But AFAIK it's not actually used by the STF files at all, but instead read by the parser, which is why I think it should be a different type (all other enums are used by the program) - but I suppose this doesn't matter anymore with the free-form sections.

but there's really no need for the stf itself to be. It's really just a markup language.

Yeah - I suppose my stance on this project has been that we should make the structure language specifically to describe the Artemis formats, as it allows us to simplify some things in the documentation. But of course, there are pros and cons to each side.

Can you send me an email, and we can coordinate further?

Sent :)

You know, that bugs me too. I think the best option would be to find a short, memorable name. That's what I tried to do with the other Artemis-related projects I made. Suggestions welcome, it's still in beta :)

If you're wanting to publicise this and encourage people to use it for other projects, IMO a descriptive name for the format would probably be best (or at least a name that somewhat gives away what the format is for) - e.g. for a Star Trek layman, Isolinear Chips probably doesn't mean a whole lot (although in this case that's probably fine as it's not really meant as a major public project).

So this would be:

object Base:
    BITMASK_SIZE = 2

    # Name (bit 1.1, string)
    #
    # The name assigned to this base. In standard, non-custom
    # scenarios, base names will be unique, but there is no
    # guarantee that the same will be true in custom scenarios.
    name: string

    # Shields (bit 1.2, float)
    #
    # The current strength of the base's shields.
    front_shields: f32

Oh gosh, now we have equals signs and colons in the same sections... I get that it's to differentiate between constants and types, but it also seems very easy to mix these up while writing - perhaps there's a better way to separate these? (e.g. surround a section of constants with %, although this might be a bit ugly for enums)

there's no way to tell if, for instance, "f32" is the name of a struct, a built-in primitive, or something completely different?

Couldn't you just disallow using those names? It seems to me this is a bit like using a function in a language that's defined in the standard API, as opposed to one you've defined - sections could just be thought of as types defined in the program. I guess this depends on how you parse the format.

It turns out the artemis protocol is so bloody inconsistent that this isn't true!

Oh wow, okay... It definitely wasn't like that back when I was working on my old parser, but clearly I haven't kept up-to-date with the documentation. After I read this I was considering pitching the idea of having enums specify a 'base type' and then places where the enum is used that don't use that type would provide their own, but I think that'd get far too annoying to work on and probably complicate the compiler considerably.

If you don't mind me asking, do you intend to implement a templating system/some kind of generator as well? Certainly, more implementations are better than fewer, but isn't it a little bit of reinventing the wheel we just made? :)

Potentially, I'm not yet sure - I may end up making a packet parser that reads these structure files and 'interprets' them on-the-fly (i.e no code gen necessary). Also, since we want other people working on Artemis-related projects to be able to use the format, I thought it'd be a good test to make sure people other than you can write a parser that makes sense ;P

chrivers commented 8 years ago

I didn't intend for this to be the end-all-no-discussion version. But I thought that, since this issue is already of.. astronomical length (heh), it would be easier to make it, then show it :) Perhaps we should start making issues on the Isolinear Chips (and related) repo/s for discussing things now - that might help to organise thoughts better and avoid massive responses like this one.

Good idea! This infamous ticket #50 is definitely getting out of hand :)

@everybody: so from now on,

syntax and compiler -> https://github.com/chrivers/transwarp
protocol spec -> https://github.com/chrivers/isolinear-chips

Specifically, this thread is continued here:

https://github.com/chrivers/transwarp/issues/1

NoseyNick commented 6 years ago

not really sure I like the idea of including version data in the documentation [...] I would propose we simply use Git tags to keep track of versions of the documentation for each Artemis version

I beg to differ on this one. In my own code, I've tried hard to maintain compatibility with "all protocol versions" if at all possible... I've found it very useful to be able to mark particular fields / packets / parts of packets as "PROTOVERSION >= 2.6" or whatever, and I'm often reverse-engineering things from OLDER versions of the protocol, not just "the current". I think you'd struggle to improve docs for ALL versions unless the protocol-docs can have a different version scheme to the protocol itself. In #78 I refer to some 2.4.0 updates that I'd presumably have to commit into 2.4.0, 2.5.1, AND 2.6 branches if we were maintaining separate protocol-docs for separate protocol versions :man_shrugging:

In the meantime... Should I be waiting for https://github.com/chrivers/isolinear-chips to replace https://github.com/artemis-nerds/protocol-docs or do we feel it already has done? In other words, the issues I've been adding #75 #76 #77 #78 #79 for protocol 2.6.204, 2.6.0, and earlier... should I be trying to turn into pull-requests for https://github.com/chrivers/isolinear-chips or https://github.com/artemis-nerds/protocol-docs or ... both? :confused: -- Cheers

NoseyNick commented 5 years ago

Someone just drew my attention to http://kaitai.io/ ... in case anyone felt like describing artemis bitstreams in a binary description language that isn't artemis-specific 😁 On first reading, it could definitely handle most of the obvious primitive types, packet types with various IDs and subtypes, I'm FAIRLY sure it could even do the object bitfield stuff, but I'm not yet 100% clear if/how it might handle different protocol versions.

NoseyNick commented 5 years ago

... and more generically: https://github.com/dloss/binary-parsing