TAXIIProject / TAXII-Specifications

A repository for development of the TAXII Specifications. For official releases, please see http://taxiiproject.github.io/releases/

Proposal: Data Model and Naming Changes #56

Open MarkDavidson opened 9 years ago

MarkDavidson commented 9 years ago

In many places in the TAXII Specifications, very long names are used (ManageCollectionSubscriptionRequest comes to mind). Implementation experience shows that these long names are often used together (e.g., ManageCollectionSubscriptionRequest.SubscriptionParameters), and that the long names are cumbersome to work with (I wrote the spec and even I can't remember the name half the time).

This change proposes the following goals for data model and naming changes in a future revision of TAXII, in no particular order:

Further discussion of this topic should include a complete list of proposed changes.

jordan2175 commented 9 years ago

Brilliant. This will be very helpful for developers trying to learn TAXII and build tools for it.

jordan2175 commented 9 years ago

Some of the XML field names are also really long, and I would suggest that we shorten them. For example, Collection_Information_Response could be Info_Res, or even spelled out as Information_Response. I would also like to see "message_id" shortened to just "id". And when we have things like Collections -> Collection_Name and Collection_Type as sub-elements, I would prefer that they just be Name and Type; there is no reason to prefix them with Collection, as they are already in the container called Collections.

jordan2175 commented 9 years ago

And I think the in_response_to should be just renamed idref as it is referencing some other ID value. This would make it similar to STIX.

MarkDavidson commented 9 years ago

@jordan2175,

I'm not sure I agree. The semantics between STIX idref and TAXII in_response_to are entirely different. In STIX, the idref is a pointer to an object (e.g., Indicator) that has been defined elsewhere, and in STIX it's important to be able to reference that content. In TAXII, in_response_to is really just a token that allows you to map responses to previous requests (and it's not really all that necessary in HTTP since request and response are bound together by the underlying protocol).

In STIX, I might specify an idref that somebody else defined and send you some content. In TAXII, a message_id is really only important until you receive a response to the message you sent and is not important to anyone outside of the exchange.

I also think a positive of in_response_to is that it's "self evident" - people look at the field and can reasonably infer what it means without having to read the field definition. A change to idref would break that (IMO).

jordan2175 commented 9 years ago

Very good points, I have changed my opinion about in_response_to

jordan2175 commented 9 years ago

@MarkDavidson I would like to get your thoughts around what you are thinking here. As Sergey has pointed out, the JSON version is just enough different now, due to structural changes, that it might be a good time to dry run what some of these changes SHOULD look like. We can do it in JSON without breaking anything, since there is not yet a JSON version.

MarkDavidson commented 9 years ago

@jordan2175,

If you're willing to take on the work, here's what I was thinking:

Does that make sense? I'm purposely not going into depth on my opinion of specific fields to see what others come up with =)

-Mark

jordan2175 commented 9 years ago

Looking towards other binding specifications and future binary versions, and given that this is meant to be machine consumable rather than human readable, I would propose the following. Anything we do for TAXII 2.0 will require a rewrite of code, so let's shoot for the moon... :) Please note that I am proposing two separate ideas for each item on the same line: first the field name change and second, independently, the idea of moving to a bit-sized number. We can do the first WITHOUT the second, depending on what people think.

TAXII Headers

id - formerly message_id, UUID - 128 bit number / 36 character wide
in_response_to - UUID - 128 bit number / 36 character wide
options - formerly extended_headers, name/value pairs where name and value restricted to strings
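
The two sizes quoted for id and in_response_to (128 bits as a number, 36 characters as text) can be checked with a minimal Python sketch using the standard uuid module. The variable names are illustrative only and are not part of any TAXII specification.

```python
import uuid

# Generate a random (version 4) UUID to stand in for a message id.
message_id = uuid.uuid4()

# Text form: 32 hex digits plus 4 hyphens.
text_form = str(message_id)

# Binary form: the same value as 16 raw bytes (128 bits).
binary_form = message_id.bytes

print(len(text_form), "characters as text,", len(binary_form) * 8, "bits as binary")
```

The text form is more than twice the size of the binary form, which is the trade-off being weighed in the proposal.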

Status Messages

status
    type - 16 bit number, also when something is nested let's remove the namespace stuff
    details - name/value pairs where name is a string and value restricted to strings or array of strings
    message

Status Types

I would leave these the same; they seem reasonable. I would, however, define the type field to be 16 bits wide and assign each of these a number. This way we are not passing them around as strings, and people can look up the strings in a table of constants. Making this a 16-bit number also leaves lots of room to grow. We could say that numbers over 1024 are unregistered numbers that people can use however they like, but if they want others to use their status types, they should register them with you.

Status Details

Nothing to change here.
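
As a rough illustration of the 16-bit status type idea, here is a minimal Python sketch. The status names come from TAXII 1.1, but the numeric assignments, the REGISTERED_MAX constant, and the helper functions are all hypothetical; nothing in this thread defines actual values.

```python
import struct
from enum import IntEnum


class StatusType(IntEnum):
    """Lookup table of status types. The numbers are made up for illustration."""
    SUCCESS = 0
    FAILURE = 1
    BAD_MESSAGE = 2
    NOT_FOUND = 3


# Numbers above this bound are treated as unregistered, per the suggestion above.
REGISTERED_MAX = 1024


def encode_status(code: int) -> bytes:
    """Pack a status type as an unsigned 16-bit big-endian integer."""
    return struct.pack(">H", code)


def decode_status(raw: bytes) -> str:
    """Unpack a 16-bit status type and look up its name in the table."""
    (code,) = struct.unpack(">H", raw)
    if code > REGISTERED_MAX:
        return f"UNREGISTERED_{code}"
    # Raises ValueError for a registered-range number missing from the table.
    return StatusType(code).name


print(decode_status(encode_status(StatusType.BAD_MESSAGE)))
```

Each status then costs two bytes on the wire, at the price of needing the table to read a captured message.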

Discovery Messages

discovery
    type - 2 bit number (4 possible options for future growth) with current options of 0=request / 1=response
    services - formerly service_instance
        type (formerly service_type) - 4 bit number (16 possible options for future growth)
        version (formerly service_version) - 4 bit number, 0=urn:taxii.mitre.org:services:1.1
        available - boolean or maybe a 2 bit number (yes, no, perhaps if you do XYZ)
        address - string, more than just an IP address
        protocol (formerly protocol_version) - 8 bit number 
        encodings (formerly message_binding) - 8 bit number
        queries (formerly supported_query) - name/value pair where the name is the format_id and value is a string
        message

Collection Information Messages

information
    type - 2 bit number (4 possible options for future growth) with current options of 0=request / 1=response
    collections (formerly collection)
        name (formerly collection_name)
        type (formerly collection_type) - 2 bit number, 0=DATA_FEED, 1=DATA_SET (plus 2 for future)
        available - boolean or maybe a 2 bit number (yes, no, perhaps if you do XYZ)
        description
        volume - 128 bit number
        push_methods
            protocol (formerly protocol_version) - 8 bit number 
            encodings (formerly message_binding) - 8 bit number
        get_services (formerly polling_services, I really do not like the poll name) or maybe polling_services
            address - string
            protocol (formerly protocol_version) - 8 bit number 
            encodings (formerly message_binding) - 8 bit number
        subscription_services
            address - string
            protocol (formerly protocol_version) - 8 bit number 
            encodings (formerly message_binding) - 8 bit number
        inbox_services (formerly receiving_inbox_services)
            address - string
            protocol (formerly protocol_version) - 8 bit number 
            encodings (formerly message_binding) - 8 bit number

Before I do any more, I would like to see how far off my views are from those of others.

athiasjerome commented 9 years ago

I agree with consistent pluralization.

Regarding brevity, I would like to put a WARNING there. First of all, if you go this way, we MUST have a mapping between old and new names (and I mean a mapping that can be parsed programmatically, not one that lives only in documentation).

My experience after years of coding is that, over the full application life cycle, clear names (which often means long names) provide a lot of benefits in time and money: it is easy to find exactly what you are looking for in millions of lines of code, search/replace is easy, new developers get up to speed more easily (less time spent checking the documentation), etc.

So as a long-term strategy, I don't recommend having too much appetite for brevity. A global naming convention would have to be investigated, peer reviewed, and accepted by the community. My 2c
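
A programmatically parsable mapping of the kind requested here could be as small as a dictionary. The Python sketch below uses the renames proposed earlier in this thread; the dictionary-shaped message is hypothetical, since no JSON binding exists yet, and rename_fields is an illustrative helper rather than part of any library.

```python
# Old-to-new field names, taken from the renames proposed earlier in this thread.
FIELD_RENAMES = {
    "message_id": "id",
    "extended_headers": "options",
    "service_instance": "services",
    "service_type": "type",
    "service_version": "version",
    "protocol_version": "protocol",
    "message_binding": "encodings",
    "supported_query": "queries",
}


def rename_fields(value):
    """Recursively rename dictionary keys using FIELD_RENAMES."""
    if isinstance(value, dict):
        return {FIELD_RENAMES.get(k, k): rename_fields(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_fields(item) for item in value]
    return value


# Hypothetical message shape, for illustration only.
old_message = {
    "message_id": "1",
    "service_instance": [{"service_type": "DISCOVERY", "address": "/discovery"}],
}
new_message = rename_fields(old_message)
print(new_message)
```

Note that several old names collapse to the same new name (for example "type"), so the reverse mapping would need the enclosing container as context.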

athiasjerome commented 9 years ago

For transport (that is, bindings/glue code), a mapping that reduces the size of the messages makes sense. BUT there are much more efficient mechanisms for doing that than changing the length of parameter names in the specification, or even using integers instead of string values. IMHO, it is too much effort (time/money) for results that you can obtain easily with dedicated or advanced mechanisms (cheap, as they already exist). For example, compare changing the lengths of parameter names (or using mapping tables between enumeration strings and integers, etc.) in HTML or JS code to reduce bandwidth use versus simply using gzip compression. There are a lot of ready-made, easy-to-use, highly effective mechanisms for obtaining binary versions of text for transport without changing specifications.
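
The gzip comparison can be tried with a few lines of Python. This is only a sketch under stated assumptions: the JSON-like message shape is hypothetical (TAXII has no JSON binding at this point), the integer codes are made up, and the highly repetitive sample data flatters the compressor, so real messages would need to be measured.

```python
import gzip
import json

# Hypothetical message using the long field names and URN values.
long_form = json.dumps([
    {"protocol_binding": "urn:taxii.mitre.org:protocol:http:1.0",
     "message_binding": "urn:taxii.mitre.org:message:xml:1.1"}
    for _ in range(50)
]).encode("utf-8")

# The same data with shortened names and integer codes (codes are made up).
short_form = json.dumps([
    {"protocol": 1, "encodings": 1} for _ in range(50)
]).encode("utf-8")

# Report raw and gzip-compressed sizes for both variants.
for label, raw in (("long", long_form), ("short", short_form)):
    packed = gzip.compress(raw)
    print(label, len(raw), "bytes raw,", len(packed), "bytes compressed")
```

Comparing the compressed sizes of the two variants shows how much of the saving from shorter names survives once compression is applied at the transport layer.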

jordan2175 commented 9 years ago

So I take from your comments that you agree with the proposal, at least to some level? Aside from the shorter names, which I really like, if we were to just use some sort of integer representation for the protocol_binding and message_binding fields, that would save a lot of over-the-wire space and storage space. Think of getting rid of "urn:taxii.mitre.org:protocol:http:1.0" and replacing that with "1".

MarkDavidson commented 9 years ago

@jordan2175,

I think you're largely on track from my perspective, and that completing what you've started would be a good input into a future revision of TAXII. I also agree with @athiasjerome that it's possible to go too far.

My personal thought is that for now we may as well stay away from coding values - this will be dependent on the chosen message format and can be done later as desired. I think the high value thing for now is the naming.

A couple thoughts on specific naming options:

So overall, I think this is the right direction and I think @athiasjerome makes good points to keep in consideration.

Thank you. -Mark

jordan2175 commented 9 years ago

@MarkDavidson really good points, and now that I step back, I agree with you on the information message. We need more context. To clarify the last point, for something like discovery, below, I would suggest that it not be services_type and services_version as it is already nested in the services container. No need for extra namespace designations.

discovery
    services - formerly service_instance
        type (formerly service_type) - 4 bit number (16 possible options for future growth)
        version (formerly service_version) - 4 bit number, 0=urn:taxii.mitre.org:services:1.1

traut commented 9 years ago

@jordan2175 thanks, it is great to shake "enterprisy" TAXII names a bit :)

I fully agree that the field names should be much simpler and shorter. @MarkDavidson made a good point though - these changes should also be in a future version of TAXII.

To avoid danger of having 2 different naming schemas, I think it is better to tread lightly: remove obvious redundant suffixes/prefixes but keep the structure aligned with main Spec document, and incorporate all your structure proposals into TAXII 2.0 Spec proposal. I remember there were doubts that we even need one, and I think this is a good reason.

Just a note: I have mixed feelings about limiting the sizes of values and replacing string values with integers in order to save space. I think this should be done, but I believe that the protocol should do that for us. We should not sacrifice readability in order to win some bytes or KB. For example, we can have nice enums and long field names in protobuf, and the protocol will pack it all up nicely. Forcing this at the spec level is like trying to minimize a novel while writing it instead of relying on gzip.

terrymacdonald commented 9 years ago

[+1] on letting the inherent compressibility of the underlying protocols do the work. But also [+1] for revising the field names as you have, Bret. I would like to see the string values kept, at least for now, to stay at least moderately in step with the current version of STIX. It's definitely something worth discussing as part of the STIX v2.0 work though.

It will be interesting to do a comparison between Protobuf3 ( https://github.com/google/protobuf), SimpleBinaryEncoding ( http://real-logic.github.io/simple-binary-encoding/), Cap'n Proto ( https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html), FlatBuffers, Thrift and even Avro ( https://avro.apache.org/docs/1.7.7/index.html) for sending a simple but big STIX watchlist over a link to see what is fastest. Maybe I've just come up with my next big project :). Doing some real tests with real data would be interesting from a future perspective, especially with STIX v2.0 on the horizon.

Cheers Terry MacDonald

jordan2175 commented 9 years ago

After finishing my libtaxii APIs, I really would like to see us move to a model of adding a type value to the messages, as I mentioned up above. I believe this will reduce a significant amount of code and complexity. Meaning, instead of having discovery_request and discovery_response messages... I would like to see just a "discovery" message and the first element is a "type" value. The type can be either request or response. If we were to do that for all of the TAXII messages I believe we could reduce the code in libraries by greater than 25%.

MarkDavidson commented 9 years ago

@jordan2175,

It sounds like you're advocating something like:

<Message>
   <Type>Discovery Request</Type>
   <Whatever/>
</Message>

Over what currently exists, which is similar to:

<Discovery_Request>
   <Whatever/>
</Discovery_Request>

On the face of it, I'm not sure how one is all that different than the other - you'll have to do a switch statement (or similar) based on a text value at some point to decide processing.

The biggest difference I see is that you can't do XML schema validation based on element data (e.g., Type=Discovery Request) but you can based on an element name.

-Mark

jordan2175 commented 9 years ago

@MarkDavidson No, that is not what I am talking about.... I am looking for something more along the lines of:

<discovery>
    <type>request</type>
    <everything else>
</discovery>

<discovery>
    <type>response</type>
    <everything else>
</discovery>

<collection>
    <type>response</type>
    <everything else>
</collection>

<subscription>
    <type>response</type>
    <everything else>
</subscription>

MarkDavidson commented 9 years ago

@jordan2175,

To me, this seems to introduce a decision matrix. Instead of just looking at the root element name, you have to look at the root element name and the type field.

For instance, your new construct might be implemented like this:

if message.root_element == 'discovery':
    if message.type == 'request':
        process_message(message)
    elif message.type == 'response':
        raise Exception('Wrong kind of message!')
    else:
        raise Exception('Unknown discovery type!')
elif message.root_element == 'collection':
...

Whereas the current construct would just be:

if message.root_element == 'discovery_request':
    process_message(message)
elif message.root_element == 'discovery_response':
    raise Exception('Wrong kind of message!')
...
else:
    raise Exception("Unknown message type!")

Could you help me understand how what you propose would result in a reduction of complexity?
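
One way to put the two constructs side by side is a lookup-table dispatch, sketched below in Python with the standard xml.etree.ElementTree module. This is an illustration only, not code from libtaxii: the handler, the table names, and the lowercase un-namespaced element names are all simplifying assumptions (real TAXII XML is namespaced).

```python
import xml.etree.ElementTree as ET


def handle_discovery_request(root):
    # Placeholder handler; a real one would build a response.
    return "handled discovery request"


# Proposed style: key on (root element name, text of the <type> child).
TYPED_HANDLERS = {("discovery", "request"): handle_discovery_request}

# Current style: key on the root element name alone.
NAMED_HANDLERS = {"discovery_request": handle_discovery_request}


def dispatch_typed(xml_text):
    """Dispatch a message that carries request/response in a <type> element."""
    root = ET.fromstring(xml_text)
    key = (root.tag, root.findtext("type"))
    handler = TYPED_HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"Unexpected message: {key}")
    return handler(root)


def dispatch_named(xml_text):
    """Dispatch a message that carries request/response in the element name."""
    root = ET.fromstring(xml_text)
    handler = NAMED_HANDLERS.get(root.tag)
    if handler is None:
        raise ValueError(f"Unexpected message: {root.tag}")
    return handler(root)


print(dispatch_typed("<discovery><type>request</type></discovery>"))
print(dispatch_named("<discovery_request/>"))
```

With a table, both styles come down to a single lookup; the typed style needs one extra read of the type element before the lookup can happen, and, as noted earlier in the thread, that value is not visible to XML schema validation.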

jordan2175 commented 9 years ago

@MarkDavidson We have a lot of APIs that need to still be built for other languages, think Objective-C for one. It would be nice if you only had to build one struct per message type. This would reduce code overlap and simplify inheritance for languages with it. I think it would also make it easier to understand holistically for a developer using the API if they had a single message to learn to work with for each type.

The decision tree already needs to account for a response message or a status message. The client would never get a request message back, at least not with the current 1.1 spec. In my opinion, having just spent the past few weeks writing the TAXII APIs for Go, building the messages as I have suggested would make a lot of things easier and cleaner. This would also allow for some interesting numerical representations when we move to a binary representation.
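
The "one struct per message type" idea might look roughly like the Python sketch below. The class and field names are hypothetical and loosely follow the renames proposed earlier in the thread; this is not taken from the Go or libtaxii code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    REQUEST = "request"
    RESPONSE = "response"


@dataclass
class Discovery:
    """One structure covering both discovery_request and discovery_response."""
    kind: Kind
    id: str
    in_response_to: str = ""  # only meaningful on a response
    services: list = field(default_factory=list)  # only populated on a response


request = Discovery(kind=Kind.REQUEST, id="1")
response = Discovery(
    kind=Kind.RESPONSE,
    id="2",
    in_response_to=request.id,
    services=[{"type": "DISCOVERY", "address": "/discovery"}],
)
print(response.kind.value, response.in_response_to)
```

The trade-off is visible in the comments: a single structure means fewer types to write per language, but some fields are only valid for one kind, so that check moves from the type system into runtime code.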

MarkDavidson commented 9 years ago

@jordan2175,

I'll challenge a statement you made:

"The client would never get a request message back"

This is certainly true in the main success scenario, but it's well within the realm of possibility that a poorly written or maliciously written TAXII Server would respond with something outside the main success scenario. IMO, a well coded TAXII client would not make the referenced assumption - it would explicitly check the response given to make sure it matched the expected response.

I look at it this way - the name of the message (discovery, inbox) and the request/response information should always be used together to make processing decisions. Not using both pieces of information - to me - seems to be making the dangerous assumption that the server's response can be trusted to be correct.

I think this might be the heart of the discussion - whether the message name and request/response information should always be used together when making processing decisions.

What do you think? I'm attempting to get to the specific point where our opinions differ. My apologies if I've misunderstood your point.

Thank you. -Mark

jordan2175 commented 9 years ago

@MarkDavidson Starting at line 126 of https://github.com/jordan2175/freetaxii/blob/master/client/discovery-client.go, that is exactly what I am doing. And I agree with you, clients should not trust the TAXII server. The reason for my suggestion is that I believe it will make tooling easier, especially for statically typed languages. Further, I believe it will make things easier for future binary representations as well, especially if we had 5 messages with subtypes instead of 11 separate message types.