StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.1k stars 746 forks source link

Move from pickle to JSON serializer for message bus messages #1761

Open Kami opened 9 years ago

Kami commented 9 years ago

(I've mentioned this multiple times before, but now I'm creating ticket so we can track / prioritize it).

We should move from pickle which we use right now to the JSON serializer for the messages which are dispatched on the RabbitMQ message bus.

Using pickle is fine if all data which goes through the message bus is controlled and generated by us (and if you don't mind about cross Python version and / or cross programming language compatibility), but this is not the case here (triggers can contain arbitrary data since we poll 3rd systems, etc.). This caries a big risk since if data is maliciously constructed.

Switching away from pickle means we won't be able to directly dispatch class instances to the messages bus anymore and we will need to use something like Thrift / Protobus and define structure for all the classes and data we want to pass through the message bus. An alternative would be to write a custom JSON serializer for all the classes we want to dispatch on the message bus, but this requires more work and is less portable then using something like Thrift.

That's actually a good thing since those classes will now be decoupled from Python and we will be able to use Thrift / Protobuf definitions to work with them in other languages (e.g. JavaScript).

lakshmi-kannan commented 9 years ago

+100.

manasdk commented 9 years ago

msgpack might work also?

So it is primarily a security issue or a correctness issue?

Kami commented 9 years ago

It's primarily a security issue and secondary a "portability" issue (using thrift / protobuffs / msgpack /etc.) makes it easier to re-use the models and other classes in other languages.

manasdk commented 9 years ago

@Kami Ok. Portability is not a concern in the short-term - right?

Security is always a concern and maybe that is something which raises priority. Again, the attack vector and impact would be important in working out the relative priority.

The reason I picked pickle was to avoid dealing with serialization in the short term . The assumptions which I made were -

  1. The ingress and egress points are StackStorm and we know it will only be python in the medium term (there is no requirement yet that violates this assumption)
  2. We have control over messages and therefore can exert some control on the content so if some special encoding is required that can be provided to plug the security hole.

I do feel we ought to move over to json/msgpack/protbuf i.e. some portable serialization structure. This serialization most likely involves very strict model definitions etc and formalizing the content schema which could be somewhat tedious. Therefore my concern is this change is not simple so how far can we push it out?

Kami commented 9 years ago

Yes, I agree that portability is not a big issue or a priority right now, but once we will be addressing that, we will also need to address serialization issue and if we use something like Thrift we won't be reinventing the wheel and we will solve two problems at once.

And yeah, you are right about formalizing the schema - there might be some workarounds we could take in the short-term (e.g. write a custom "dynamic" JSON object serializer and deserializer - e.g. something like to_serializable_dict).

Another workaround / improvements in some cases could also be to just send around PK and the retrieve the object at the other end. This obviously might not always work or be the most efficient, but in some cases it might be better since it ensures we are working with the latest version of the object.

manasdk commented 9 years ago

@Kami In your opinion can we ship 1.0 without making this change?

Kami commented 9 years ago

I was just thinking about that today (blockers and things we should do for v1).

The following things were on my list as blockers for v1:

  1. Pack management API
  2. Schema and data migrations
  3. More reliable and stable remote SSH runner

In addition to that, I also think we should move message bus serialization to JSON before v1. Doing it post v1 would be a lot harder, especially since we will (probably) support more older versions and non-clean install upgrades then.

Itxaka commented 9 years ago

Sorry to chime in, but this:

Another workaround / improvements in some cases could also be to just send around PK and the retrieve the object at the other end. This obviously might not always work or be the most efficient, but in some cases it might be better since it ensures we are working with the latest version of the object.

Its a very important point that we have sufferend in our own system with Celery and passing objects around. It turned to be a mess as due to the nature of async tasks we got objects out of date, or 2 tasks may interact with the object at the same time so you overwrite stuff all around. Pointers to the object and obtaining it fresh just before doing anything, for the cost of a few milliseconds to obtain it, its the way to go IMO.

Kami commented 9 years ago

https://github.com/StackStorm/st2contrib/pull/257 would also be detected earlier (and we would save debugging time) if we used a strict data format and class definitions since it would fail on serialization and not only on de-seralization.

Kami commented 9 years ago

Sorry to chime in, but this:

Another workaround / improvements in some cases could also be to just send around PK and the retrieve the object at the other end. This obviously might not always work or be the most efficient, but in some cases it might be better since it ensures we are working with the latest version of the object. Its a very important point that we have sufferend in our own system with Celery and passing objects around. It turned to be a mess as due to the nature of async tasks we got objects out of date, or 2 tasks may interact with the object at the same time so you overwrite stuff all around. Pointers to the object and obtaining it fresh just before doing anything, for the cost of a few milliseconds to obtain it, its the way to go IMO.

Yeah, although if your message bus enforces message delivery order, passing around "simple" data (e.g. state changes) is better. Especially in distributed systems (think state machines).

bri365 commented 8 years ago

+100 would have saved a few days debugging #2926