Open Dieterbe opened 8 years ago
Hmm seems like the code recently moved from csv to multiple -a's. see 9a825b777e9bf44ecf1ce80ea05c6d13b07c755a and 87b6b616a0643f9c71b68a67996ff55da666b1d7
though 629724a77bd7013f3e5b99a853473eb4d310c6fb was just merged which introduces a csv based example again.
Your findings are pretty much correct: Toxiproxy 2.0.0's cli uses comma separated attributes, and only the last -a
flag is used due to how the cli parser works.
This was changed for the upcoming 2.1.0 cli, which fixes the multiple -a
issue, but removes the comma separated format due to issues with string values.
I'm assuming the reason the first json blob you looked at seems invalid is because of the blank name. This is actually fine since the server will use the default name of <type>_<stream>
, but this was also changed for 2.1.0 to have the cli handle this default as well.
As for your jitter=500 latency=1000 test, the expected behaviour would be random latency between 500ms and 1500ms. I'm not sure how much latency you were actually getting based on your description, but there might be something else going on if you were getting any less than 500ms latency on anything.
Sorry for all the confusion though, the order release / doc changes are being merged in don't line up as well as the probably should.
I'm assuming the reason the first json blob you looked at seems invalid is because of the blank name.
it looked invalid to me because the latency attribute has a value of 1000,jitter=500
which the http api can't decode because it expects an integer.
this should result in a cleaner error for the user. suggestions:
As for your jitter=500 latency=1000 test, the expected behaviour would be random latency between 500ms and 1500ms. I'm not sure how much latency you were actually getting based on your description, but there might be something else going on if you were getting any less than 500ms latency on anything.
yes, it seemed like a bunch of my requests were responding in a few ms, this for minutes in a row with many a few hundred req/s, as if no latency nor jitter was being introduced at all. the chart looked like a flat line at 3ms or so. but this was using the wrong csv approach so maybe the toxic was not activated at all and the latency that I did see might have been the consequence of my benchmark workload.
As for the problem with toxiproxy-cli toxic add -t latency -a latency=1000 -a jitter=500 cassandra
resulting in latency=0, I'm guessing the docker image is not up to date. when i build the tools locally i have to use the multiple -a approach, but in docker that doesn't seem to work. this matches your description of the issue where it only applies the last attribute (the jitter in this case)
FWIW in our projects we use CircleCI which has a free plan for open source projects, and we use it to automatically push docker images to dockerhub whenever a build succeeds (see https://github.com/raintank/metrictank/blob/master/circle.yml for example) this approach is working well for us.
update regarding the "it seems to be delaying requests some of the time but also long periods of undelayed, non-toxic proxying." problem:
1) using #140 i built a new docker container that includes the latest stuff from master.
2) i did a cycle where i run toxiproxy-cli toxic add -t latency --tox 0.5 -a latency=3000 cassandra
and then toxiproxy-cli toxic delete -n latency_downstream cassandra
a few times back and forth (with a couple minutes in between), while a tool is running that queries at a rate of 100Hz (see blue line)
check out the this chart where I did this 3 times
it shows the avg and max duration of a cassandra request (capped at 2s because my application times out queries and aborts them after 2s, that's why we're not seeing 3s there), measured each second. here are my observations: A) the toxic works more or less as expected for about 30s. the observed max values hit 2s although it doesn't look like a 50% rate since some seconds pass with a max of 3ms and an avg of 0ms. anyway this decrepancy might be due to statsd/graphite related artifacts or perhaps the way my app measures. B) afterwards, the toxicity seems to just disappear, always after 30s. when I then run the delete there's no difference whatsover in the numbers. if I then add the toxic again, it seems to work again (for 30s)
at one point i thought perhaps the cassandra client does something weird where it disconnects the original connection and then connects directly to my cassandra server which would explain why the toxic effect disappears, but everytime i delete and readd the toxic the toxicity reappears so that disproves that idea.
PS: sorry for throwing a bunch of possibly unrelated stuff into the same ticket, but it's not yet clear to me what are still actual issues and how to cleanly separate them from one another. I hope we can solve and clarify much of this before the need to open new tickets, and i hope to contribute solutions as well.
To be clear about versioning, currently git/master is "unreleased" and is subject to change. It's effectively development for Toxiproxy 2.1.0 right now. Docker and all release builds are Toxiproxy 2.0.0. The readme should also be documentation for 2.0.0 only, but it looks like I made a couple mistakes when merging things (go through and fix that soon).
Toxicity is per-connection, so what I think is happening is probably that your cassandra client is using a connection pool with persistant connections, so after 30 seconds of timeouts, the bad connections are probably being removed from the pool, leaving only non-toxic connections.
And lastly about the cli problems in master, I missed the bad attributes the first time I looked. I'll have to make some changes so that the error is better. The cli is designed to be independant of what toxics are defined, so that if a new toxic is added, the cli and api will still work with older versions. Technically "1000,jitter=500" might be a valid attribute for some other toxic type, so the error should be handled on the server.
I'd recommend using version 2.0.0 until 2.1.0 is fully released, since it will have the least issues.
problem 1) it's unclear how to specify multiple attributes when adding a toxic. the example in the readme only uses one attribute (https://github.com/Shopify/toxiproxy#cli-example) and the cli help docs don't specify how to specify multiple attributes (see help msg below). as a user I'm guessing either specify
-a key=val,key2=val2
or-a key=val -a key2=val2
problem 2) either approach i guessed above seems to work sort of, but things aren't exactly working.
approach 1 - comma separated items - seems to be the right way, see this run in the latest docker container...
... so the inspect output looks good. however, I'm having issues getting it to work. it seems to be deplaying requests some of the time but also long periods of undelayed, non-toxic proxying.
so I tried the other way but now latency is set to 0 for some reason, so maybe it's the comma separated approach after all? edit: in testing the toxic does seem to work effectively
When I try this outside of the docker container, using the latest code in git, the comma separated approach doesn't work at all and I have to switch to multiple -a args
the first one submits json like:
which looks invalid. the second approach submits: