A segfault is very strange given that the module is pure Python. The first thing that jumps out at me is ujson, which is a wrapper around a C library; I would expect that to be the most likely cause of segfaults in your script.
Also note that you do not need to import differently in 0.9.5 vs. 1.0.0 -- just import everything from kafka instead of kafka.client / kafka.consumer / kafka.producer.
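For example, a quick sketch of the imports for the classes in question (the top-level form works on both versions):

```python
# Import from the top-level package -- works on 0.9.5 and 1.0.0 alike.
from kafka import SimpleConsumer, KeyedProducer

# The 0.9.5-era submodule paths still resolve, but are not needed:
#   from kafka.consumer import SimpleConsumer
#   from kafka.producer import KeyedProducer
```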
Thanks for the test script, I'll play around and see what I find.
The import difference was to be sure we were using the same client that was available in 0.9.5. ujson shouldn't have had any effect; the segfault happened a few minutes after the last use of it. I was testing this in a Vagrant environment, which you can use for your own testing if you want. I was able to reproduce the segfault in multiple VMs built from the same Vagrantfile.
The performance regression seems to be due to the SO_RCVBUF handling that was added for KafkaConsumer / KafkaProducer. I can patch that to behave as before when using the old interfaces. That patch gives me 50K messages per sec in my local testing.
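For context, roughly what that handling amounts to at the socket level (an illustrative sketch only, not the actual kafka-python code or the patch; the parameter name here is an assumption):

```python
import socket

# Illustrative sketch: explicitly setting SO_RCVBUF replaces the kernel's
# auto-tuned receive buffer, which can throttle throughput for the old
# interfaces. Leaving it unset (None here) keeps the pre-1.0 behaviour.
def configure_socket(sock, receive_buffer_bytes=None):
    if receive_buffer_bytes is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, receive_buffer_bytes)
    return sock
```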
I have not been able to reproduce the segfault. I remain skeptical because there is no C code in kafka-python. Any further information you can provide on that would be helpful. Otherwise, I'm not planning to investigate that part any further.
I'm still able to make it segfault consistently in my Vagrant environment, even with ujson replaced with json. I can't seem to get a viable core file out of it; most of them are truncated for some reason. With my most recent Vagrant build, using Kafka version kafka_2.9.2-0.8.1.1, I was unable to get a segfault initially. I placed the test script I provided in a bash while loop, and after a short while it was segfaulting on every run. That behaviour persists after letting the Vagrant environment sit idle overnight.
The fd leak you fixed yesterday looks like it would be pretty minor if I'm reading it correctly. Do you think it's worth testing the kafka-python client with that fix to see if the segfaults go away?
If you want to try using the same environment, the steps are as follows:
The fd fix only affects the new classes. You are testing with the old classes, so I don't think it will matter. You might try installing with PR 557 though. I believe that should at least improve the performance you are seeing.
I captured a stack trace that shows the program segfaulting while trying to construct a new tuple. At the time it segfaulted it had allocated slightly over 9 GB of RAM, so it looks like an out-of-memory error. If I can figure out how to get the gdb Python extensions installed in my environment I'll be able to give more information.
The stack trace, for what it's worth:

```
Program received signal SIGSEGV, Segmentation fault.
0x0000000000537388 in _PyObject_GC_Malloc ()
(gdb) where
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe5b8) at libc-start.c:287
```
It looks like it's running into the out-of-memory killer, which explains why you can't reproduce it. My Vagrant VM has fairly limited resources.
```
[ 5285.531613] Out of memory: Kill process 16916 (python) score 759 or sacrifice child
[ 5285.534207] Killed process 16916 (python) total-vm:9823872kB, anon-rss:9717040kB, file-rss:0kB
```
Running top alongside my test script, I can see it allocate several hundred MB every five seconds or so.
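For what it's worth, the same growth can be logged from inside the script instead of eyeballing top; a rough Linux-only sketch (the helper below is mine, not part of the test script):

```python
import os
import time

# Read the resident set size of a process from /proc (Linux only).
def rss_mb(pid=None):
    pid = pid or os.getpid()
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0  # value is reported in kB
    return 0.0

# Call from a background thread (or sprinkle rss_mb() calls into the main loop)
# to log memory every five seconds, matching the interval observed in top.
def log_memory(interval=5):
    while True:
        print('RSS: %.1f MB' % rss_mb())
        time.sleep(interval)
```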
@joe-keen 1.0.2 was just released. If you have time, you might test it out and see whether you hit the same problems.
I am unable to reproduce -- please reopen if you see similar behavior on the latest release.
I have a test script that uses a KeyedProducer and a SimpleConsumer to test consumer performance at various fetch sizes, starting at 50K and going up to 2M in increments of 25K.
In 0.9.5 I see read speeds from the SimpleConsumer of 50K to 60K messages per second, depending on the fetch size. In 1.0, using the same test script and the same objects, I see speeds in the 7K to 10K range, and it segfaults before it reaches the 2M fetch size. I've seen segfaults at fetch sizes as low as 225K.
Attached is the test script I use. kafka_test.zip
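For reference, a rough sketch of the kind of loop the script runs -- this is not the attached kafka_test.zip, and the topic name, message count, and the buffer_size/max_buffer_size knobs are assumptions:

```python
import time

# 1.0 exposes the old client as SimpleClient; 0.9.5 only has KafkaClient.
try:
    from kafka import SimpleClient as KafkaClient
except ImportError:
    from kafka import KafkaClient
from kafka import KeyedProducer, SimpleConsumer

client = KafkaClient('localhost:9092')

# Seed the topic with keyed messages.
producer = KeyedProducer(client)
for i in range(100000):
    producer.send_messages('perf-test', 'key-%d' % i, 'message-%d' % i)

# Read them back at fetch sizes from 50K to 2M in 25K increments.
for fetch_size in range(50 * 1024, 2 * 1024 * 1024 + 1, 25 * 1024):
    consumer = SimpleConsumer(client, 'perf-group', 'perf-test',
                              buffer_size=fetch_size, max_buffer_size=None)
    start = time.time()
    msgs = consumer.get_messages(count=100000, block=True, timeout=30)
    elapsed = time.time() - start
    print('fetch_size=%d: %.0f msgs/sec' % (fetch_size, len(msgs) / elapsed))
```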