Thread leak - Githubissues

fandrei commented 11 years ago

When running in bad network conditions, CIAPI.CS spawns a lot of threads and their count grows infinitely. To reproduce, setup connection through WANem http://labs.cityindex.com/labs-team/2013/06/05/emulating-bad-internet-connection-using-wanem/ (setting 20% packet loss worked for me) and run CiapiLatencyCollector with -debug key for several hours.

fandrei commented 11 years ago

clipboard01

bitpusher commented 11 years ago

this evidence is anecdotal and shows not a thread leak in ciapi.cs, rather it shows a thread leak in appmetrics.

please support conclusions with evidence.

fandrei commented 11 years ago

I already have checked my code. You'd better try to reproduce bug before blaming me.

fandrei commented 11 years ago

PS leaked threads have a stack trace like this:

ntdll.dll+0xe4f4 clr.dll+0xb516a clr.dll+0xb4f98 clr.dll+0xb4dd8 clr.dll+0xb4e99 clr.dll+0x91329 [Managed to Unmanaged Transition] mscorlib.dll!System.Threading.WaitHandle.InternalWaitOne+0x2b mscorlib.dll!System.Threading.WaitHandle.WaitOne+0x2d mscorlib.dll!System.Threading.WaitHandle.WaitOne+0x10 DotNetClient_N2.dll!Lightstreamer.DotNet.Client.NotificationQueue.dequeue+0x118 mscorlib.dll!System.Threading.ThreadHelper.ThreadStart_Context+0x63 mscorlib.dll!System.Threading.ExecutionContext.Run+0xb0 mscorlib.dll!System.Threading.ExecutionContext.Run+0x2c mscorlib.dll!System.Threading.ThreadHelper.ThreadStart+0x44 [Unmanaged to Managed Transition] clr.dll+0x21db clr.dll+0x24a2a clr.dll+0x24bcc clr.dll+0x24c01 clr.dll+0x19b512 clr.dll+0xb5c05 clr.dll+0xb5c87 clr.dll+0xb5d42 clr.dll+0xb5dd9 clr.dll+0x19b3e5 clr.dll+0x19b2e0 clr.dll+0xb5a08 KERNEL32.dll+0xb713

bitpusher commented 11 years ago

this is where you seem to lose understanding of standard development process.

it is not my responsibility to reproduce bugs. it is your responsibility to provide evidence and reproducible results to support a bug report.

On Mon, Jun 10, 2013 at 3:16 AM, Andrei Faber notifications@github.comwrote:

PS leaked threads have a stack trace like this:

ntdll.dll+0xe4f4 clr.dll+0xb516a clr.dll+0xb4f98 clr.dll+0xb4dd8 clr.dll+0xb4e99 clr.dll+0x91329 [Managed to Unmanaged Transition] mscorlib.dll!System.Threading.WaitHandle.InternalWaitOne+0x2b mscorlib.dll!System.Threading.WaitHandle.WaitOne+0x2d mscorlib.dll!System.Threading.WaitHandle.WaitOne+0x10

DotNetClient_N2.dll!Lightstreamer.DotNet.Client.NotificationQueue.dequeue+0x118 mscorlib.dll!System.Threading.ThreadHelper.ThreadStart_Context+0x63 mscorlib.dll!System.Threading.ExecutionContext.Run+0xb0 mscorlib.dll!System.Threading.ExecutionContext.Run+0x2c mscorlib.dll!System.Threading.ThreadHelper.ThreadStart+0x44 [Unmanaged to Managed Transition] clr.dll+0x21db clr.dll+0x24a2a clr.dll+0x24bcc clr.dll+0x24c01 clr.dll+0x19b512 clr.dll+0xb5c05 clr.dll+0xb5c87 clr.dll+0xb5d42 clr.dll+0xb5dd9 clr.dll+0x19b3e5 clr.dll+0x19b2e0 clr.dll+0xb5a08 KERNEL32.dll+0xb713

— Reply to this email directly or view it on GitHubhttps://github.com/cityindex/CIAPI.CS/issues/230#issuecomment-19190658 .

bitpusher commented 11 years ago

i have not 'blamed' you or anyone.

i simply state that the anecdotal evidence that you provide shows a defect in appmetrics.exe.

that you believe, rightly or wrongly, the defect to be in a dependency is not sufficient evidence to support a bug report.

this is not the first time that you have submitted a 'bug' by simply thowing down a gauntlet and declaring 'there is a bug, it is your responsibility to find and reproduce it'.

this is not productive.

fandrei commented 11 years ago

And as far as I remember, these 'bugs' truly were bugs. The stack trace obviously shows these threads are spawned by the code related to Lightstreamer . And explanation how to reproduce this bug is in the very first message. Have you tried it?

mrdavidlaing commented 11 years ago

@bitpusher, @fandrei - its time to stop locking horns and figure out how to work together to reproduce & resolve the problem.

I propose the following:

@mrdavidlaing sets up a shared EC2 instance with V2012 and sends everyone login details
@fandrei, you to setup the CIAPI Latency Collector project & WANem on this instance, and set things running to reproduce the issue.
@fandrei updates this ticket alerting with instructions on how to get your hands on a stack trace like the one you have documented.
@bitpusher & @mrdavidlaing run the test case to "experience" the issue.
@bitpusher creates some "test" code within the CIAPI.CS project that replicates what is happening in the CIAPI Latency Collector project.
@bitpusher install said "test" code on the shared EC2 instance, and updates this ticket alerting with instructions on how to run the test case.
@fandrei and @mrdavidlaing run the CIAPI.CS test case to verify that it reproduces the issue accurately.
When everyone has a shared understanding of the issue, we brainstorm potential solutions.

Please indictate your acceptance of this proposal, or counter propose another plan

fandrei commented 11 years ago

@mrdavidlaing, does EC2 allow to boot instances from LiveCD? Alternatively, I can just "freeze" my testing VM and share it, so @bitpusher can resume it and debug.

mrdavidlaing commented 11 years ago

@fandrei - EC2 instances can only be booted from EC2 AMIs. Whilst it might be possible to convert a LiveCD to an AMI, we need a shared DEV instance anyway, so I'd like to proceed down that route.

mrdavidlaing commented 11 years ago

@fandrei - what is your testing setup? How many VMs? VirtualBox?

fandrei commented 11 years ago

@mrdavidlaing, 2 VMs (WANem and client). VMware.

Also we can try this software: http://www.softperfect.com/products/connectionemulator/ It installs as a normal program (WIndows only, though). And I'm not 100% sure it can work inside EC2. Need to check first.

mrdavidlaing commented 11 years ago

@fandrei - lets see if we can get http://www.softperfect.com/products/connectionemulator/ or http://www.logic-worx.com/index.php/tools-and-apps/fiddler-connection-simulator/ working on EC2 - sharing machine "state" by copying VM images around sounds really slow to me.

mrdavidlaing commented 11 years ago

@bitpusher - please can you indicate acceptance of my proposed plan of action, or an alternative. Thx

fandrei commented 11 years ago

Another alternative is the Network Simulation Tool http://tmurgent.com/Tools.aspx Works as a proxy, what should be enough. And it certainly will work even in restricted conditions.

bitpusher commented 11 years ago

@mrdavidlaing yes, sorry. i was taking some time to figure out why i woke up in shoot first mode monday and doing something about it.

mrdavidlaing commented 11 years ago

Machine is in us-east-1 named: DEV machine for CIAPI.CS/issues/230 - i-9337dcfc. See the Credentials tag for login details. The machine will autostop when not used for more than 1 hour; so you'll probably need to start it up via the AWS control panel.

I installed VS2012, CIAPILatencyCollector and http://www.logic-worx.com/index.php/tools-and-apps/fiddler-connection-simulator/; and whilst I could run CiapiLatencyCollector with -debug, Fiddler didn't seem to think it was making any requests.

@fandrei - over to you.

fandrei commented 11 years ago

@mrdavidlaing, thanks. I've finished the most of my tasks in csharp-bot-development-flow, so will work on this tomorrow.

fandrei commented 11 years ago

@mrdavidlaing, looks like proxy-based tools don't work well due to SSL. Can we buy this one? http://www.softperfect.com/products/connectionemulator/ (it's trial is limited to 30 second sessions)

mrdavidlaing commented 11 years ago

Sure. Buy the pro one and send me the invoice - I'll reimburse you via ODesk.

Regards David

On 13 Jun 2013, at 08:16, "Andrei Faber" notifications@github.com wrote:

@mrdavidlaing, looks like proxy-based tools don't work well due to SSL. Can we buy this one? http://www.softperfect.com/products/connectionemulator/ (it's trial is limited to 30 second sessions)

— Reply to this email directly or view it on GitHub.

This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is confidential, subject to copyright or which constitutes a trade secret. You should not copy it for any purpose or disclose its contents to any person. City Index Ltd has taken reasonable precautions to minimise the risk of transmitting software viruses but we advise you to carry out your own virus checks on any attachment to this message. We cannot accept any liability for any loss or damage resulting from any software viruses. If you receive this email in error, please telephone our postmaster on +44 20 7550 8500 or email postmaster@cityindex.co.uk and delete all copies on your system.

City Index Limited is authorised and regulated by the Financial Services Authority. FSA Register Number: 113942. Registered Office: Park House, 16 Finsbury Circus, London, EC2M 7EB. Registered in England and Wales, number: 1761813.

fandrei commented 11 years ago

This program doesn't work under EC2 :-/ Fortunately I haven't bought a license yet. Will try to make it working via a proxying program.

fandrei commented 11 years ago

My results so far: Fiddler+Bandwidth addon successfully logs requests, but bandwidth simulator seems to do nothing, with any options I've tried. TMnetsim: managed to get it working by suppressing SSL errors and launching 2 instances (for CIAPI and streaming server); however, streaming doesn't start due to errors. No success yet.

mrdavidlaing commented 11 years ago

Grr. Keep trying today; and if you don't get anywhere lets stop and reconsider our options.

-----Original Message----- From: Andrei Faber [mailto:notifications@github.com] Sent: Thu 6/13/2013 11:48 AM To: cityindex/CIAPI.CS Cc: David Laing Subject: Re: [CIAPI.CS] Thread leak (#230)

My results so far: Fiddler+Bandwidth addon successfully logs requests, but bandwidth simulator seems to do nothing with any options I've tried. TMnetsim: managed to get it working by suppressing SSL errors and launching 2 instances (for CIAPI and streaming server); however, streaming doesn't start due to errors. No success yet.

Reply to this email directly or view it on GitHub: https://github.com/cityindex/CIAPI.CS/issues/230#issuecomment-19384462

This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is confidential, subject to copyright or which constitutes a trade secret. You should not copy it for any purpose or disclose its contents to any person. City Index Ltd has taken reasonable precautions to minimise the risk of transmitting software viruses but we advise you to carry out your own virus checks on any attachment to this message. We cannot accept any liability for any loss or damage resulting from any software viruses. If you receive this email in error, please telephone our postmaster on +44 20 7550 8500 or email postmaster@cityindex.co.uk and delete all copies on your system. City Index Limited is authorised and regulated by the Financial Services Authority. FSA Register Number: 113942. Registered Office: Park House, 16 Finsbury Circus, London, EC2M 7EB. Registered in England and Wales, number: 1761813.

fandrei commented 11 years ago

Okay, it seems the SoftPerfect product was not working due to wrong configuration. So I'm buying it.

fandrei commented 11 years ago

I've created an isolated test; it's in the c:\temp_test folder https://github.com/fandrei/SimpleCiapiTest After running for about 20 minutes with 20% packet loss, it spawned about 90 threads. So it seems we have success here. One drawback is that connection emulator affects all connections via the same network interface, and it's hard to do anything with this instance when network failures are turned on :-/ Probably we can make it stop on schedule. Or create additional network interface. But I'm over for today.

mrdavidlaing commented 11 years ago

@fandrei - thanks for getting this test environment setup. One final question - what are the steps I need to run to execute your test?

Start SoftPerfect - how do I do this?
Run the exe in c:\temp_test? From the command line? From VS?

fandrei commented 11 years ago

for debugging purpose, it's better to checkout source code, put credentials into Const class and then start without any special keys
start SoftPerfect from the shortcut, choose network interface "Ethernet", set packet loss 20% and press"start"

mrdavidlaing commented 11 years ago

@bitpusher - over to you

mrdavidlaing commented 11 years ago

@bitpusher - could you give me an update on your progress here please

bitpusher commented 11 years ago

I am in the process of figuring out how to stop the connection simulator after a period of time. as andrei said, once you start the packet loss the machine becomes unreachable.

i have tried using a batch file to taskkill after a timeout but that doesn't seem to work.

i believe that the best way is to write a small app that will focus consim and then sendkey to consim to stop the packet loss.

i will give that a shot today. i am also in the process of adding verbose logging to the code in question.

fandrei commented 11 years ago

SCE supports the following command-line parameters:

consim.exe [/runtime:120] [config_file.xml]

If SCE is launched with a XML-file specified, it will load the specified configuration profile and begin emulation automatically.

If the /runtime switch followed by a number of seconds is present, emulation will run for the specified number of seconds and then the application will terminate.

bitpusher commented 11 years ago

grr....

so i see the start/stop button has an accelerator key 's' so i whip up a nice little vbs

set WshShell = WScript.CreateObject("WScript.Shell")
WScript.Sleep 100
WshShell.AppActivate 3832
WScript.Sleep 100
WshShell.SendKeys"%s"
WScript.Sleep 1800000
WshShell.SendKeys"%s"

guess what..... there is another button, save as, using 's' as well so the start stop button is not accessible.

so, looks like a c# app that drills down the window handle is needed.

fandrei commented 11 years ago

@bitpusher, see above

mrdavidlaing commented 11 years ago

@bitpusher, @fandrei - since you are both online currently, why not hop into the Google Hangout room and work together

bitpusher commented 11 years ago

i am nodding off already. andrei has done the homework digging up the command line switches so i will try them out in the morning when i have some verbose logging implemented.

the app seems to save last used parameters so an xml config should not be necessary.

mrdavidlaing commented 11 years ago

@bitpusher - do you have any updates on this issue?

bitpusher commented 11 years ago

@mrdavidlaing yes i do have observations to share. i am currently running extended tests to support said observations.

bitpusher commented 11 years ago

Ok, I can now, with confidence, address this issue.

There is no thread leak or memory leak in the CIAPI.CS library.

The fault tolerance in our lightstreamer code uses phased threads to ensure proper event delegation.

Each long polling http connection is hosted on it's own background thread.

When a long polling http connection is interrupted, by network instability for instance, a new connection must be initiated (and the phase of the listener incremented) and associated with the listener.

The old connection needs to be cleaned up, which takes some time, and happens on it's thread.

The thread count of an application using CIAPI.CS may fluctuate but will always, eventually, equalize and return to nominal state as has been proven using the VM and simpleciapitest provisioned and described earlier in this issue.

I have run the simpleciapitest for extended periods of time (48+ hours), introducing periods of network failure ranging from 1% to 20% packet loss for periods of up to 2 hours and the thead count and memory usage always returns to initial state over a period of time.

The conditions necessary to result in an OOM type error are far and beyond even an edge case.

Let us stipulate that when we talk about packet loss that we are speaking of upstream, provider infrastructure NOT bottleneck created by overloading or saturating the endpoint as in having very low or very unstable bandwidth or overloading with traffic.

Typically, 'acceptable' packet loss is 0.1%. Anything over 1.0% for any measurable period typically results in red flags and QOS agreements to go into affect.

5.0% packet loss is indicative of a serious network failure.

20% packet loss required to reproduce this issues results are catastrophic and is far beyond an edge case scenario thus should be viewed as anecdotal. I am unable to reproduce a fatal exception in the VM even with 20% packet loss for a period of 2 hours. But again, this is beside the point.

My conclusion is that there is not a thread or memory leak exposed by SimpleCiapiTest.exe.

The appearance of a thread/memory leak is due to unrealistic/unreasonable network problems that cause the fault tolerance code to spin up new connections faster than the old connections are cleaned up.

mrdavidlaing commented 11 years ago

@fandrei - Given that thread cleanup happens faster than thread creation under "normal" packet loss conditions ( < 5%), and only becomes a problem under exceptional conditions (20% packet loss); I agree with @bitpusher that the cost of re-architecting to fix this problem outweighs the benefits.

Do you agree?

fandrei commented 11 years ago

@mrdavidlaing, @bitpusher, I'm not 100% sure this is the same problem I experienced. In my case, there were some threads never stopping; 20% packet loss was used only to make the problems surface faster. I will run more tests. Don't stop that instance please.

As a side note: spawning multiple threads for short running tasks doesn't sound like a good idea for me.

fandrei commented 11 years ago

This chart shows thread count during my test run: clipboard04

I set connection emulator to 256 Kbps, latency 1 sec and 'acceptable' packet loss 0.1%. During overnight run, thread count grown up to ~160 threads. At ~2:00 am I turned off connection emulator, and thread count stopped growing. However, after running 10 hours in perfect network conditions I don't see any signs of returning back to normal.

The test is still running, so @bitpusher can logon and debug it.

fandrei commented 11 years ago

@mrdavidlaing, how to prevent the test instance from autostopping?

sopel commented 11 years ago

@fandrei - the automatic stop is orchestrated via the respective CloudWatch alarm AUTO-STOP-awsec2-DEV-machine-for-CIAPI-CS-issues-230-i-9337dcfc-High-CPU-Utilization, which is currently configured to stop the instance after 24 consecutive hours with an average hourly CPU utilization below 5%. I've checked and over the last 24 hours this value never dropped below 24% at all, so the running test seems to generate enough load to prevent the automatic stopping.

It's worth noting though that until 0100 UTC today this value has been pretty steady around 85% and dropped to around 28% thereafter since then, so something has changed at least.

fandrei commented 11 years ago

@sopel, thanks

mrdavidlaing commented 11 years ago

@sopel - will this instance get rebooted during out automated backup process?

sopel commented 11 years ago

@mrdavidlaing, @fandrei - of course it will, since it has a name, entirely forgot about that aspect; but the backup schedule is 0300 UTC currently, so given the Jenkins master has the correct time, this doesn't seem to correlate (though it would have been a good explanation for the change).

fandrei commented 11 years ago

@sopel, @mrdavidlaing, is it possible to prevent rebooting? This will destroy the state of the running test process.

mrdavidlaing commented 11 years ago

@fandrei - This should be possible; and I've attempted to implement a backup with no reboot for this instance.

@sopel - I've done the following - do you think this will work?

tagged the instance with backup-type="no-reboot"
amended Jenkins Labs AWS backup - create job to:

create-images.py --filter instance-state-name=running --filter tag:Name="*" --exclude tag:cost-centre="labs-build-infrastructure" --exclude tag:backup-type="no-reboot"
create-images.py --no_reboot --filter instance-state-name=running --filter tag:cost-centre="labs-build-infrastructure" --filter tag:backup-type="no-reboot"

sopel commented 11 years ago

@mrdavidlaing - no, unfortunately this won't work; I had already tested this (should have mentioned it, sorry) and been contemplating how to remedy the following issue with the botocross filter implementation:

filters follow fairly strict rules as follows:

You can specify multiple filter values. For example, you can list all the instances whose type is either m1.small or m1.large. The resource must match at least one of the values to be included in the resulting resource list. You can also specify multiple filters. For example, you can list all the instances whose type is either m1.small or m1.large, and that have an attached EBS volume that is set to delete when the instance terminates. The instance must match all your filters to be included in the results. [emphasis mine]

Now the --exclude logic so far simply acts like --filter and I'm excluding the resulting set from the included one, which means that multiple --exclude specifications must all match in order to yield a result, which his pretty much against what one usually want to achieve here (incidentally I tried the very same logic you applied now, which results in no instance matching the combined --exclude selectors and all being backed up therefore).

Consequently I need to adjust/amend the evaluation logic in botocross and haven't yet come to a conclusion what would be the best option in terms of user expectation vs. least amount of required changes.

:information_source: Btw., I'm testing the filters via describe-instances.py before using them to ensure they yield the desired result (see the Labs AWS backups - prepare job).

sopel commented 11 years ago

The easiest ad hoc workaround to enable the testing scenario would be to simply remove the Name tag from that instance, which might be acceptable given the limited use case?

cityindex-attic / CIAPI.CS

Thread leak #230