HSAnet / glimpse_client

GLIMPSE is an end host-based network measurement tool.
http://www.measure-it.net

HTTP download measurement does not finish #301

Closed by MKV21 9 years ago

MKV21 commented 9 years ago

This is the same as #233: the measurement starts and never returns, which blocks the whole application forever.

I have reviewed the code and fixed some minor issues, but that did not fix this. There might be a bug in the counters (connectedThreads, finishedThread, downloadingThread, ...): if they are not counted correctly (in an error case), the measurement might stall waiting for a thread which has already finished. The code is complex, but I could not find anything wrong with it.
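
To illustrate the suspected failure mode, here is a minimal sketch in standard C++ (not the GLIMPSE code; `WorkerTracker`, `FinishGuard` and `runWorkers` are hypothetical names). It assumes a coordinator that waits until a counter of unfinished workers reaches zero. If one worker leaves through an error path without updating the counter, the coordinator never sees zero. An RAII guard updates the counter on every exit path.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the measurement's bookkeeping.
class WorkerTracker {
public:
    explicit WorkerTracker(int workers) : remaining_(workers) {
        assert(workers >= 0);
    }

    void markFinished() {
        std::lock_guard<std::mutex> lock(mutex_);
        --remaining_;
        cv_.notify_all();
    }

    // Returns true if every worker reported in before the timeout expired.
    bool waitAll(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return remaining_ <= 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int remaining_;
};

// RAII guard: the destructor runs on normal return, early return and exception.
class FinishGuard {
public:
    explicit FinishGuard(WorkerTracker& tracker) : tracker_(tracker) {}
    ~FinishGuard() { tracker_.markFinished(); }
    FinishGuard(const FinishGuard&) = delete;
    FinishGuard& operator=(const FinishGuard&) = delete;

private:
    WorkerTracker& tracker_;
};

// Starts `workers` threads; worker 0 simulates an error and returns early.
// Returns whether the coordinator saw all workers finish within 200 ms.
bool runWorkers(int workers, bool useGuard) {
    WorkerTracker tracker(workers);
    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        threads.emplace_back([&tracker, i, useGuard] {
            const bool failed = (i == 0);
            if (useGuard) {
                FinishGuard guard(tracker);  // counted on every exit path
                if (failed) return;
                return;                      // normal work would go here
            }
            if (failed) return;              // error path skips the counter update
            tracker.markFinished();
        });
    }
    const bool allDone = tracker.waitAll(std::chrono::milliseconds(200));
    for (auto& t : threads) t.join();
    return allDone;
}
```

In this sketch `runWorkers(4, false)` times out because one update is missing, while `runWorkers(4, true)` completes. Without the timeout in `waitAll`, the first case would wait forever.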

I am going to apply a temporary fix today: a timer which stops the measurement after one minute. But if the measurement blocks somewhere and does not return to the event loop this will not fix the problem.
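
The limitation of a timer-based fix can be shown with a toy single-threaded event loop. This is a sketch under simplifying assumptions, not Qt's implementation; `ToyEventLoop` and `timeoutLatency` are hypothetical names. The timer is only checked between tasks, so a task that blocks delays the timeout handler until it returns, and a task that never returns prevents the handler from running at all.

```cpp
#include <chrono>
#include <functional>
#include <queue>
#include <thread>
#include <utility>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Toy event loop with one timer (hypothetical, far simpler than a real one).
class ToyEventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }

    void startTimer(Ms delay, std::function<void()> onTimeout) {
        deadline_ = Clock::now() + delay;
        onTimeout_ = std::move(onTimeout);
        timerActive_ = true;
    }

    // Runs until the task queue is empty and the timer has fired.
    void run() {
        while (!tasks_.empty() || timerActive_) {
            if (timerActive_ && Clock::now() >= deadline_) {
                timerActive_ = false;
                onTimeout_();
                continue;
            }
            if (!tasks_.empty()) {
                auto task = std::move(tasks_.front());
                tasks_.pop();
                task();  // a blocking task stalls the loop, including the timer
            } else {
                std::this_thread::sleep_for(Ms(1));  // idle until the deadline
            }
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
    std::function<void()> onTimeout_;
    Clock::time_point deadline_{};
    bool timerActive_ = false;
};

// Returns how long after starting the timer the timeout handler actually ran,
// given one task that blocks the loop for `taskDuration`.
Ms timeoutLatency(Ms timerDelay, Ms taskDuration) {
    ToyEventLoop loop;
    const auto start = Clock::now();
    Ms latency{0};
    loop.startTimer(timerDelay, [&] {
        latency = std::chrono::duration_cast<Ms>(Clock::now() - start);
    });
    loop.post([taskDuration] { std::this_thread::sleep_for(taskDuration); });
    loop.run();
    return latency;
}
```

With a 20 ms timer and a task that blocks for 150 ms, the handler in this sketch runs no earlier than 150 ms after the timer was started.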

MKV21 commented 9 years ago

The timer does not help; an HTTP download measurement can still get stuck. There has to be a bug between taskexecuture.cpp#33 and taskexecuture.cpp#92/102/121 which leaves the following code:

It cannot be in the result method of HttpDownload.

monstermunchkin commented 9 years ago

How exactly did you reproduce this bug/behaviour?

MKV21 commented 9 years ago

Unfortunately we cannot reproduce this bug at will. It seems to happen randomly on almost all probes after they have run for 2-48 hours (executing the HTTP download 12 times per hour). My own probe and Rolf's probe at our homes seem to crash very quickly, so my best guess is that it has something to do with the download speed (we both have 100 Mbit Internet) when the probe is installed on a low-powered machine. rack41, a server with 200 Mbit Internet, does not crash as quickly, but if I remember correctly it also crashed at some point.

I have added a new log message to see whether this happens within the HTTP download or in the TaskExecutor (but I have to wait until I am home to check the log).

MKV21 commented 9 years ago

I guess we have two bugs there:

  1. The thread->quit() and thread->wait() calls in downloadFinished() seem to be wrong, as quit() only makes sense if the threads have an event loop. I am not sure, though, why the wait() call would take forever in some cases. I have fixed that.
  2. Log output indicates a failure in the for loop in calculateResults(). I checked again: I had thought the probe was still running when the log output stops, but this is not the case. So there might be a division by zero or something similar in the calculations.

Item 1 should be fixed (but is not committed yet).
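
For item 2, here is a sketch of how a per-thread result loop could guard against a zero divisor. It is only an illustration: `ThreadSample` and `averageBytesPerSecond` are hypothetical and the real fields in calculateResults() may differ. Integer division by zero is undefined behavior in C++ and typically kills the process, so both the per-thread duration and the number of usable threads are checked before dividing.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-thread sample; not the actual GLIMPSE data structure.
struct ThreadSample {
    std::int64_t bytes;      // bytes received by this thread
    std::int64_t elapsedMs;  // time spent downloading, in milliseconds
};

// Computes the average throughput in bytes per second over all threads with a
// usable sample. Returns false (leaving `average` untouched) if there is none.
bool averageBytesPerSecond(const std::vector<ThreadSample>& samples, double& average) {
    double sum = 0.0;
    int usable = 0;
    for (const ThreadSample& s : samples) {
        if (s.elapsedMs <= 0) {
            continue;  // thread failed or finished instantly: skip, do not divide
        }
        sum += static_cast<double>(s.bytes) * 1000.0 / static_cast<double>(s.elapsedMs);
        ++usable;
    }
    if (usable == 0) {
        return false;  // nothing to average: avoid dividing by zero threads
    }
    average = sum / usable;
    return true;
}
```

A caller would report a failed measurement when the function returns false instead of publishing a result.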

monstermunchkin commented 9 years ago

Has #310 resolved this issue?

MKV21 commented 9 years ago

Resolved by #310 (hopefully).

MKV21 commented 9 years ago

aaaaaaaaaaand it happened again. Two probes crashed somewhere in HTTPDownload::calculateResults() after num_threads++.

MKV21 commented 9 years ago

#331 fixes some problems, maybe also this crash. There is still something of concern which Christoph and I introduced some weeks ago:

"21:17:29" "INFO" "void InternalTaskExecutor::execute(const ScheduleDefinition&, MeasurementObserver*)" : "Starting execution of httpdownload" QObject::moveToThread: Cannot move objects with a parent QObject::moveToThread: Cannot move objects with a parent QObject::moveToThread: Cannot move objects with a parent QObject::moveToThread: Cannot move objects with a parent "21:17:29" "INFO" "bool HTTPDownload::startThreads(const QHostInfo&)" : "Started 4 threads"