Server health tests are a little unfair due to absolute timing values

realkinetix commented 6 years ago

As per this thread, servers that are geographically further from a directory server get lower health scores because of the added latency.

There's been some discussion on how to adjust the absolute curl timing value, such as:

Grab the average or median value of a series of pings to the server in question, and subtract that from the curl time (perhaps with a correction factor - maybe double that ping value before subtracting due to TCP handshake response time, etc.)
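To make the idea above concrete, here is a rough sketch in Python. The function name, the millisecond units, the default factor of 2 and the clamping at zero are all assumptions for illustration, not how the directory currently works.

```python
from statistics import median


def corrected_curl_time(curl_time_ms, ping_samples_ms, correction_factor=2.0):
    """Subtract a latency estimate from the absolute curl time.

    Hypothetical helper: the median ping is multiplied by a correction
    factor (2.0 here, as a guess for the TCP handshake overhead) and
    subtracted from the curl time. Clamping at zero is an assumption.
    """
    if not ping_samples_ms:
        # No ping data available: leave the curl time unchanged.
        return curl_time_ms
    latency = median(ping_samples_ms)
    return max(0.0, curl_time_ms - correction_factor * latency)


# Example: a 500 ms curl time against three ping samples.
adjusted = corrected_curl_time(500, [100, 110, 300])
```

Using the median rather than the mean keeps a single slow ping (the 300 above) from dominating the correction.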

There may be more that could be done, but some fairly simple 'corrective' factors around that absolute curl process time would help when health scoring servers that are further away network-wise.

MrPetovan commented 6 years ago

I'll admit I didn't expect to have so much fun when I agreed to maintain the Friendica Directory.

AndyHee commented 6 years ago

Some preliminary results. Here is a comparison of the two directories, one in Western Europe and the other in Southeast Asia.

The datasets were taken at different times and have different total numbers of nodes (181 vs. 270).

Results based on this equation:

discounted_request_time = request_time - (avg_ping * coefficient)

The coefficients:

dir.friendica.social-20180430 = 3.49712908930528
dir.hubup.pro-20180512 = 0.146000348540368

AndyHee commented 6 years ago

@MrPetovan the explanation I gave for the coefficient is incorrect: https://github.com/friendica/dir/issues/43#issuecomment-386648866

I'll try to give the correct version shortly. I hope you have not already coded this.

AndyHee commented 6 years ago

Duplication of nodes

I have noticed there are some duplications in the base_url table.

Something like:

http://meld.de/
https://meld.de/

But we are quite sure there is only one node running there, despite the difference in protocols.

Even more concerning are duplicate entries with identical protocols. For instance, in Hypolite's dataset there are ten (10!) entries for https://libranet.de and about 15 for https://friendica.ladies.community, each with different request_time values.

What's going on there, and how can we fix it?

MrPetovan commented 6 years ago

The behavior is even stranger than you expect. These are the only 15 redundant base_urls in the dir.friendica.social database:

base_url	COUNT(*)
https://box25.it	7990
http://localhost	302
https://salesnet.tomeetu.de	22
https://friendica.ladies.community	17
https://libranet.de	10
https://privet.su	9
http://192.168.244.183	8
https://www.ladies.community	7
http://192.168.178.208	5
https://friendica.me	3
http://172.16.0.10	2
http://social.pelikancms.pl	2
https://social.gl-como.it	2
https://social.retr.co	2
https://friendica.christsmith.ca	2

The first issue is that there isn't a UNIQUE key on the base URL. The second issue is that there's no reduction to a normalized URL (without the http/https scheme), which would allow us to rule out HTTP/HTTPS duplicates.
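For illustration, here is a minimal Python sketch of what such a normalization could look like. The function names and the exact rules (lowercase host, drop the scheme and any trailing slash) are assumptions, not the directory's actual nurl logic.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(base_url):
    """Reduce a base URL to a scheme-less form (hypothetical rules)."""
    parts = urlsplit(base_url.strip())
    host = (parts.hostname or "").lower()
    if parts.port is not None:
        # Keep an explicit port, since it may point at a different service.
        host = f"{host}:{parts.port}"
    # Drop the trailing slash so "meld.de/" and "meld.de" compare equal.
    return host + parts.path.rstrip("/")


def group_duplicates(base_urls):
    """Return only the normalized URLs that occur more than once."""
    groups = defaultdict(list)
    for url in base_urls:
        groups[normalize_url(url)].append(url)
    return {nurl: urls for nurl, urls in groups.items() if len(urls) > 1}


duplicates = group_duplicates(
    ["http://meld.de/", "https://meld.de/", "https://libranet.de"]
)
```

Note that this sketch does not strip a leading "www.", so https://friendica.ladies.community and https://www.ladies.community would still count as separate nodes. A UNIQUE key on the normalized column would then be what prevents repeated rows for the same node.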

AndyHee commented 6 years ago

Ohh.. what did we let ourselves in for here... 😲

I think this issue has affected the stats somehow. The coefficients are too different. Could you run this query again with:

WHERE `request_time` IS NOT NULL
AND `avg_ping` IS NOT NULL

We would like to run some further tests. Thanks.

MrPetovan commented 6 years ago

Here you are:

2018-05-12-site-probe.csv.zip

I did deduplicate base_url but I didn't add the nurl column. It shouldn't skew the data too much.

MrPetovan commented 6 years ago

Of course, go ahead!

AndyHee commented 6 years ago

[OK, I deleted some of my redundant posts above]

Ko tested the two new datasets for us and we found some interesting developments. I'm summarising a three-page report here and will give the practical implication.

For "dir.friendica.social" the removal of duplicated nodes seemed to make the relationship between request_time and avg_ping even stronger. This is good.

However, for "dir.hubup.pro" the data showed there was no relationship between request_time and avg_ping. The rather different coefficients above (see https://github.com/friendica/dir/issues/43#issuecomment-388536649) were already some indication of this. These two graphs might give you some further idea of the problem. (screenshot_2018-05-14_19-50-59)

After removing all servers with a zero avg_ping value (and some outliers), we have now established that in the Thai dataset there is also a significant relationship between request_time and avg_ping. This is good, because it allows us to use the OLS equation as planned.

discounted_request_time = request_time - (avg_ping * coefficient)

Here are the coefficients (plus p-values for the likelihood of no relationship):

"dir.friendica.social" = 3.316080104 (p = 0.0)
"dir.hubup.pro" = 4.965583808 (p = 0.001)
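As a rough sketch, applying the equation to a single node could look like this in Python. The function name and the sample timings are made up for illustration; only the coefficient comes from the figures above.

```python
def discounted_request_time(request_time, avg_ping, coefficient):
    """request_time minus the ping-related share, per the equation above."""
    return request_time - (avg_ping * coefficient)


# Coefficient reported above for dir.friendica.social.
COEFFICIENT = 3.316080104

# Hypothetical node: request_time of 1000 and avg_ping of 100 (same unit).
discounted = discounted_request_time(1000.0, 100.0, COEFFICIENT)

# A node with a zero avg_ping simply gets no discount.
undiscounted = discounted_request_time(500.0, 0.0, COEFFICIENT)
```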

Practical implication

The calculation of the coefficient and the Q1, Q2, Q3, and IQR values (see here https://github.com/friendica/dir/issues/43#issuecomment-387473588) must exclude all nodes with a zero avg_ping. These nodes (provided they have a request_time that is not zero) will of course still get a health score, but will not contribute to determining the coefficient and speed zones.
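A possible way to compute those quartiles while leaving out the zero-ping nodes, sketched in Python. The dict layout, the function name and the choice of request_time as the value being ranked are assumptions; the actual zone definition is in the linked comment, and the quartile method used here is just the default of statistics.quantiles.

```python
from statistics import quantiles


def speed_quartiles(nodes, value_key="request_time"):
    """Q1, Q2, Q3 and IQR over nodes that have a non-zero avg_ping.

    `nodes` is assumed to be a list of dicts with "avg_ping" and
    `value_key` entries; missing or zero values exclude the node.
    """
    values = [
        node[value_key]
        for node in nodes
        if node.get("avg_ping") and node.get(value_key)
    ]
    if len(values) < 2:
        raise ValueError("need at least two usable nodes")
    q1, q2, q3 = quantiles(values, n=4)
    return {"q1": q1, "q2": q2, "q3": q3, "iqr": q3 - q1}


# Made-up sample data; the last node has no ping and is ignored.
sample_nodes = [
    {"avg_ping": 10, "request_time": 100},
    {"avg_ping": 20, "request_time": 200},
    {"avg_ping": 30, "request_time": 300},
    {"avg_ping": 40, "request_time": 400},
    {"avg_ping": 50, "request_time": 500},
    {"avg_ping": 0, "request_time": 9999},
]
zones = speed_quartiles(sample_nodes)
```

The excluded nodes can still be scored afterwards against the resulting zones; they just don't shape them.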

tobiasd commented 6 years ago

But then each directory server has to automatically recalculate the coefficients from time to time--right?

AndyHee commented 6 years ago

Correct, and also its speed score zones.

Here is an example of zones: https://github.com/friendica/dir/issues/43#issuecomment-387473588

AndyHee commented 6 years ago

OK, here is the coefficient. Please excuse this non-standard notation. I hope this makes sense.

Coefficient = (SUM of all x*y) / (SUM of all x^2)

x = avg_ping - (AVERAGE of all avg_ping WHERE avg_ping is NOT zero)
y = request_time - (AVERAGE of all request_time WHERE avg_ping is NOT zero)
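In case it helps with the periodic recalculation, here is a small Python sketch of that formula. The function name and the (avg_ping, request_time) tuple layout are assumptions; nodes with a zero or missing avg_ping, or a missing request_time, are skipped as discussed above.

```python
def ping_coefficient(samples):
    """Slope of request_time over avg_ping, per the formula above.

    `samples` is assumed to be an iterable of (avg_ping, request_time)
    pairs. Pairs with a zero/missing avg_ping or a missing request_time
    are left out of both the averages and the sums.
    """
    usable = [(p, r) for p, r in samples if p and r is not None]
    if len(usable) < 2:
        raise ValueError("need at least two usable nodes")

    mean_ping = sum(p for p, _ in usable) / len(usable)
    mean_request = sum(r for _, r in usable) / len(usable)

    sum_xy = sum((p - mean_ping) * (r - mean_request) for p, r in usable)
    sum_xx = sum((p - mean_ping) ** 2 for p, _ in usable)
    if sum_xx == 0:
        raise ValueError("all avg_ping values are identical")
    return sum_xy / sum_xx


# Made-up data where request_time = 100 + 3 * avg_ping; the zero-ping
# node is ignored, so the coefficient comes out as 3.
coefficient = ping_coefficient([(10, 130), (20, 160), (30, 190), (0, 5000)])
```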

MrPetovan commented 6 years ago

Moved to https://github.com/friendica/friendica-directory/issues/4
