Closed realkinetix closed 6 years ago
I'll admit I didn't expect to have so much fun when I agreed to maintain the Friendica Directory.
Some preliminary results. Here is a comparison of the two directories, one in Western Europe and the other in Southeast Asia.
The datasets were taken at different times and have different total numbers of nodes (181 vs. 270).
Results based on this equation:
`discounted_request_time = request_time - (avg_ping * coefficient)`
The coefficients: dir.friendica.social-20180430 = 3.49712908930528; dir.hubup.pro-20180512 = 0.146000348540368
@MrPetovan the explanation I gave for the coefficient is incorrect. https://github.com/friendica/dir/issues/43#issuecomment-386648866
I'll try to give the correct version shortly. Hope you have not already coded this.
I have noticed there are some duplications in the `base_url` table.
Something like: http://meld.de/ https://meld.de/
But we are quite sure there is only one node running there, despite the difference in protocols.
Even more concerning are duplications of entries with identical protocols. So, for instance, in Hypolite's dataset there are ten (10!) entries for https://libranet.de and about 15 for https://friendica.ladies.community, each with different `request_time` values.
What's going on there and how to fix this?
The behavior is even stranger than you expect. These are the only 15 redundant base_urls in the dir.friendica.social database:
The first issue is that there isn't a UNIQUE key on the base URL. The second issue is that there's no reduction to a normalized URL (without the `https` scheme), which would allow us to rule out HTTP/HTTPS duplicates.
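To illustrate the second point, here is a minimal sketch of what a scheme-less normalization could look like. This is not the directory's actual code: `normalize_url` and `deduplicate` are hypothetical names, and keeping the first URL seen for each key is just an assumption made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(base_url: str) -> str:
    """Return a scheme-less key (lower-cased host, optional port, path)."""
    parts = urlsplit(base_url.strip())
    host = (parts.hostname or "").lower()
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Drop trailing slashes so "meld.de/" and "meld.de" compare equal.
    return host + parts.path.rstrip("/")


def deduplicate(base_urls):
    """Keep the first URL seen for each normalized key, preserving order."""
    seen = {}
    for url in base_urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())


urls = ["http://meld.de/", "https://meld.de/", "https://libranet.de"]
unique_urls = deduplicate(urls)
```

With a key like this stored in its own column, a UNIQUE constraint on that column would also catch the same-protocol duplicates.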
Ohh.. what did we let ourselves in for here... 😲
I think this issue has affected the stats somehow. The coefficients are too different. Could you run this query again with:
WHERE `request_time` IS NOT NULL
AND `avg_ping` IS NOT NULL
We would like to run some further tests. Thanks.
Here you are:
I did deduplicate `base_url`, but I didn't add the `nurl` column. It shouldn't skew the data too much.
Of course, go ahead!
[OK, I deleted some of my redundant posts above]
Ko tested the two new datasets for us and we found some interesting developments. I'm summarising a three-page report here and will give the practical implications.
For "dir.friendica.social" the removal of duplicated nodes seemed to make the relationship between `request_time` and `avg_ping` even stronger. This is good.
However, for "dir.hubup.pro" the data showed there was no relationship between `request_time` and `avg_ping`. The rather different coefficients above (see https://github.com/friendica/dir/issues/43#issuecomment-388536649) were already some indication of this. These two graphs might give you some further idea of the problem.
After removing all servers with a zero `avg_ping` value (and some outliers), we have now established that in the Thai dataset there is also a significant relationship between `request_time` and `avg_ping`. This is good, because it allows us to use the OLS equations as planned.
`discounted_request_time = request_time - (avg_ping * coefficient)`
Here are the coefficients (plus p-values for the likelihood of no relationship): "dir.friendica.social" = 3.316080104 (p = 0.0); "dir.hubup.pro" = 4.965583808 (p = 0.001)
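As a quick worked example of the equation with these two coefficients, here is a small sketch. The sample node (a `request_time` of 1000 and an `avg_ping` of 100, both assumed to be in milliseconds) is made up for illustration, and the function name simply mirrors the equation.

```python
# Coefficients as reported in this thread, keyed by directory.
COEFFICIENTS = {
    "dir.friendica.social": 3.316080104,
    "dir.hubup.pro": 4.965583808,
}


def discounted_request_time(request_time, avg_ping, coefficient):
    """Subtract the latency-related share from the raw request time."""
    return request_time - (avg_ping * coefficient)


# Hypothetical node: 1000 ms request time, 100 ms average ping.
discounted = {
    name: discounted_request_time(1000.0, 100.0, coefficient)
    for name, coefficient in COEFFICIENTS.items()
}
```

For this made-up node the result is roughly 668 ms from the European directory's point of view and roughly 503 ms from the Thai one.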
The calculation of the coefficient and the Q1, Q2, Q3, and IQR values (see here https://github.com/friendica/dir/issues/43#issuecomment-387473588) must exclude all nodes with zero `avg_ping`. These nodes (providing they have a `request_time` that is not zero) will of course still get a health score, but will not contribute to determining the coefficient and speed zones.
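A rough sketch of the exclusion rule for the quartiles, in case it helps. Several things here are assumptions on my part: the dict layout of a node, the `field` parameter (the linked comment defines which value the zones are built on, so it is left configurable), and the "inclusive" quantile method from Python's `statistics` module.

```python
from statistics import quantiles


def speed_quartiles(nodes, field="request_time"):
    """Compute Q1, Q2, Q3 and IQR, ignoring nodes with zero/missing avg_ping."""
    values = [n[field] for n in nodes if n["avg_ping"] and n[field] is not None]
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": q3 - q1}


# Hypothetical sample: the last node has a zero avg_ping and is left out.
nodes = [
    {"request_time": 100, "avg_ping": 10},
    {"request_time": 200, "avg_ping": 12},
    {"request_time": 300, "avg_ping": 15},
    {"request_time": 400, "avg_ping": 20},
    {"request_time": 500, "avg_ping": 25},
    {"request_time": 9999, "avg_ping": 0},
]
zones = speed_quartiles(nodes)
```

The excluded node would still be scored afterwards against the zones computed from the others.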
But then each directory server has to automatically recalculate the coefficients from time to time--right?
Correct, and also its speed score zones.
Here is an example of zones: https://github.com/friendica/dir/issues/43#issuecomment-387473588
OK, here the coefficient. Please excuse this non-standard notation. I hope this makes sense.
`Coefficient` = SUM of all x*y / SUM of all x^2
x = `avg_ping` - (AVERAGE of all `avg_ping` WHERE `avg_ping` is NOT zero)
y = `request_time` - (AVERAGE of all `request_time` WHERE `avg_ping` is NOT zero)
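The same calculation written out as a small sketch, which a directory server could rerun periodically. The node layout (dicts with `avg_ping` and `request_time`) and the function name are assumptions for illustration; nodes with a NULL `request_time` are also skipped, in line with the query filter used earlier in this thread.

```python
def ols_coefficient(nodes):
    """Slope of request_time on avg_ping, ignoring nodes with zero avg_ping."""
    rows = [
        (n["avg_ping"], n["request_time"])
        for n in nodes
        if n["avg_ping"] and n["request_time"] is not None
    ]
    mean_ping = sum(p for p, _ in rows) / len(rows)
    mean_request = sum(r for _, r in rows) / len(rows)
    # x and y are the deviations from the respective averages.
    sum_xy = sum((p - mean_ping) * (r - mean_request) for p, r in rows)
    sum_xx = sum((p - mean_ping) ** 2 for p, _ in rows)
    return sum_xy / sum_xx  # raises ZeroDivisionError if all pings are equal


# Hypothetical data where request_time = 3 * avg_ping + 50,
# plus one zero-ping node that must not influence the result.
nodes = [
    {"avg_ping": 10, "request_time": 80},
    {"avg_ping": 20, "request_time": 110},
    {"avg_ping": 30, "request_time": 140},
    {"avg_ping": 0, "request_time": 5000},
]
coefficient = ols_coefficient(nodes)
```

On this made-up data the function recovers the slope of 3 that was used to generate it.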
As per this thread, servers that are geographically farther from a directory server get lower health scores due to the latency involved.
There's been some discussion on how to adjust the absolute curl timing value, such as:
There may be more that could be done, but perhaps some fairly simple 'corrective' factors around that absolute curl process time would be helpful with health scoring the servers that are further away network-wise.