inconsistency in clients cardinality.

jschaeff commented 1 year ago

(Reported by @vpet98) I noticed some inconsistency, to an extent that I don't know if should be ignored, about the number of clients and HLL objects in the results that the webservice returns.

Try this: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&details=country&format=json And then the same in node level: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&level=node&details=country&format=json

You would expect adding the clients of the results of the second query to be approximately equal to the clients in the first query. But the difference is quite noticeable (first query 78 clients, second query in total 103 clients). And is even worse for countries with more clients (in another example I had 2232 vs 3115 clients).

My SQL query includes this in the select clause: hll_union_agg(dataselect_stats.clients), which has to be correct. Then I use this library: https://github.com/AdRoll/python-hll. And as the library indicates in its README, I print the cardinality like this: HLL.from_bytes(NumberUtil.from_hex(row.clients[2:], 0, len(row.clients[2:]))).cardinality(), for each row that the SQL query returns. 7:50 PM

Could you have a quick look at it if there is time?

jschaeff commented 1 year ago

I checked my other internal statistics database which also uses HLL objects. I don't reproduce the issue.

The first request is one global evaluation, the second is the sum of the evaluations by country.

resif-pgprod.u-ga.fr resifstats@resifstats=> select #hll_union_agg(clients) from dataselect_aggregate_weekly where bucket >= '2023-01-01';
      ?column?      
--------------------
 2597.9111945492627
(1 row)

resif-pgprod.u-ga.fr resifstats@resifstats=> select country,#hll_union_agg(clients) from dataselect_aggregate_weekly where bucket >= '20
23-01-01' group by 1 order by 2;
resif-pgprod.u-ga.fr resifstats@resifstats=> select sum(a.clients) from (select country,#hll_union_agg(clients) as clients from datasele
ct_aggregate_weekly where bucket >= '2023-01-01' group by 1 order by 2) as a;
        sum         
--------------------
 2621.0269278647893

jschaeff commented 1 year ago

But the same check in the EIDAstats dataase is wrong.


resif-pgprod.u-ga.fr eidastats@eidastats=> select #hll_union_agg(clients) from dataselect_stats where date >= '2023-01-01' and country='GR' ;
 ?column? 
----------
       79
(1 row)

resif-pgprod.u-ga.fr eidastats@eidastats=> select sum(a.clients) from (select node_id,#hll_union_agg(clients) as clients from dataselect_stats where date >= '2023-01-01' and country='GR' group by 1) as a;
 sum 
-----
 104

jschaeff commented 1 year ago

So this means that the problem is already in the hll objects in database, not during extraction by the webservice.

One way to try to figure this out would be to reimplement the agregator using another library and, if needed, another language (maybe in java with java-hll). See if we get different results.

jschaeff commented 1 year ago

@vpet98 do you feel like implementing a new aggregator in java and compare the results ?

vpet98 commented 1 year ago

In your other internal statistics database with HLL objects, where this problem doesn't appear, what do you do differently?

jschaeff commented 1 year ago

Aggregation of the events is done internally by postgresql engine. No python-hll lib here, just the postgres-hll extension.

So if we want to validate that python script is not doing something fishy, we have 2 ways:

rewrite the agregator in java and compare results for the same inputs
inject the events in a postgresql table and use inner hll capabilities to generate a new aggregation table.

vpet98 commented 1 year ago

Ok, I'll try the java option. But I need some more info on how to test locally and what exactly the aggregator does after aggregating the statistcs objects (where does it put them).

jschaeff commented 1 year ago

It seems that there is no problem in the HLL values. When you issue requests like:

select sum(a.clients) from (select node_id,#hll_union_agg(clients) as clients from dataselect_stats where date >= '2023-01-01' and country='GR' group by 1) as a;

then you will count clients twice.

EIDA / eida-statistics

inconsistency in clients cardinality. #45