Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Analyses page confusion #3476

Closed (sebastian closed this issue 5 years ago)

sebastian commented 5 years ago

Here is further feedback on the analyses page from Telefonica:

From Friday: image001

From Monday: image003

sebastian commented 5 years ago

Reference email (for Sebastian) in Missive. TL;DR: Feedback for Robert at TLF

obrok commented 5 years ago

> they don't understand what the different column headers actually mean. Solution: provide a description somewhere

Straightforward action item, that I think should be a separate ticket.

> the number of total columns grew over the weekend without any changes to the data source. Could this be the cloak redetermining the schema periodically in the background?

What I could find was (https://github.com/Aircloak/aircloak/blob/master/cloak/lib/cloak/data_source.ex#L114):

  Validates that the Cloak can connect to the data source, and updates the online status of the
  data source. If the data source has been offline, it also has its table definitions refreshed.

so it indeed seems possible that a data source gets rescanned and if columns were added to the underlying database, they will get picked up. I'm not sure why they are surprised, though? Do they not know that columns were added?
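The docstring above suggests the refresh only happens when a data source transitions back from offline to online. A minimal sketch of that behavior (in Python for illustration; `check_data_source`, `connect`, and `fetch_tables` are hypothetical names, not the actual Elixir implementation):

```python
# Hypothetical sketch of the behavior described in data_source.ex:
# validate the connection, update the online status, and refresh table
# definitions only when the data source is coming back from offline.

def check_data_source(ds, connect, fetch_tables):
    """ds is a dict with 'online' and 'tables' keys; connect and
    fetch_tables are callables standing in for the real adapters."""
    was_offline = not ds["online"]
    try:
        connect(ds)  # raises ConnectionError on failure
        ds["online"] = True
    except ConnectionError:
        ds["online"] = False
        return ds
    if was_offline:
        # A rescan after downtime picks up any columns added meanwhile,
        # which would explain the total column count growing.
        ds["tables"] = fetch_tables(ds)
    return ds
```

Under this reading, even a brief connectivity blip over the weekend would trigger a rescan on reconnect and surface any newly added columns.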

> the data source vdc1w_dw has been online for a month, yet has incredibly low numbers of analyses operations completed. Why?

Unclear to me how to approach solving such a problem at this time... Perhaps we can talk on slack.

> from the stats (diff Friday to Monday) it looks like it is going to take ~500 days before the analysis is complete. I think the system would benefit from a) showing a rough estimate of how long it will take based on the average per-column time, and b) showing when the next cycle is likely to start again

Given the refresh period is smaller than 500 days, the analysis will never stop; it will just cycle through the columns. Because of that, while I do agree that showing some indication of how long it's going to take and when it will restart would be nice, I don't think it helps in this particular case.
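For reference, the ~500-day figure presumably comes from extrapolating the Friday-to-Monday diff. A sketch of such an estimate (illustrative Python with made-up numbers; `eta_days` is a hypothetical helper, not an existing function):

```python
# Extrapolate a completion estimate from the number of columns analyzed
# between two snapshots. All concrete numbers here are made up.

def eta_days(done_before, done_after, total_columns, interval_days):
    rate = (done_after - done_before) / interval_days  # columns per day
    if rate <= 0:
        return None  # no observed progress, cannot estimate
    return (total_columns - done_after) / rate

# e.g. 6 columns finished over a 3-day weekend with 1000 still to go
# gives an estimate of 500 days.
print(eta_days(10, 16, 1016, 3))  # -> 500.0
```

As noted above, such an estimate is only meaningful when it is shorter than the refresh period; otherwise the analysis cycles forever.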

sebastian commented 5 years ago

> Unclear to me how to approach solving such a problem at this time... Perhaps we can talk on slack.

This one has been solved (with the update that was made today). The cause was that the `LIMIT` in the shadow db queries caused the queries to become emulated. This in turn led to entire tables being loaded out of the database, which was slow and incredibly memory hungry.

> Given the refresh period is smaller than 500 days, the analysis will never stop; it will just cycle through the columns. Because of that, while I do agree that showing some indication of how long it's going to take and when it will restart would be nice, I don't think it helps in this particular case.

Well, in this case it would have to show that it will never end. So I guess the logic would be:

- if the analysis time is shorter than the repeat time, show when it will start again
- if the analysis time is longer than the repeat time, show that it will likely go on indefinitely

obrok commented 5 years ago

> Well in this case it would have to show that it would be never ending. So I guess the logic would be:
>
> - if analysis time is shorter than repeat time, then show when it will start again
> - if analysis time is longer than repeat time, then show that it will likely go on indefinitely
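The two branches above could be sketched as follows (illustrative Python; the function name and message wording are assumptions, not the actual Air frontend code):

```python
def analysis_status_message(analysis_days, repeat_days):
    """Pick the status line to show for a data source's column analysis."""
    if analysis_days < repeat_days:
        # Finishes before the next cycle: show when it will start again.
        wait = repeat_days - analysis_days
        return f"analysis complete; next cycle starts in ~{wait:.0f} days"
    # Restarts before it can finish: it will effectively run forever.
    return "analysis will likely run indefinitely (a new cycle starts before it finishes)"
```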

So I guess that's what's left in this issue, is that right?

sebastian commented 5 years ago

Correct 👍

sebastian commented 5 years ago

I consider all the relevant and useful features from this issue to be implemented.