MichaIng / DietPi

Lightweight justice for your single-board computer!
https://dietpi.com/
GNU General Public License v2.0

DietPi-Survey | Do not send an empty file when opted out? #5812

Open ngrigoriev opened 1 year ago

ngrigoriev commented 1 year ago

Steps to reproduce

My dietpi.txt has this:

#------------------------------------------------------------------------------------------------------
##### Misc DietPi program settings #####
#------------------------------------------------------------------------------------------------------
# DietPi-Survey: 1=opt in | 0=opt out | -1=ask on first call
# - https://dietpi.com/docs/dietpi_tools/#miscellaneous (see tab 'DietPi Survey')
SURVEY_OPTED_IN=0

Expected behaviour

No data is uploaded

Actual behaviour

I spotted a "curl" command uploading survey data at the very end of the automated setup process!

Joulinar commented 1 year ago

We only upload an empty file. There is no data inside.

ngrigoriev commented 1 year ago

My bad, sorry. I saw the command but I could not find any traces or logs. Maybe DietPi should log this somewhere under /var/tmp/dietpi/ and dump what gets uploaded.

However, even uploading an empty file can still be considered a violation of privacy. My public IP address leaks, the number of installations leaks, the version and probably the hardware model leak... The way I understand OPTOUT is that no information is transmitted at all.

Joulinar commented 1 year ago

Let me link a similar discussion: https://github.com/MichaIng/DietPi/issues/5533#issuecomment-1141404799

MichaIng commented 1 year ago

Aside from the client IP, of course, I'm not sure whether any other metadata is sent with SSH (compared to HTTP). However, while it is something one needs to trust, we do not log anything except the IP when authentication fails (to be able to block it via a fail2ban blackhole route).

As mentioned in the other issue, knowing the overall DietPi system count (not tied to the number of systems per user or IP) is actually beneficial for us.

Note that dietpi-survey only runs after dietpi-software installs and dietpi-update runs, which imply a lot of other network connections using protocols that naturally send a lot of metadata (like HTTP) to APT servers, DNS servers, NTP servers, GitHub and others, depending on the (re)installed software titles. While I understand that one wants to see a principle followed, I still do not see a real privacy issue for an online server system.

ngrigoriev commented 1 year ago

IP, number of installations (potentially). In many jurisdictions, the caller's IP address is considered PII, although it is often a grey zone. In my professional practice, I prefer to think it is PII unless I know it is an internal address or belongs to a business.

Updates are a different story. One can argue that automatic updates are intentional: you do not have to run them. If you want a completely air-gapped installation, you can make your own mirror and update from there. The same goes for FTP, DNS, etc.

And, again, I want to make sure you do not get me wrong: I am not implying that you may not handle this information properly or may misuse it... When dealing with PII, it is usually easier not to touch things you do not absolutely need. Inviting people to fill in a survey after the installation is one thing, allowing them to automatically opt-in is fine too, not allowing people to opt-out completely puts you in the same boat with some large not-to-be-named ones ;)

MichaIng commented 1 year ago

> not allowing people to opt-out completely puts you in the same boat with some large not-to-be-named ones ;)

Well, non-transparently sending a bunch of usage data via various closed-source hidden background services, which cannot be completely blocked without breaking essential parts of the product, is probably not the same boat as sending an empty ping along with an explicit product update or software install. But I get your point.

In the first reimplementation of the survey, there were indeed three options: opt-in/send-data, opt-out/do-nothing and purge-data. So, when opting out, no empty file was uploaded; uploading it was an extra option to overwrite previously uploaded data with the empty file and that way erase it. Deleting a file via SFTP isn't possible without granting the user execute bit (directory listing) permissions, which we wanted to avoid. Merging opt-out and purge-data was at first only done for simplicity, in the code and for users. Now, however, the overall DietPi system count is quite an important stat for us, since "relevance" needs to be proven in some cases, for public mentions and e.g. a Wikipedia article (which is still in draft, as Wikipedia reviewers are hard to convince when they do not use it themselves 😉). If we only had the opted-in counts, we would see just 14% of that total, and the fact that 14% are opted in is another interesting stat. Sadly, I'm not aware of a way to count the overall DietPi systems without implying that at least the public IP is sent.

ngrigoriev commented 1 year ago

Can't this data be based on the number of image downloads instead? With some factor applied. It won't be worse than collecting the public IP addresses because you can't really say if I re-imaged the same board 5 times last week or actually installed it on 5 different ones.
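A minimal sketch of that estimate, assuming a calibration factor taken from a period where both the download count and the install count are known. The function names and all figures below are hypothetical, chosen only for illustration; they are not DietPi statistics.

```python
def calibrate_factor(known_installs: int, known_downloads: int) -> float:
    """Derive installs-per-download from a period where both counts are known."""
    if known_downloads <= 0:
        raise ValueError("known_downloads must be positive")
    if known_installs < 0:
        raise ValueError("known_installs must not be negative")
    return known_installs / known_downloads


def estimate_installs(downloads: int, installs_per_download: float) -> int:
    """Estimate the install count from image downloads and a calibration factor."""
    if downloads < 0 or installs_per_download < 0:
        raise ValueError("inputs must not be negative")
    return round(downloads * installs_per_download)


# Hypothetical numbers: a past period with 30,000 counted installs
# and 50,000 image downloads gives the factor for later periods.
factor = calibrate_factor(known_installs=30_000, known_downloads=50_000)
estimate = estimate_installs(downloads=80_000, installs_per_download=factor)
```

The factor absorbs re-imaging and multi-board installs on average, so no per-system request is needed once it has been calibrated.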

Or have a short survey on the most important documentation (or download) page, which is super easy and harmless (do you like DietPi: yes/no; are you using DietPi: yes/no; comments: ...). And again, this will be a sample; by now you can probably estimate the ratio between the number of image downloads and the installs. And if you release up-to-date images more often, you will get a more dynamic picture.

Or, maybe, just introduce a third option for OPTOUT and make it clear: opt-in sends this data (document what it sends), report-installation sends an empty request for statistical purposes only, opt-out sends nothing. Observe the evolution of the data over time using the previous stats to estimate how many installs are complete opt-outs.
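To illustrate the three-way setting, here is a rough sketch. It assumes the existing SURVEY_OPTED_IN key from dietpi.txt keeps its documented values (1, 0, -1) and adds a hypothetical value 2 for report-installation, so that 0 would send nothing at all. The names SurveyMode, parse_survey_mode and payload_for are made up for this example and are not part of DietPi.

```python
from enum import Enum
from typing import Optional


class SurveyMode(Enum):
    ASK = -1                  # ask on first call (documented in dietpi.txt)
    OPT_OUT = 0               # in this sketch: send nothing at all
    OPT_IN = 1                # send the documented survey data
    REPORT_INSTALLATION = 2   # hypothetical value: empty request, count only


def parse_survey_mode(config_text: str) -> SurveyMode:
    """Read SURVEY_OPTED_IN from dietpi.txt-style text; default to ASK."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines without a key=value pair
        key, _, value = line.partition("=")
        if key.strip() == "SURVEY_OPTED_IN":
            return SurveyMode(int(value.strip()))
    return SurveyMode.ASK


def payload_for(mode: SurveyMode, survey_data: bytes) -> Optional[bytes]:
    """Return the bytes to upload, or None when no connection should be made."""
    if mode is SurveyMode.OPT_IN:
        return survey_data
    if mode is SurveyMode.REPORT_INSTALLATION:
        return b""
    return None  # OPT_OUT or ASK: nothing leaves the machine


mode = parse_survey_mode("SURVEY_OPTED_IN=0\n")
payload = payload_for(mode, b"example survey data")
```

Returning None rather than an empty payload keeps "count me" and "contact nobody" as distinct cases, which is the distinction this option is meant to make explicit.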

We are discussing ethical data collection, I think.

MichaIng commented 1 year ago

Coincidentally, at work we were just made aware of a wave of warning letters sent by law firms to a large number of German website operators that load Google Fonts from Google's CDN, arguing that the dynamic IP is sent to Google, which is known to collect and analyse data in general, conflicting with the GDPR. One Bavarian court granted 100 € compensation, which is mostly discussed as too harsh, since using the Google Fonts CDN is so common, much website software does this by default, notably common WordPress themes, and since only the public IP (mostly dynamic in Germany) is involved.

Of course, Google can potentially do much more with such info, connecting it with their otherwise collected data when the same clients/IPs connect to their various other services and websites (Google Analytics, reCAPTCHA, ads, search, Chrome data, etc.), and more metadata is sent with such a request as well. In turn, I just removed the use of the Google Fonts CDN from our docs: https://github.com/MichaIng/DietPi-Docs/commit/773717b The MkDocs Material theme uses it by default as well.

Not the same topic, but it goes in the same direction. We are not Google and log IPs neither on HTTP connections nor on SFTP connections. However, I see that it is about principles for any data that is not necessary to send, e.g. for visiting a website or doing a download from that website/host.

We'll discuss this topic internally at our meeting tonight.