alligo / joomla-data-mining-and-machine-learning

Joomla! CMS Data Mining SQL Queries examples. Useful to extract data for analysis on external tools
MIT License

Private information on exported files for Data Mining / Machine learning (refs: GDPR and LGPD) #5

Open fititnt opened 3 years ago

fititnt commented 3 years ago

The default SQL queries and documentation in joomla-data-mining-and-machine-learning would mostly publish:

  1. E-mails, the username and the name (Joomla does not enforce real names, but some users may enter them).
  2. Data from users who created an account on the site AND planned to create content (but since we also allow joomla-users.sql, see #4, the full users table would be there).
  3. PLANNED, but not implemented yet:
    • Some special cases, like the User Action Logs (not created yet), which could also be used to detect fraud or misbehavior.
    • Strategies to process server access logs (like Apache and Nginx); these contain at least the user's IP address, which can be used for fraud detection.

With all this in mind, while private uses may allow outputting the full name, email, and IP, I think the default output should at least take reasonable steps to avoid simply exposing identifiable data.

Affected SQL exported files

Note: this list only covers the files that exist at the time this issue was written.

joomla-users.sql

See this comment: https://github.com/alligo/joomla-data-mining-and-machine-learning/issues/4#issuecomment-765881168. For now, joomla-users.sql v1.1 hides only user_name by default, while both user_username and user_email are still there.

joomla-content-*.sql

All queries that output articles also reference the user. If joomla-users.sql adopts some way to abstract the user, the default strategy should be consistent across these files as well.

Strategies to avoid exposing private information by default

[full anonymization] Manually crafted identifier per project, keeping the references private

Maybe the ideal solution for serious projects is to not use hashes based on any personal information at all, since in some special cases hashes could be used to reconstruct the original data (explained further below).

With this strategy, the person who exports the dataset would craft a dedicated table with a non-reversible mapping between the user_id and whatever the anonymized identifier is. This is likely to be considered full anonymization, not pseudonymization.
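As a rough illustration of the idea (not something this project ships), the mapping could be built from random identifiers that are not derived from any personal data. The function names `build_private_mapping` and `anonymize_rows`, the row layout, and the 16-character identifier length are all assumptions made for this sketch. The mapping itself stays private and is never published with the export.

```python
import secrets


def build_private_mapping(user_ids):
    """Assign each user_id a random identifier unrelated to any personal data.

    Hypothetical helper: the returned dict must be kept private (or destroyed)
    and never shipped with the exported dataset.
    """
    mapping = {}
    used = set()
    for user_id in user_ids:
        token = secrets.token_hex(8)  # 16 hex characters, purely random
        while token in used:  # extremely unlikely, but keep identifiers unique
            token = secrets.token_hex(8)
        used.add(token)
        mapping[user_id] = token
    return mapping


def anonymize_rows(rows, mapping):
    """Return copies of the rows with user_id replaced by its random identifier."""
    return [{**row, "user_id": mapping[row["user_id"]]} for row in rows]


# Hypothetical exported rows, for illustration only.
rows = [
    {"user_id": 42, "article_count": 3},
    {"user_id": 43, "article_count": 0},
]
mapping = build_private_mapping(row["user_id"] for row in rows)
public_rows = anonymize_rows(rows, mapping)
```

Because the identifiers are random, nothing in `public_rows` can be recomputed from a name, e-mail, or user_id without access to the private mapping.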

But for this project, maybe we just document that this is an option.

Just use the user.id

Maybe one strategy could be to simply use whatever user_id the site is already using.

Hash based pseudonymization

Pseudonymization can be a good default strategy.
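A minimal sketch of what hash-based pseudonymization could look like, assuming a keyed hash (HMAC-SHA256) with a per-project secret rather than a plain hash of the e-mail. The function name `pseudonymize`, the normalization step, and the example key are assumptions for illustration, not something the current SQL files implement.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a stable pseudonym for a personal value such as an e-mail.

    Hypothetical helper: a keyed hash is used so that someone without the
    secret cannot rebuild the pseudonyms just by hashing a list of guesses.
    """
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Placeholder key for illustration; a real project would keep its own secret private.
secret_key = b"replace-with-a-long-random-per-project-secret"

a = pseudonymize("Alice@example.org", secret_key)
b = pseudonymize(" alice@example.org ", secret_key)
c = pseudonymize("alice@example.org", b"another-project-secret")
```

The same input always gives the same pseudonym within one project, so joins between exported files keep working, while a different secret produces unrelated pseudonyms. Anyone holding the secret can still re-link the data, which is why this counts as pseudonymization rather than anonymization.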