Data Mining and Machine Learning references for Joomla! CMS [v0.4]

The most common use case of this project is that your organization (or clients of your company) have one or more websites SO BIG they help to understand how the content is organized and then how to generate more value.

PROTIP 1: While new people to both machine learning and usage in the real world are likely to assume the value to deliver with data mining for large websites is forecasting and replicating success cases, often the best value is discovering patterns already existing and fixing small issues with high impact. In other words: you learn how humans produce/optimize content (even if it means ask them instead of analize the data data), then discover outliers that are not as perfect. The next step is, based on data visualization, help the humans to self-improvement with new feedback.

PROTIP 2: if you're really into forecasting from the PROTIP 1, machine learning can (and often will) works better even with small (but well representative) list of items (think like 100 rows) than 1.000's to 100.000's but poorly prepared data.

1. Overview
2. Data extraction, preparation and cleaning
3. Data Mining and Machine learning ("understanding the data")
- 3.1. Know limitations
- 3.2. Softwares
4. Patterns often found on (large) websites to be aware
- 4.1 Spammers
- 4.2 Untagged items

1. Overview

TL;DR of Concepts

Joomla! is an award-winning content management system (CMS), which enables you to build web sites and powerful online applications.

Data mining (in Portuguese, Mineração de dados) is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

The joomla-data-mining-and-machine-learning gives an general idea of how to do data mining and machine learning with websites, but dedicate special attention to implementation of non-generic tasks, like SQL queries, to an typical Joomla! CMS installation.

What this reference is NOT:

Do not mistake Data mining with web scraping.
- This project requires administrative access (like database access and server access logs).
Do not mistake this project by something that you "install" on an site
- Some people search for "Data mining WordPress" or "Data Mining Joomla" asking for something to install and automatically "enable data mining" "enable machine learning" on the website.
- Ok, stricly speaking, yes, one discussed source of data, Google Analytics, can be installed, but this project is have broader concept.
Do not confuse this guide with something whose final result runs automatically and give insights.
- While some parts can be automated and even individual softwares can bruteforce insights, an human would still need to glue different tools and discard irrelevant or obvious conclusions and focus on what can be useful
  - _"Wow, Orange Data Mining discoreverd that content_created_date can predict content_hits"_ is the same as one machine discover that as older as one content is, more likely to have an total number of views. But is this useful?

2. Data extraction, preparation and cleaning

2.1. Data extraction from the Joomla site database ("SQL queries")

Directory: sql/
Database engine: MySQL-like
- Replace 'JOSTABLEPREFIX_' with your real database prefix.
- For Joomla CMS sites using PostgreSQL or SQL Server/SQL Azure you need some small changes on syntax, but field names are granted to be compatible.

2.1.1. Output for the reference SQL queries

Directory: output/

Each data mining project is unique, not just by project per organization, but by point in time. Yet this project contains examples of data output when using the implementation with the SQL queries.

2.2. Data extraction from Joomla server logs ("Apache/NGinx access logs")

Note: this section is a minimal draft. Since the implementation is more generic than the use with Joomla CMS, it may not be implemented at all on future releases. But it still mentioned here, since is an potential source of data.

2.3. Data extraction from Google Analytics and Google Search Console

Note: since this topic is generic than the use with only Joomla CMS websites, it's here more as a quick reference. In fact, it would be incomplete to at least not mention one source of data that is very likely to be already used on average sites in production on a reference related to data mining for Joomla.

Both Google Analytics and Google Search Console can be used a source of data. While both (in special for very big sites with pages with low number of access) are likely to not be as structured to get an full picture as the context extracted from the database export, it still allow to compare data.

2.3.1. Google Analytics

Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs

Go to https://analytics.google.com/. Select the account of the website
On the sidebar, click Behavior (in Portuguese, Comportamento)
Click on Site Content (in Portuguese, Conteúdo do site)
Click on All Pages (in Portuguese, Todas as Páginas)
Change how many list items to show; recommented: 5000.
- Exported items are exactly the ones previewed on screen.
- When do exist more than 5000 items, click to see next page. Each page needs to be exported.
Personalize the date range to show the table on the top of the screen, right side; recommended: at least a few years ago until current date.
- Keep in mind that older pages may not be accessed by users on recent months. For big sites, this means you will miss parts of your site
- But also keep in mind that older dates will return inexistent pages today
On the top of the screen, right side, click Export.
- Excel (XLSX) or Google Spreadsheets can be a good format to start.

2.3.2. Google Search Console

The steps are similar to Google Analytics, but with the limitation of 1000 URLs. Also an single Excel (XLSX) exported file have more content from Google Analytics and is

Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs

Go to https://search.google.com/search-console/. Select the account of the website
On the sidebar, click Performance (in Portuguese, Desempenho)
There is no need to change how many itens to export (the maximum, non-configurable, is 1000)
Personalize the date rang;, the older you can get, the better.
On the top of the screen, right side, click Export.
- Excel (XLSX) or Google Spreadsheets can be a good format to start.

2.4. Helper scripts

At the moment the bin/, even for an working draft, have no significant content.

3. Data Mining and Machine learning ("understanding the data")

At the moment this section is an draft.

3.1. Know limitations

Data extraction from the Joomla site database ("SQL queries") fields named content_hits, category_hits and tag_hits, while very useful to get an overview without need to relate with other sources (like Google Analytics and Apache/Nginx access logs) are affected by caching mechanisms (both standard Joomla Caching and full page caches like Varnish). Do not ask to disable caches as this affects performance, in particular for big sites. But, as rule of trump, when you confirm that cache do affect these variables:
- The real value is equal or higher than exported one (sometimes much higher)
  - Pages with disproportionate number of access in short times will have significantly less hits. (think articles that viralize)
  - Pages with infrequent (but constant access over the same day or week) may not be affected at all by cache.
- The content_hits , category_hits and tag_hits still worth to be used both for small benchmarks and (for serious reports) needs extra testing for detect the impact
Data from Google Analytics and Google Search Console are limited to page URls most accessed by users and evolution over time (including URLs changes)
- (TODO: explain more on this)

3.2. Softwares

4. Patterns often found on (large) websites to be aware

TODO: this entire section is an working draft

4.1 Spammers

For sites that allow users to create accounts (even if there is no registration link on the site, but the CMS allows create accounts) the exported data will show trends of non-natural humans registered.

4.2 Untagged items

TODO: add content

License

MIT License.

SPDX-License-Identifier: MIT

alligo / joomla-data-mining-and-machine-learning

readme