alligo / joomla-data-mining-and-machine-learning

Joomla! CMS Data Mining SQL Queries examples. Useful to extract data for analysis on external tools
MIT License
1 stars 0 forks source link
data-analysis data-mining joomla

Data Mining and Machine Learning references for Joomla! CMS [v0.4]

The most common use case of this project is that your organization (or clients of your company) have one or more websites SO BIG they help to understand how the content is organized and then how to generate more value.

PROTIP 1: While new people to both machine learning and usage in the real world are likely to assume the value to deliver with data mining for large websites is forecasting and replicating success cases, often the best value is discovering patterns already existing and fixing small issues with high impact. In other words: you learn how humans produce/optimize content (even if it means ask them instead of analize the data data), then discover outliers that are not as perfect. The next step is, based on data visualization, help the humans to self-improvement with new feedback.

PROTIP 2: if you're really into forecasting from the PROTIP 1, machine learning can (and often will) works better even with small (but well representative) list of items (think like 100 rows) than 1.000's to 100.000's but poorly prepared data.



1. Overview

TL;DR of Concepts

  • Joomla! is an award-winning content management system (CMS), which enables you to build web sites and powerful online applications.
  • Data mining (in Portuguese, Mineração de dados) is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
  • The joomla-data-mining-and-machine-learning gives an general idea of how to do data mining and machine learning with websites, but dedicate special attention to implementation of non-generic tasks, like SQL queries, to an typical Joomla! CMS installation.

What this reference is NOT:

2. Data extraction, preparation and cleaning

2.1. Data extraction from the Joomla site database ("SQL queries")

2.1.1. Output for the reference SQL queries

Each data mining project is unique, not just by project per organization, but by point in time. Yet this project contains examples of data output when using the implementation with the SQL queries.

2.2. Data extraction from Joomla server logs ("Apache/NGinx access logs")

Note: this section is a minimal draft. Since the implementation is more generic than the use with Joomla CMS, it may not be implemented at all on future releases. But it still mentioned here, since is an potential source of data.

2.3. Data extraction from Google Analytics and Google Search Console

Note: since this topic is generic than the use with only Joomla CMS websites, it's here more as a quick reference. In fact, it would be incomplete to at least not mention one source of data that is very likely to be already used on average sites in production on a reference related to data mining for Joomla.

Both Google Analytics and Google Search Console can be used a source of data. While both (in special for very big sites with pages with low number of access) are likely to not be as structured to get an full picture as the context extracted from the database export, it still allow to compare data.

2.3.1. Google Analytics

Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs

  1. Go to https://analytics.google.com/. Select the account of the website
  2. On the sidebar, click Behavior (in Portuguese, Comportamento)
  3. Click on Site Content (in Portuguese, Conteúdo do site)
  4. Click on All Pages (in Portuguese, Todas as Páginas)
  5. Change how many list items to show; recommented: 5000.
    • Exported items are exactly the ones previewed on screen.
    • When do exist more than 5000 items, click to see next page. Each page needs to be exported.
  6. Personalize the date range to show the table on the top of the screen, right side; recommended: at least a few years ago until current date.
    • Keep in mind that older pages may not be accessed by users on recent months. For big sites, this means you will miss parts of your site
    • But also keep in mind that older dates will return inexistent pages today
  7. On the top of the screen, right side, click Export.
    • Excel (XLSX) or Google Spreadsheets can be a good format to start.

2.3.2. Google Search Console

The steps are similar to Google Analytics, but with the limitation of 1000 URLs. Also an single Excel (XLSX) exported file have more content from Google Analytics and is

Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs

  1. Go to https://search.google.com/search-console/. Select the account of the website
  2. On the sidebar, click Performance (in Portuguese, Desempenho)
  3. There is no need to change how many itens to export (the maximum, non-configurable, is 1000)
  4. Personalize the date rang;, the older you can get, the better.
  5. On the top of the screen, right side, click Export.
    • Excel (XLSX) or Google Spreadsheets can be a good format to start.

2.4. Helper scripts

At the moment the bin/, even for an working draft, have no significant content.

3. Data Mining and Machine learning ("understanding the data")

At the moment this section is an draft.

3.1. Know limitations

  1. Data extraction from the Joomla site database ("SQL queries") fields named  content_hits, category_hits and tag_hits, while very useful to get an   overview without need to relate with other sources (like Google Analytics   and Apache/Nginx access logs) are affected by caching mechanisms (both standard   Joomla Caching and full page caches like Varnish). Do not ask to disable   caches as this affects performance, in particular for big sites. But, as rule   of trump, when you confirm that cache do affect these variables:
    • The real value is equal or higher than exported one (sometimes much higher)
      • Pages with disproportionate number of access in short times will have           significantly less hits. (think articles that viralize)
      • Pages with infrequent (but constant access over the same day or week)           may not be affected at all by cache.
    • The content_hits , category_hits and tag_hits still worth to be       used both for small benchmarks and (for serious reports) needs extra       testing for detect the impact
  2. Data from Google Analytics and Google Search Console are limited to   page URls most accessed by users and evolution over time (including URLs changes)
    • (TODO: explain more on this)

3.2. Softwares

3.2.1. LibreOffice Calc / Microsoft Excel / Google Spreadsheet

3.2.2. Orange Data Miner

3.2.3. Weka, "The workbench for machine learning"

4. Patterns often found on (large) websites to be aware

TODO: this entire section is an working draft

4.1 Spammers

For sites that allow users to create accounts (even if there is no registration link on the site, but the CMS allows create accounts) the exported data will show trends of non-natural humans registered.

4.2 Untagged items

TODO: add content

License

MIT License.

SPDX-License-Identifier: MIT