The most common use case for this project is that your organization (or your company's clients) has one or more websites so big that you need help to understand how the content is organized and then how to generate more value from it.
PROTIP 1: People who are new to both machine learning and its real-world usage are likely to assume that the value data mining delivers for large websites is forecasting and replicating success cases. Often the best value is in discovering patterns that already exist and fixing small issues with high impact. In other words: you learn how humans produce and optimize content (even if that means asking them instead of analyzing the data), then discover outliers that are not as polished. The next step is to use data visualization to help those humans improve on their own with the new feedback.
PROTIP 2: if you are really into the forecasting mentioned in PROTIP 1, machine learning can (and often will) work better with a small but representative list of items (think 100 rows) than with thousands to hundreds of thousands of rows of poorly prepared data.
TL;DR of Concepts
- Joomla! is an award-winning content management system (CMS), which enables you to build web sites and powerful online applications.
- Data mining (in Portuguese, Mineração de dados) is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
- The joomla-data-mining-and-machine-learning project gives a general idea of how to do data mining and machine learning with websites, but dedicates special attention to the implementation of non-generic tasks, like SQL queries, for a typical Joomla! CMS installation.
What this reference is NOT:
Discovering that content_created_date can predict content_hits is the same as a machine discovering that the older a piece of content is, the more likely it is to have a high total number of views. But is this useful?
Each data mining project is unique, not just per organization, but per point in time. Still, this project contains examples of data output produced by running the implementation with the SQL queries.
Note: this section is a minimal draft. Since the implementation is more generic than its use with Joomla CMS, it may not be implemented at all in future releases. It is still mentioned here because it is a potential source of data.
Note: since this topic is more generic than use with Joomla CMS websites only, it is here mostly as a quick reference. A reference on data mining for Joomla would be incomplete without at least mentioning a source of data that is very likely already in use on the average production site.
Both Google Analytics and Google Search Console can be used as a source of data. Both (especially for very big sites with many rarely accessed pages) are unlikely to be structured enough to give a full picture the way the context extracted from the database export does, but they still allow you to compare data.
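The comparison mentioned above can be sketched in Python. This is a minimal illustration, not part of the project's implementation: the column names (url, Page, Pageviews) and the sample rows are assumptions made up for the example, and a real export may need extra cleanup (header comments, thousands separators, absolute versus relative URLs) before it can be joined this way.

```python
import csv
import io


def load_column(csv_text, key_column, value_column):
    """Read one CSV export and return a {key: int(value)} mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_column]: int(row[value_column]) for row in reader}


def compare_sources(db_hits, analytics_views):
    """Join two {url: count} mappings on URL.

    URLs missing from one source are kept, with None on that side,
    because a missing URL is itself worth investigating.
    """
    rows = []
    for url in sorted(set(db_hits) | set(analytics_views)):
        rows.append((url, db_hits.get(url), analytics_views.get(url)))
    return rows


# Hypothetical sample data; real files would be read from disk instead.
DB_EXPORT = """url,content_hits
/news/a,120
/news/b,15
/about,300
"""

ANALYTICS_EXPORT = """Page,Pageviews
/news/a,480
/about,310
/landing,50
"""

db = load_column(DB_EXPORT, "url", "content_hits")
ga = load_column(ANALYTICS_EXPORT, "Page", "Pageviews")

for url, hits, views in compare_sources(db, ga):
    print(url, hits, views)
```

Keeping the URLs that appear in only one source is a deliberate choice here: pages that exist in the database but never show up in the analytics export (or the other way around) are often the outliers worth a closer look.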
Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs
- Open Behavior (in Portuguese, Comportamento) > Site Content (in Portuguese, Conteúdo do site) > All Pages (in Portuguese, Todas as Páginas).
- Choose Export. Excel (XLSX) or Google Spreadsheets can be a good format to start.
The steps are similar to Google Analytics, but with a limit of 1000 URLs. Also, a single exported Excel (XLSX) file has more content from Google Analytics and is
Export to Excel XLSX, CSV or Google Spreadsheet list of all URLs
- Open Performance (in Portuguese, Desempenho).
- Choose Export. Excel (XLSX) or Google Spreadsheets can be a good format to start.
At the moment bin/, even as a working draft, has no significant content.
At the moment this section is a draft.
content_hits, category_hits and tag_hits are very useful to get an overview without needing to relate them to other sources (like Google Analytics and Apache/Nginx access logs), but they are affected by caching mechanisms (both standard Joomla caching and full page caches like Varnish). Do not ask to disable caches, as this hurts performance, in particular for big sites. As a rule of thumb, even when you confirm that the cache does affect these variables, content_hits, category_hits and tag_hits are still worth using for small benchmarks; for serious reports they need extra testing to detect the impact.
TODO: this entire section is a working draft
For sites that allow users to create accounts (even if there is no registration link on the site, as long as the CMS allows account creation), the exported data will show trends of registrations that were not made by real humans.
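One way to look for such trends is to count registrations per day and flag days that stand far above the typical volume. The sketch below is only an assumption about how this could be done, not the project's method: it expects a list of timestamps such as the values of the registerDate column of the Joomla users table, the threshold factor of 5 is arbitrary, and the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def registrations_per_day(register_dates):
    """Count accounts per calendar day from 'YYYY-MM-DD HH:MM:SS' strings."""
    days = (
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S").date()
        for value in register_dates
    )
    return Counter(days)


def suspicious_days(per_day, factor=5):
    """Return the days whose count exceeds `factor` times the median daily count.

    The factor is an arbitrary starting point and should be tuned per site.
    """
    if not per_day:
        return []
    typical = median(per_day.values())
    return sorted(day for day, count in per_day.items() if count > factor * typical)


# Hypothetical sample: a burst of 12 sign-ups within seconds on 2020-01-03.
SAMPLE = (
    ["2020-01-01 09:15:00", "2020-01-02 14:02:00"]
    + ["2020-01-03 03:00:%02d" % second for second in range(12)]
    + ["2020-01-04 10:30:00", "2020-01-04 18:45:00"]
)

per_day = registrations_per_day(SAMPLE)
flagged = suspicious_days(per_day)
print(flagged)
```

A flagged day is only a hint: a marketing campaign can produce the same spike as a wave of automated sign-ups, so the result still needs a human to check the accounts involved.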
TODO: add content
SPDX-License-Identifier: MIT