alligo / joomla-data-mining-and-machine-learning

Joomla! CMS Data Mining SQL Queries examples. Useful to extract data for analysis on external tools
MIT License

Nginx access log conversion to CSV format #6

Open fititnt opened 3 years ago

fititnt commented 3 years ago

Besides data from the database itself (and leaving aside external sources like Google Analytics and Google Search Console), extracting information from the access logs of an Nginx server (and later an Apache server) seems to be a common need.

The common data mining programs have no native importer for either Apache or Nginx access log files. One quick and dirty way is to open the file with LibreOffice using space as the field separator and to ignore that the datetime inside [] gets split into two columns; this actually works somewhat OK. BUT LibreOffice Calc, like Excel, has a limit of about 1 million rows (1,048,576), and that is not hard to reach for a busy site, especially when every asset request (images, CSS and JS) is logged and a single page view can generate more than 100 of them.
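To make the split-timestamp problem concrete, here is a small sketch of what a plain space split does to one line. The sample line is made up and assumes Nginx's default `combined` format.

```python
# Hypothetical sample line in Nginx's default "combined" format.
line = (
    '203.0.113.7 - - [10/Oct/2020:13:55:36 +0000] '
    '"GET /index.php HTTP/1.1" 200 612 "-" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# Splitting on every space, with no awareness of [] or quotes.
columns = line.split(' ')

# The timestamp ends up in two separate columns.
print(columns[3])  # [10/Oct/2020:13:55:36
print(columns[4])  # +0000]
```

A spreadsheet importer that treats `"` as a text delimiter keeps the quoted fields together, but it knows nothing about the square brackets, which is why only the timestamp breaks there.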

On a quick look, I did not find a simple way to just do a quick conversion. Some tools, like the fantastic goaccess (there is also nginxtop, and other tools like it), are able to parse Nginx/Apache files, but their export feature only covers already aggregated results. In other words, these tools do all the calculation themselves; they don't allow you to simply convert Apache and Nginx access files into something that other tools can work on.
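A minimal sketch of the kind of row-by-row conversion meant here, assuming Nginx's default `combined` log format. The function name `access_log_to_csv`, the pattern and the column names are hypothetical and are not taken from any existing tool; a custom `log_format` would need a different pattern.

```python
import csv
import io
import re

# Assumption: default "combined" format, i.e.
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)
FIELDS = [
    'remote_addr', 'remote_user', 'time_local', 'request',
    'status', 'body_bytes_sent', 'http_referer', 'http_user_agent',
]


def access_log_to_csv(lines, out):
    """Write one CSV row per parsable line; return how many lines were skipped."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    skipped = 0
    for line in lines:  # streams line by line, so file size is not a limit
        match = COMBINED.match(line)
        if match is None:
            skipped += 1
            continue
        writer.writerow(match.groupdict())
    return skipped


# Usage with in-memory data; a real run would pass open file objects instead.
sample = [
    '::1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 612 "-" "Mozilla/5.0"\n',
    'not a log line\n',
]
buffer = io.StringIO()
skipped = access_log_to_csv(sample, buffer)
print(buffer.getvalue())
```

Unmatched lines are counted rather than raising an error, so one odd line does not abort a long conversion.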

fititnt commented 3 years ago

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. http://regex.info/blog/2006-09-15/247

Oh boy

fititnt commented 3 years ago

Maybe I will eventually move nginxlogs2csv (and later, likely, an apachelogs2csv) to a dedicated GitHub repository.

But for now, I understand why it is hard to find more than a bunch of regexes and small scripts to parse Nginx and Apache logs: there is a lot of difference between implementations. I think I will even implement more than one parsing strategy, including one that literally falls back to just splitting out the IP and the date, because simply recommending that people change the script themselves would require them to know not only some Python, but Python and regex.
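A rough sketch of what such a fallback could look like, written without regex on purpose. The names `split_ip_and_date` and `first_match` are hypothetical, and it assumes the client address is the first space-separated token and the timestamp is the first `[...]` group on the line.

```python
def split_ip_and_date(line):
    """Last-resort parser: return (ip, date) or None, ignoring all other fields."""
    ip, _, rest = line.partition(' ')
    start = rest.find('[')
    end = rest.find(']', start + 1)
    if not ip or start == -1 or end == -1:
        return None
    return ip, rest[start + 1:end]


def first_match(line, strategies):
    """Try each parsing strategy in order and return the first non-None result."""
    for strategy in strategies:
        result = strategy(line)
        if result is not None:
            return result
    return None


sample = '::1 - - [10/Oct/2020:13:55:36 +0000] "GET / HTTP/1.1" 200 5 "-" "curl/7.68.0"'
# Stricter, format-specific parsers would be listed before the fallback.
print(first_match(sample, [split_ip_and_date]))
```

Keeping each strategy as a plain function that returns `None` on failure means a new format can be supported by adding a function to the list, without touching the existing ones.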

And in some places, a regex ONLY for IPv4 and IPv6 addresses (just one of the fields in each line) is already bigger than the first full-line regex that is easy to find when searching for this subject (which, by the way, failed on my test case, ::1).
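One way to sidestep the address regex entirely, as a sketch: take whatever token sits in the address position and let Python's standard `ipaddress` module decide whether it is valid. The helper name `ip_version` is hypothetical.

```python
import ipaddress


def ip_version(token):
    """Return 4 or 6 for a valid IP address string, or None otherwise."""
    try:
        return ipaddress.ip_address(token).version
    except ValueError:
        return None


print(ip_version('::1'))          # 6
print(ip_version('192.168.0.1'))  # 4
print(ip_version('example.com'))  # None
```

With this approach the line-level pattern only needs something loose like `\S+` for the address field, and validation happens after the split.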