eldy / AWStats

AWStats Log Analyzer project (official sources)
https://www.awstats.org
369 stars 120 forks source link

GeoIP2 trys to write characters outside of ISO-8859 #127

Closed manuelm closed 5 years ago

manuelm commented 5 years ago

This is related to #69 but targets GeoIP2 modules. Mainly GeoIP2::City. GeoIP2 names are UTF-8 encoded. Therefor they can contain characters outside of the ISO-8859 range. Since AWStats data files are stored in ISO the GeoIP output has to be escaped. As in #69 I suggest html escaping them, either using HTML::Entities or a stripped down version which only escapes high bytes:

sub html_escape {
  my $s = shift;
  $s =~ s/([^\x00-\x7F])/sprintf "&#x%X;", ord($1)/ge;
  return $s;
}

Regarding geoip2_city.pm the function call should be added in SectionWriteHistory_geoip2_city so that XMLEncodeForHisto($_) becomes XMLEncodeForHisto(html_escape($_)) (or XMLEncodeForHisto(encode_entities_numeric($_))).

For geoip2.pm this isn't strictly necessary as I'm not aware of any international countries names that contain characters outside the ISO range.

195.38.12.0 is an ipaddress for testing. GeoIP2::City tries to store PL_Zamość_Lublin. With escaping this gets converted to PL_Zamość_Lublin.

manuelm commented 5 years ago

Forgot about loading: Of course we also need to unescape the characters on loading/updating the data file. So e.g. in SectionReadHistory_geoip2_city we add decode_entities($field[0]); after @field=split(/\s+/,($xmlold?XMLDecodeFromHisto($_):$_));.

manuelm commented 5 years ago

Here is a better approach without the need of decoding and HTML::Entities as extra dependency. We simply store the escaped name in the hash:

--- b/wwwroot/cgi-bin/plugins/geoip2_city.pm.orig 2019-04-19 12:36:02.026552366 +0200
+++ a/wwwroot/cgi-bin/plugins/geoip2_city.pm  2019-04-19 12:35:52.906632209 +0200
@@ -29,6 +29,11 @@
 #use strict;
 no strict "refs";

+sub html_escape {
+ my $s = shift;
+ $s =~ s/([^\x00-\x7F])/sprintf "&#x%X;", ord($1)/ge;
+ return $s;
+}

 #-----------------------------------------------------------------------------
@@ -369,7 +374,7 @@
             my $countrycity=$record->country()->iso_code().'_'.$record->city()->name();
             $countrycity=~s/ /%20/g;
             if ($record->most_specific_subdivision()->name()) { $countrycity.='_'.$record->most_specific_subdivision()->name(); }
-            $_city_h{lc($countrycity)}++;
+            $_city_h{html_escape(lc($countrycity))}++;
         } else {
             $_city_h{'unknown'}++;
         }
@@ -409,7 +414,7 @@
             my $countrycity=($record->country()->iso_code()).'_'.$city;
             $countrycity=~s/ /%20/g;
             if ($record->most_specific_subdivision()->name()) { $countrycity.='_'.$record->most_specific_subdivision()->name(); }
-            $_city_h{lc($countrycity)}++;
+            $_city_h{html_escape(lc($countrycity))}++;
         } else {
             $_city_h{'unknown'}++;
         }