fisharebest / webtrees

Online genealogy
https://webtrees.net
GNU General Public License v3.0
454 stars 298 forks

2.1.15 Uncaught HttpBadRequestException: Invalid UTF-8 characters caused by old searchbot requests #4712

Open FrankWarius opened 1 year ago

FrankWarius commented 1 year ago

I get a lot of 500 errors, e.g. from Bing using old links that contain umlauts, such as https://wbt.warius.info/tree/Warius/branches/lammh%C3%B6fer. Could you please respond with a 404 error instead?

Uncaught Fisharebest\Webtrees\Http\Exceptions\HttpBadRequestException: Invalid UTF-8 characters in request in D:\web\WT21Git\webtrees\app\Validator.php:67
Stack trace:
#0 [internal function]: Fisharebest\Webtrees\Validator::Fisharebest\Webtrees\{closure}('lammh\xF6fer', 'surname')
#1 D:\web\WT21Git\webtrees\app\Validator.php(71): array_walk_recursive(Array, Object(Closure))
#2 D:\web\WT21Git\webtrees\app\Validator.php(85): Fisharebest\Webtrees\Validator->__construct(Array, Object(Nyholm\Psr7\ServerRequest), 'UTF-8')
#3 D:\web\WT21Git\webtrees\app\Http\Middleware\HandleExceptions.php(155): Fisharebest\Webtrees\Validator::attributes(Object(Nyholm\Psr7\ServerRequest))
#4 D:\web\WT21Git\webtrees\app\Http\Middleware\HandleExceptions.php(99): Fisharebest\Webtrees\Http\Middleware\HandleExceptions->httpExceptionResponse(Object(Nyholm\Psr7\ServerRequest), Object(Fisharebest\Webtrees\Http\Exceptions\HttpBadRequestException))
#5 D:\web\WT21Git\webtrees\vendor\oscarotero\middleland\src\Dispatcher.php(136): Fisharebest\Webtrees\Http\Middleware\HandleExceptions->process(Object(Nyholm\Psr7\ServerRequest), Object(Middleland\Dispatcher))

fisharebest commented 1 year ago

This isn't a problem on the demo server. The URL is valid UTF-8 and is recognised OK.

https://dev.webtrees.net/demo-dev/tree/demo/branches/lammh%C3%B6fer

My guess is that the validation error is occurring on one of the HTTP request headers.

Control panel -> Server information -> PHP Variables.

Are there any "interesting" $_SERVER variables? Perhaps your server is adding geo-lookup headers, and using invalid characters here?

FrankWarius commented 1 year ago

I don't think any headers are being added; it's a native IIS 10 installation.

Variable | Value
-- | --
$_COOKIE['__Secure-WT-ID'] | 2e24ba5eb497d1bf0ec0132bacf8f5c5
$_SERVER['_FCGI_X_PIPE_'] | \\.\pipe\IISFCGI-1e736672-8688-4dea-8879-a9feb4557a83
$_SERVER['PHPRC'] | C:\PHPEnv\PHPini\
$_SERVER['PHP_FCGI_MAX_REQUESTS'] | 10000
$_SERVER['ALLUSERSPROFILE'] | C:\ProgramData
$_SERVER['APPDATA'] | C:\Windows\system32\config\systemprofile\AppData\Roaming
$_SERVER['APP_POOL_CONFIG'] | C:\inetpub\temp\apppools\WTProd\WTProd.config
$_SERVER['APP_POOL_ID'] | WTProd
$_SERVER['CommonProgramFiles'] | C:\Program Files\Common Files
$_SERVER['CommonProgramFiles(x86)'] | C:\Program Files (x86)\Common Files
$_SERVER['CommonProgramW6432'] | C:\Program Files\Common Files
$_SERVER['COMPUTERNAME'] | SRV23-5DP-DE
$_SERVER['ComSpec'] |
$_SERVER['DriverData'] |
$_SERVER['LOCALAPPDATA'] |
$_SERVER['NUMBER_OF_PROCESSORS'] | 4
$_SERVER['OS'] | Windows_NT
$_SERVER['Path'] |
$_SERVER['PATHEXT'] | .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
$_SERVER['PROCESSOR_ARCHITECTURE'] | AMD64
$_SERVER['PROCESSOR_IDENTIFIER'] | Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
$_SERVER['PROCESSOR_LEVEL'] | 6
$_SERVER['PROCESSOR_REVISION'] | 5504
$_SERVER['ProgramData'] | C:\ProgramData
$_SERVER['ProgramFiles'] | C:\Program Files
$_SERVER['ProgramFiles(x86)'] | C:\Program Files (x86)
$_SERVER['ProgramW6432'] | C:\Program Files
$_SERVER['PSModulePath'] |
$_SERVER['PUBLIC'] |
$_SERVER['SystemDrive'] | C:
$_SERVER['SystemRoot'] | C:\Windows
$_SERVER['TEMP'] | C:\Windows\TEMP
$_SERVER['TMP'] | C:\Windows\TEMP
$_SERVER['USERDOMAIN'] | WORKGROUP
$_SERVER['USERNAME'] | SRV23-5DP-DE$
$_SERVER['USERPROFILE'] | C:\Windows\system32\config\systemprofile
$_SERVER['windir'] | C:\Windows
$_SERVER['ORIG_PATH_INFO'] | /index.php
$_SERVER['URL'] | /index.php
$_SERVER['SERVER_SOFTWARE'] | Microsoft-IIS/10.0
$_SERVER['SERVER_PROTOCOL'] | HTTP/1.1
$_SERVER['SERVER_PORT_SECURE'] | 1
$_SERVER['SERVER_PORT'] | 443
$_SERVER['SERVER_NAME'] | wbt.warius.info
$_SERVER['SCRIPT_NAME'] | /index.php
$_SERVER['SCRIPT_FILENAME'] | D:\web\WT21Git\webtrees\index.php
$_SERVER['REQUEST_URI'] | /admin/information
$_SERVER['REQUEST_METHOD'] | GET
$_SERVER['REMOTE_USER'] | no value
$_SERVER['REMOTE_PORT'] | 62907
$_SERVER['REMOTE_HOST'] |
$_SERVER['REMOTE_ADDR'] |
$_SERVER['QUERY_STRING'] | no value
$_SERVER['PATH_TRANSLATED'] | D:\web\WT21Git\webtrees\index.php
$_SERVER['LOGON_USER'] | no value
$_SERVER['LOCAL_ADDR'] | 85.215.178.206
$_SERVER['INSTANCE_META_PATH'] | /LM/W3SVC/1
$_SERVER['INSTANCE_NAME'] | WTPROD
$_SERVER['INSTANCE_ID'] | 1
$_SERVER['HTTPS_SERVER_SUBJECT'] | CN=wbt.warius.info
$_SERVER['HTTPS_SERVER_ISSUER'] | C=US, O=Let's Encrypt, CN=R3
$_SERVER['HTTPS_SECRETKEYSIZE'] | 2048
$_SERVER['HTTPS_KEYSIZE'] | 256
$_SERVER['HTTPS'] | on
$_SERVER['GATEWAY_INTERFACE'] | CGI/1.1
$_SERVER['DOCUMENT_ROOT'] | D:\web\WT21Git\webtrees
$_SERVER['CONTENT_TYPE'] | no value
$_SERVER['CONTENT_LENGTH'] | 0
$_SERVER['CERT_SUBJECT'] | no value
$_SERVER['CERT_SERIALNUMBER'] | no value
$_SERVER['CERT_ISSUER'] | no value
$_SERVER['CERT_FLAGS'] | no value
$_SERVER['CERT_COOKIE'] | no value
$_SERVER['AUTH_USER'] | no value
$_SERVER['AUTH_PASSWORD'] | no value
$_SERVER['AUTH_TYPE'] | no value
$_SERVER['APPL_PHYSICAL_PATH'] | D:\web\WT21Git\webtrees\
$_SERVER['APPL_MD_PATH'] | /LM/W3SVC/1/ROOT
$_SERVER['IIS_UrlRewriteModule'] | 7,1,1993,2351
$_SERVER['UNENCODED_URL'] | /admin/information
$_SERVER['IIS_WasUrlRewritten'] | 1
$_SERVER['HTTP_X_ORIGINAL_URL'] | /admin/information
$_SERVER['HTTP_SEC_FETCH_USER'] | ?1
$_SERVER['HTTP_SEC_FETCH_SITE'] | same-origin
$_SERVER['HTTP_SEC_FETCH_MODE'] | navigate
$_SERVER['HTTP_SEC_FETCH_DEST'] | document
$_SERVER['HTTP_UPGRADE_INSECURE_REQUESTS'] | 1
$_SERVER['HTTP_DNT'] | 1
$_SERVER['HTTP_USER_AGENT'] | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0
$_SERVER['HTTP_TE'] | trailers
$_SERVER['HTTP_REFERER'] | https://wbt.warius.info/admin
$_SERVER['HTTP_HOST'] | wbt.warius.info
$_SERVER['HTTP_COOKIE'] | __Secure-WT-ID=2e24ba5eb497d1bf0ec0132bacf8f5c5
$_SERVER['HTTP_ACCEPT_LANGUAGE'] | de,en-US;q=0.7,en;q=0.3
$_SERVER['HTTP_ACCEPT_ENCODING'] | gzip, deflate, br
$_SERVER['HTTP_ACCEPT'] | text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
$_SERVER['HTTP_CONTENT_LENGTH'] | 0
$_SERVER['HTTP_CONNECTION'] | close
$_SERVER['FCGI_ROLE'] | RESPONDER
$_SERVER['PHP_SELF'] | /index.php
$_SERVER['REQUEST_TIME_FLOAT'] | 1673171317.277
$_SERVER['REQUEST_TIME'] | 1673171317

fisharebest commented 1 year ago

Perhaps you could add some debug code here:

https://github.com/fisharebest/webtrees/blob/684a6f87b3ab6b4cc21e962b050c95eb9c0cea91/app/Validator.php#L63-L68

Write $key and $value to a log file. (If they contain invalid UTF characters, you probably cannot write them to the database).
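One way to follow this suggestion is to log the raw bytes as hex, so the log line stays plain ASCII even when the value is not valid UTF-8. This is a minimal sketch only: `debug_describe()` and the log file name are hypothetical and not part of webtrees.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical debug helper (not webtrees code): describe a request value
 * using only ASCII characters, so it can be logged safely.
 */
function debug_describe(string $key, string $value): string
{
    // preg_match() with the /u modifier returns false for invalid UTF-8 subjects.
    $valid = preg_match('//u', $value) === 1;

    // bin2hex() shows the exact bytes received, whatever their encoding.
    return sprintf('%s: valid_utf8=%s hex=%s', $key, $valid ? 'yes' : 'no', bin2hex($value));
}

// Append one line to a log file in the system temp directory (assumed location).
$line = debug_describe('surname', "P\xE4ch");
file_put_contents(sys_get_temp_dir() . '/webtrees-validator-debug.log', $line . PHP_EOL, FILE_APPEND);
```

A hex dump also makes it easy to tell a single-byte `e4` from the two-byte UTF-8 sequence `c3 a4`.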

FrankWarius commented 1 year ago

At line 67 I added:

$x = preg_match('//u', $value, $match);
throw new HttpBadRequestException('Invalid UTF-8 characters in request (' . $value . ')');

and inspected the variables with XDebug (on 2.1.15):

$match: array(0)
$value: "P�ch" 'P\xE4ch'
$x: false

fisharebest commented 1 year ago

If this is CP1252, then \xE4 is ä - Päch

Can you add both $value and $key to the debug?
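The CP1252 guess can be checked with a few lines of PHP. This is only an illustrative sketch, assuming the mbstring extension is available; the variable names are made up for the example.

```php
<?php

declare(strict_types=1);

// What the percent-encoded URL segment decodes to: bytes 50 C3 A4 63 68.
$fromUrl = rawurldecode('P%C3%A4ch');

// What the validator actually received: bytes 50 E4 63 68.
$received = "P\xE4ch";

var_dump(mb_check_encoding($fromUrl, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($received, 'UTF-8')); // bool(false)

// Reading the received bytes as Windows-1252 gives back the same name.
$recovered = mb_convert_encoding($received, 'UTF-8', 'Windows-1252');
var_dump($recovered === $fromUrl); // bool(true)
```

So the URL itself is valid UTF-8, and the value that fails validation looks like the same name after a conversion to a single-byte code page somewhere before PHP sees it.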

FrankWarius commented 1 year ago

$value: "P�ch" 'P\xE4ch'
$key: "surname"
URL now: https://wbt.warius.info/tree/Warius/branches/P%C3%A4ch

FrankWarius commented 1 year ago

It is related to pretty URLs on IIS. The same request without pretty URLs works: http://dev.warius.info/index.php?route=%2Ftree%2Ftree1%2Fbranches%2FP%25C3%25A4ch&soundex_dm=0&soundex_std=0

FrankWarius commented 1 year ago

Request URL: https://wbt.warius.info/tree/Warius/branches/P%C3%A4ch
Request method: GET
Status code: 500
Remote address: 85.215.178.206:443
Referrer policy: strict-origin-when-cross-origin
cache-control: no-store, no-cache, must-revalidate
content-encoding: gzip
content-length: 649
content-type: text/html; charset=UTF-8
date: Sun, 08 Jan 2023 15:51:58 GMT
expires: Thu, 19 Nov 1981 08:52:00 GMT
pragma: no-cache
server: Microsoft-IIS/10.0
vary: Accept-Encoding
x-powered-by: PHP/8.1.14
:authority: wbt.warius.info
:method: GET
:path: /tree/Warius/branches/P%C3%A4ch
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
cache-control: no-cache
cookie: __Secure-WT-ID=9f965da74fe2d009df90a681f0abb14e
dnt: 1
pragma: no-cache
sec-ch-ua: "Not?A_Brand";v="8", "Chromium";v="108", "Microsoft Edge";v="108"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.76

FrankWarius commented 1 year ago

The cause is the IIS URL Rewrite module, which decodes REQUEST_URI when rewriting.

XDebug shows the following server variables:

REQUEST_URI: "/tree/Warius/branches/P�ch" (wrong code page)
UNENCODED_URL: "/tree/Warius/branches/P%C3%A4ch" (this is the one that should be used)
HTTP_X_ORIGINAL_URL: "/tree/Warius/branches/P%C3%A4ch" (also correct)

Webtrees should use UNENCODED_URL on IIS.

I can also change the rewrite rule, but I need some information first. The current rewrite action is <action type="Rewrite" url="index.php" appendQueryString="true" />

I could pass the unencoded URL to index.php, but I don't know in what form webtrees needs it. Something like <action type="Rewrite" url="index.php?{UNENCODED_URL}" appendQueryString="false" />?
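On the PHP side, the idea of preferring UNENCODED_URL could be sketched roughly as follows. This is an illustration only: `encoded_request_uri()` is a hypothetical helper, not how webtrees builds its request, and it assumes that UNENCODED_URL is set only by IIS itself.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not webtrees code): return the still percent-encoded
 * request target from a $_SERVER-style array.
 *
 * @param array<string,mixed> $server
 */
function encoded_request_uri(array $server): string
{
    // Prefer the IIS variable that keeps the original encoding; fall back to REQUEST_URI.
    foreach (['UNENCODED_URL', 'REQUEST_URI'] as $name) {
        $candidate = $server[$name] ?? '';

        if (is_string($candidate) && $candidate !== '') {
            return $candidate;
        }
    }

    return '/';
}

// Values as reported by XDebug in this thread.
$uri = encoded_request_uri([
    'REQUEST_URI'   => "/tree/Warius/branches/P\xE4ch",
    'UNENCODED_URL' => '/tree/Warius/branches/P%C3%A4ch',
]);
```

On servers that do not set UNENCODED_URL, the helper simply returns REQUEST_URI, so behaviour elsewhere would be unchanged.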

FrankWarius commented 1 year ago

Fixed by adding <set name="REQUEST_URI" value="{UNENCODED_URL}" /> to the serverVariables of the IIS 10 URL Rewrite rule.

Complete rule:

<rule name="Webtrees Rewrite" enabled="true" stopProcessing="true">
    <match url="^" ignoreCase="false" />
    <conditions logicalGrouping="MatchAll" trackAllCaptures="false">
        <add input="{REQUEST_FILENAME}" matchType="IsDirectory" negate="true" />
        <add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
    </conditions>
    <action type="Rewrite" url="index.php" appendQueryString="true" logRewrittenUrl="false" />
    <serverVariables>
        <set name="REQUEST_URI" value="{UNENCODED_URL}" />
    </serverVariables>
</rule>

Should we update the documentation?

fisharebest commented 1 year ago

There are two parts to this issue.

1) webtrees detects this invalid character, and tries to give a 400 Bad Request response.

Currently, we check that the headers contain valid UTF-8. I think we should be more strict. The headers should be 7-bit ASCII.

2) the error page generates a similar error - and this gives a 500 response.

This needs to be fixed, so that we can give the correct 400 response and error message.
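For part 1, a stricter check might look like the sketch below. It is an assumption about how such a test could be written, not the change that was made in webtrees; `is_seven_bit_ascii()` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical check (not webtrees code): true when every byte of the
 * value is in the 7-bit ASCII range 0x00-0x7F.
 */
function is_seven_bit_ascii(string $value): bool
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/\A[\x00-\x7F]*\z/', $value) === 1;
}

var_dump(is_seven_bit_ascii('/tree/Warius/branches/P%C3%A4ch')); // bool(true)
var_dump(is_seven_bit_ascii("P\xE4ch"));                         // bool(false)
var_dump(is_seven_bit_ascii("P\xC3\xA4ch"));                     // bool(false)
```

The last case shows the difference from the current rule: a correctly encoded UTF-8 value passes a UTF-8 check but is rejected by a 7-bit ASCII check.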

FrankWarius commented 1 year ago

2 Notes:

  1. This is no longer an issue about old search bot requests. The error (on IIS with pretty URLs) occurs when querying family branches for names containing umlauts, e.g. https://wbt.warius.info/tree/Warius/branches/P%C3%A4ch?soundex_dm=0&soundex_std=0
  2. On each call of Validator.php __construct, all DB parameters from config.ini.php are checked again (about 10 iterations until the error occurs). This raises the question of whether the repetition within a session is necessary. More important is whether we want to restrict the DB attributes, especially dbpass, to 7-bit ASCII.