benbusby / whoogle-search

A self-hosted, ad-free, privacy-respecting metasearch engine
https://pypi.org/project/whoogle-search/
MIT License

[BUG] MarkupResemblesLocatorWarning #967

Closed by glitsj16 1 year ago

glitsj16 commented 1 year ago

Describe the bug

After running a search, I see the following on the command line:

/home/glitsj16/whoogle-search/app/utils/results.py:99: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  element.replace_with(BeautifulSoup(

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/benbusby/whoogle-search.git
  2. cd whoogle-search
  3. python3 -m venv venv
  4. source venv/bin/activate
  5. pip install -r requirements.txt
  6. ./run
  7. Run a search in the web interface
  8. See the above message in CLI

Additional context

$ pacman -Q python
python 3.10.9-1

$ python3 -um app --debug
```
 * Serving Flask app 'app'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 117-182-872
/home/glitsj16/whoogle-search/app/utils/results.py:99: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  element.replace_with(BeautifulSoup(
127.0.0.1 - - [06/Mar/2023 17:47:52] "POST /search HTTP/1.1" 200 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/search.3e5a8ad9.css HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/logo.72c3bd56.css HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/input.61ccbb50.css HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/dark-theme.b0749774.css HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/header.978026e5.css HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/header.a12e0a24.js HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/autocomplete.1661f315.js HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/utils.b8afbbaa.js HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/keyboard.890853c5.js HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/build/currency.3dde589d.js HTTP/1.1" 304 -
127.0.0.1 - - [06/Mar/2023 17:47:53] "GET /static/img/favicon.ico HTTP/1.1" 304 -
^C
```

glitsj16 commented 1 year ago

UPDATE

Did some digging and I think this is new behaviour in Beautiful Soup. When I drop the requirement from the current beautifulsoup4==4.11.2 to the former beautifulsoup4==4.10.0, there is no such warning.
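To confirm which release is actually installed in the virtualenv before and after changing the pin, something like the following could be used. It is only a sketch built on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration and is not part of whoogle-search.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed beautifulsoup4 version string, or None if missing.
    print("beautifulsoup4:", installed_version("beautifulsoup4"))
```

Running this inside the activated venv from the reproduction steps should show whether the 4.11.2 or the 4.10.0 pin is in effect.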

Looking at the changelog, there's mention of:

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

Here's the relevant bug report.

I'm not sure what to make of the warning, but it seems pretty harmless. I could live with it, but I'm running monitoring shell scripts on my private whoogle-search instances and this keeps triggering alerts. For now I've added a small patch to results.py to silence this particular class of warnings:


--- a/app/utils/results.py
+++ b/app/utils/results.py
@@ -8,6 +8,10 @@
 import urllib.parse as urlparse
 from urllib.parse import parse_qs
 import re
+from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
+import warnings
+
+warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

 SKIP_ARGS = ['ref_src', 'utm']
 SKIP_PREFIX = ['//www.', '//mobile.', '//m.', 'www.', 'mobile.', 'm.']

Hope this helps...
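For anyone who wants to see the effect of that filter in isolation, here is a minimal sketch using only the standard library. `StandInLocatorWarning`, `noisy_parse` and `count_warnings` are hypothetical names invented for this example; the stand-in class takes the place of bs4's `MarkupResemblesLocatorWarning` so the snippet runs without Beautiful Soup installed.

```python
import warnings


class StandInLocatorWarning(UserWarning):
    """Hypothetical stand-in for bs4's MarkupResemblesLocatorWarning."""


def noisy_parse(text: str) -> str:
    """Hypothetical function that warns the way a parser might on odd input."""
    warnings.warn("The input looks more like a filename than markup.",
                  StandInLocatorWarning)
    return text


def count_warnings(ignore: bool) -> int:
    """Call noisy_parse once and return how many warnings were recorded."""
    with warnings.catch_warnings(record=True) as caught:
        # Start from a known state in which every warning is shown.
        warnings.simplefilter("always")
        if ignore:
            # Same call shape as in the patch, applied to the stand-in category.
            warnings.filterwarnings("ignore", category=StandInLocatorWarning)
        noisy_parse("example.txt")
    return len(caught)
```

With `ignore=False` the warning is recorded once; with `ignore=True` nothing is recorded, because the most recently added filter is consulted first.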

benbusby commented 1 year ago

Thanks for the info! I just implemented the solution you described for now, since the warnings are indeed pretty harmless.
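If a module-wide filter ever turns out to be too broad, a narrower variant would be to ignore the category only around the one call that triggers it. The sketch below is an assumption about how that could look, not what was committed; `call_quietly`, `noisy_upper` and `StandInLocatorWarning` are hypothetical names, with the stand-in class replacing bs4's `MarkupResemblesLocatorWarning` so the example needs only the standard library.

```python
import warnings
from typing import Callable, Type, TypeVar

T = TypeVar("T")


class StandInLocatorWarning(UserWarning):
    """Hypothetical stand-in for bs4's MarkupResemblesLocatorWarning."""


def call_quietly(func: Callable[..., T], *args, category: Type[Warning]) -> T:
    """Run func(*args) with one warning category ignored, then restore filters."""
    with warnings.catch_warnings():
        # The filter added here is discarded when the with-block exits.
        warnings.simplefilter("ignore", category=category)
        return func(*args)


def noisy_upper(text: str) -> str:
    """Hypothetical stand-in for the BeautifulSoup(...) call in results.py."""
    warnings.warn("The input looks more like a filename than markup.",
                  StandInLocatorWarning)
    return text.upper()
```

The trade-off is that every call site has to be wrapped, whereas the module-level filter covers all of them at once.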

mnalis commented 9 months ago

(FYI reported upstream at https://bugs.launchpad.net/beautifulsoup/+bug/2052988)