LadybirdBrowser / ladybird

Truly independent web browser
https://ladybird.org
BSD 2-Clause "Simplified" License
18.38k stars 726 forks source link

Using address bar to search for a single word fails #142

Open hughdavenport opened 2 months ago

hughdavenport commented 2 months ago

When a search engine is enabled, if I type in the address bar a single word then the page loads as https://`thatword`/. However if I type multiple words it works as expected.

If I put in the following dbgln's

diff --git a/Ladybird/Qt/LocationEdit.cpp b/Ladybird/Qt/LocationEdit.cpp
index 7dc4336331..3a23be952c 100644
--- a/Ladybird/Qt/LocationEdit.cpp
+++ b/Ladybird/Qt/LocationEdit.cpp
@@ -38,6 +38,8 @@ LocationEdit::LocationEdit(QWidget* parent)
             search_engine_url = Settings::the()->search_engine().query_url;

         auto query = ak_string_from_qstring(text());
+        dbgln("query={}, engine_url={}", query, search_engine_url);
+        dbgln("sanitized_url={}", WebView::sanitize_url(query, search_engine_url));

         if (auto url = WebView::sanitize_url(query, search_engine_url); url.has_value())
             setText(qstring_from_ak_string(url->serialize()));

The I get the following in the logs:

294966.228 Ladybird(680721): query=a, engine_url=https://duckduckgo.com/?q={}
294966.228 Ladybird(680721): sanitized_url=https://a/
294966.280 RequestServer(680723:680748): ConnectionCache: Connection to https://a/ failed: Temporary failure in name resolution
294966.280 RequestServer(680723:680748): Request with a null socket finished for URL https://a/
294966.280 WebContent(680757): ResourceLoader: Failed load of: "https://a/", Error: Load failed, Duration: 45ms
294973.648 Ladybird(680721): query=a a, engine_url=https://duckduckgo.com/?q={}
294973.648 Ladybird(680721): sanitized_url=https://duckduckgo.com/?q=a%20a

So the issue is likely in WebView::sanitize_url

hughdavenport commented 2 months ago

Looks like it is due to the choice of automatically adding https:// if no scheme is present. See https://github.com/LadybirdBrowser/ladybird/blob/2f5cf8ac204a58dc2a6f722dd95015c6c2fb7a78/Userland/Libraries/LibWebView/URL.cpp#L65-L66

I guess another check for the validity of a URL is to check the host actually resolves?

hughdavenport commented 2 months ago

I had two morning sleep thoughts

joekohlsdorf commented 2 months ago

I had two morning sleep thoughts

  • a single word is usually not a valid host, as it doesn't have a valid subdomain.TLD format
  • but it could be valid on a local network
  • or an entry in the users hosts file

It looks like Chromium recognizes localhost as URL but it doesn't recognize anything from /etc/hosts. To not overcomplicate things I'd start with a list of words or regexes (to also match IPs) which shouldn't search.

hughdavenport commented 2 months ago

ah yeh, firefox is the same, I've done a bit more testing of various cases.

One example of my use-case is when I put ladybird into the address bar and have search enabled, I would expect (as a user) that it searches for the term ladybird rather than go to non-existent https://ladybird/.

For what it's worth, both chrome and firefox seem to do some checking of validity of TLD's with schemeless adresses with a second part., like foo.abc seems to try to go to foo.abc, but foo.abcd searches. And foo.def does not search. Ladybird attempts to go to https://foo.abc and https://foo.abcd respectively. .abc has a valid SOA record, and .def does not

I'll focus only on dotless domains from now on. For single words with dots in them, as a user of other browsers I actually find it annoying having it search for something with with a dot. It is usually a typo in my case, and when it searches then it isn't as easy just editing the the URL bar and changing some character(s), so I think the current behaviour of ladybird should stay..

It seems testing with username, password, and path segments for single word all get searched in chromium. In firefox it is the same except for having a password part it will always go to the host (for security reasons most likely). Weirdly having an empty path would go to the host (e.g. foo/ will go to https://foo/), while having a non empty path searches (e.g. foo/bar searches). If a port segment is present, then both chromium and firefox go to the host, except firefox will search if it also has a username but not a password (e.g. a@foo:123/bar will search, while both foo:123/bar and a:secret@foo:123/bar will go to the host`).

In my opinion (as a user), all of those cases should go to the host (even if non existent) as it is likely what the user intended. Information about URL's and URI's are available at https://en.wikipedia.org/wiki/URL#Syntax and https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax. If following the spec precisely all of these cases are actually invalid URI's as you need the // before host, rather than the commonly misunderstood having the // as part of the scheme part. But user experience is really what this ticket is about.

It seems dotless domains are frowned upon (https://en.wikipedia.org/wiki/Top-level_domain#Dotless_domains)

As the dotless domains are frowned upon, I would say having an allow-list of dotless domains to use as URL's, and the rest should be searched. A simple start would would to that list would be:

The latter two may just be done for free with ladybird's processing.

To summarise: For single words with no dots and no scheme/port/username/password/path etc, dd an allow-list which contain's localhost (and any others people suggest), along with a valid IP address (v4 or v6). Everything else should get searched for.

In the future perhaps this should be extended to include anything that resolves (locally via hosts and/or globally via DNS).

Apologies for the long post, but I thought it would be good to have thorough research before starting anything. If this all sounds good to folks I can make a PR sometime soon.

neil-and-void commented 18 hours ago

.abc has a valid SOA record, and .def does not

It would perhaps be nice to check for a TLD against a list like this to determine whether a schemaless domain coerces to a full URL or uses a search engine. Not sure if this should be a list that is dynamically retrieved on something like a monthly basis or should just exist in the project.