brave / goggles-quickstart

Educational material to learn about Goggles and how to create your own.
https://search.brave.com/goggles

Feature Requests: Goggles DSL Implementation #12

Open Ompanime opened 2 years ago

Ompanime commented 2 years ago

I’m really impressed with the new Goggles feature and believe it to be a very powerful tool for refining search results. I’m currently working on a few and have published one for public use. I know Goggles are still new and not all intended features are working or implemented yet, but I have found two shortcomings that are keeping me from creating a more complex and refined Goggle.

The first is that I can’t specifically target the domain name when discarding websites; instead, the code looks for the keyword everywhere in the URL, including the subdirectory. There are websites I would like to discard that have specific wording in their domain, but I don’t want to remove dictionary results that also have that wording in their URL subdirectory.

For example, instead of individually discarding every website with the word “Hollywood” in its domain, I would like to be able to discard every website with that keyword in one line of code without the code also targeting the subdirectory in the URL. This way results from news outlets and dictionaries that have that same keyword in their subdirectory will remain.

Secondly, I am aware that when writing a Goggle it’s possible to have lines of code that conflict with one another. Although not intended, I understand people want their code/Goggle to work a certain way. In such cases I believe it would be nice to have an exception rule.

For example, if the code was:

$downrank=3,site=example.abc
/posts/$boost=3

There could be an exception that would apply to “example.abc” where they have a subdirectory of /posts/ (example.abc/posts/)–allowing for only those specific results to be boosted despite the website itself being downranked.

I hope my feedback helps improve this already powerful technology. And thank you for all the great innovation you’ve brought us, such as Brave Goggles–which I think have become one of my favorite features!

remusao commented 2 years ago

Hi @Ompanime,

Thanks a lot for your encouraging words and great feedback. Regarding the first feedback, could you please give an example? I am not certain to have understood what you are looking for.

Regarding the second feedback, I think the behavior you describe should already be the default behavior implemented by Goggles. Given the rules:

$downrank=3,site=example.abc
/posts/$boost=3

The URL https://example.abc/posts/foobar.html will get a boost of 3 but any other page will get downranked.
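The outcome described above can be sketched in Python. This is only an illustration under stated assumptions: the `Rule` class, the plain substring matching, and the policy that a matching boost wins over a matching downrank are hypothetical simplifications chosen to reproduce the behavior described here, not the actual Goggles implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class Rule:
    """Hypothetical, simplified model of one Goggles instruction."""
    pattern: str                 # substring to find in the URL ("" matches any URL)
    action: str                  # "boost" or "downrank"
    strength: int
    site: Optional[str] = None   # value of the site= option, if present


def matches(rule: Rule, url: str) -> bool:
    """Check the optional site= restriction, then the URL pattern."""
    host = urlsplit(url).hostname or ""
    if rule.site is not None:
        if host != rule.site and not host.endswith("." + rule.site):
            return False
    return rule.pattern in url


def resolve(rules: List[Rule], url: str) -> Optional[Tuple[str, int]]:
    """Assumed policy: any matching boost takes precedence over a downrank."""
    hits = [r for r in rules if matches(r, url)]
    boosts = [r.strength for r in hits if r.action == "boost"]
    if boosts:
        return ("boost", max(boosts))
    downranks = [r.strength for r in hits if r.action == "downrank"]
    if downranks:
        return ("downrank", max(downranks))
    return None


# The two instructions from the example above.
rules = [
    Rule(pattern="", action="downrank", strength=3, site="example.abc"),
    Rule(pattern="/posts/", action="boost", strength=3),
]

print(resolve(rules, "https://example.abc/posts/foobar.html"))
print(resolve(rules, "https://example.abc/about.html"))
```

Under these assumptions the first URL resolves to a boost and the second to a downrank, which is the exception-like behavior requested in the original post.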

Ompanime commented 2 years ago

> Hi @Ompanime,
>
> Thanks a lot for your encouraging words and great feedback. Regarding the first feedback, could you please give an example? I am not certain to have understood what you are looking for.

Yes, indeed! For example, I created a Goggle to mitigate the influence of celebrity culture. I've noticed that a lot of the websites I would like to remove from the results have the keyword "Hollywood" in the domain, and I would like to $discard them all with a single line of code such as hollywood$inurl,$discard, instead of listing each website individually. The command $inurl targets the entire URL (i.e. the scheme, subdomain, second-level domain, top-level domain, and the subdirectory), as its name suggests.

[Image: Parts of a URL]

The issue is that I don't want to discard results from educational websites that include the keyword "hollywood" in their subdirectory such as: example.abc/wiki/hollywood

I hope this example helps!
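The distinction being asked for, keyword in the domain versus keyword in the subdirectory, can be made concrete with Python's standard `urllib.parse`. This is a sketch for illustration only: `keyword_location` is a hypothetical helper, the domains are made up, and nothing here reflects how Goggles matches URLs internally.

```python
from urllib.parse import urlsplit


def keyword_location(url: str, keyword: str) -> set:
    """Report whether `keyword` occurs in the hostname, the path, or both."""
    parts = urlsplit(url)
    keyword = keyword.lower()
    found = set()
    # urlsplit already lowercases the hostname; it is None for URLs without a host.
    if keyword in (parts.hostname or ""):
        found.add("hostname")
    if keyword in parts.path.lower():
        found.add("path")
    return found


# A gossip site with the keyword in its domain (hypothetical domain).
print(keyword_location("https://www.hollywoodgossip.abc/news", "hollywood"))
# An educational page with the keyword only in its subdirectory.
print(keyword_location("https://example.abc/wiki/hollywood", "hollywood"))
```

A rule that matches anywhere in the URL treats both of these results the same way; a rule limited to the hostname would only affect the first.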

devidw commented 2 years ago

@Ompanime just a short note on the syntax, if I understand, instead of

hollywood$inurl,$discard

It should be only one $ and the action and all options behind this, comma separated:

hollywood$discard,inurl

Referencing the example screenshot from the getting started section Fine-tuning a Goggle.

But I am not actually sure whether the inurl option is already available, since in the quick start it says:

! […] but we will add the
! ability to match other aspects of a page too, in the future:
!
! web3$inurl


And for what you are planning to do, I think it can be accomplished similar to this goggle: first discarding all and then whitelisting.

So I imagine something like the following should do the job?

hollywood$discard
hollywood$site=i-want-to-keep-this.site
! ...

Ompanime commented 2 years ago

> But I am not actually sure whether the inurl option is already available, since in the quick start it says:
>
> ! […] but we will add the
> ! ability to match other aspects of a page too, in the future:
> !
> ! web3$inurl

Technically, all instructions automatically target the URL as written in the quick-start guide, meaning no matter what filter attribute you designate (i.e. $intitle, $incontent, etc.) the actions (i.e. $discard, $boost=XX, $downrank=XX) will only apply to the URL. I still specifically include the $inurl command so my code won't be broken once they update the Goggles codebase to include other attribute targeting options.

! Another set of options can be used to indicate what you want your instruction
! to target. By default any instruction will apply to a URL, but we will add the
! ability to match other aspects of a page too, in the future:
!
! web3$inurl
! web3$intitle
! web3$indescription
! web3$incontent


> And for what you are planning to do, I think it can be accomplished similar to this goggle: first discarding all and then whitelisting.
>
> So I imagine something like the following should do the job?
>
> hollywood$discard
> hollywood$site=i-want-to-keep-this.site
> ! ...

I am aware that the generic $discard action, applied without a target, will discard all other results that don't match the rules set out in the Goggle. But it is my understanding that this code will remove every site that is not specifically boosted or "whitelisted" in the Goggle--and I don't want to do that! I want people who use my Goggles to be able to browse the internet as they normally would while also benefiting from the removal of certain content they don't want showing up in their search results.

Additionally, I should mention that you're right about my syntax choice--only one "$" character is needed per instruction, with every subsequent action/attribute option following it separated by the "," character. I've already written a lot of my code with multiple "$" characters in each instruction, and I don't want to tediously change every single line. Plus my code still seems to work fine, so unless doing this breaks my code, I likely won't go back and change it.

Anyways, thank you for your feedback, @devidw!

remusao commented 2 years ago

@Ompanime thanks again for the very detailed feedback and explanation. If I understood correctly, something like hollywood$inhostname (syntax is not final, just to confirm I got the idea right)?

Also, @devidw is correct that instructions with multiple $ symbols are not valid. So hollywood$inurl,$discard should be written as hollywood$inurl,discard; otherwise the rule will only discard results that contain the literal text hollywood$inurl, in their URL (which will likely never happen).
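To see why the extra $ changes the meaning, here is a minimal Python sketch. It assumes the pattern and the options are separated at the last $ in the line, which is consistent with the behavior described above but is not taken from the Goggles source; `split_instruction` is a hypothetical helper.

```python
def split_instruction(line: str):
    """Split an instruction into (pattern, options) at the LAST '$'.

    Assumption for illustration: everything before the last '$' is the
    pattern to match, everything after it is a comma-separated option list.
    """
    pattern, sep, options = line.rpartition("$")
    if not sep:
        # No '$' at all: the whole line is a pattern with no options.
        return line, []
    return pattern, [opt for opt in options.split(",") if opt]


# Two '$' symbols: the first one ends up inside the pattern.
print(split_instruction("hollywood$inurl,$discard"))
# One '$' symbol: the pattern is just the keyword.
print(split_instruction("hollywood$inurl,discard"))
```

Under this assumption the first line yields the pattern `hollywood$inurl,` with the single option `discard`, while the second yields the pattern `hollywood` with the options `inurl` and `discard`.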

Ompanime commented 2 years ago

@remusao yes, that's the right idea! Originally, the way I've been writing my code to try to more specifically target the domain of a URL has been by incorporating the use of anchors and the $inurl command--similar to this:

|bollywood$discard,inurl
|broadway$discard,inurl
|celebs$discard,inurl
|gossip$discard,inurl
|hollywood$discard,inurl

celebs|$discard,inurl
gossip|$discard,inurl
hollywood|$discard,inurl

I'm not sure if it would have worked the way I would like it to (as I have yet to upload my updated code), but my idea was that by using anchors to target specific keywords and their placement within the URL, I would hopefully single out the domain and subsequently delist all domains with those keywords. But having a command such as $inhostname to specifically target the domain of a URL would be better, as I fear my current method may accidentally find these keywords in the subdirectory and remove results from educational websites (e.g. .edu sites and wikis) and dictionary websites! If the command $inhostname would be implemented into the Goggles codebase, do you know if anchoring would be possible, as I've demonstrated in my examples mentioned above (i.e. |hollywood$discard,inhostname, celebs|$discard,inhostname, etc)?

On a side note, I did actually update the syntax of my code as @devidw pointed out, to align with the screenshot he linked me to. As it turns out, after looking at my code, it was doing exactly what was just mentioned (discarding results that contain "hollywood$inurl,").

Ompanime commented 2 years ago

@remusao I updated my Goggles code a few days ago, hoping that using anchors to target specific keywords along with their placement within the URL would do the trick and remove all domains that contain such keywords, instead of having to specify each domain individually using the $discard,site= command.

With my No Celebrity Goggles on, I tested whether the anchoring method described in my last post would work. Unfortunately, neither anchor placement (left or right) on the keyword gave the desired results. While hovering over the line of code does show that the code is active, results containing those specific keywords are not filtered out unless the domain is specifically listed using the $discard,site= command.

While not definite, I believe the $inurl attribute is causing this issue, as I figure it's looking for those keywords in their specified placement (as determined by the right and left anchors) across the entire URL (i.e. www.example.abc/example), instead of in the second-level domain, which is where I want it to look (i.e. www.**example**.abc/example)--as indicated in between the two **!

For example, instead of looking for the left-anchored keyword at the beginning of the second-level domain (i.e. https://www.**example**.abc/example)--also indicated in between the two **--it looks for it at the beginning of the URL, starting with the scheme (i.e. **https**://www.example.abc/example) or the subdomain (i.e. **www**.example.abc/example). Similarly, I believe the same thing is happening for right-anchored keywords, but instead it's looking for that keyword at the end of the URL, either in the subdirectory (i.e. www.example.abc/**example**) or in the top-level domain (i.e. www.example.**abc**)--when I actually want it to look for that keyword in the second-level domain, as mentioned before!

Sorry if that sounds confusing. But here are some screenshots, feel free to try to replicate this for yourself using my No Celebrity Goggles!

[Screenshot: No-Celebrity-Goggles-Goggles-Brave-Search]
[Screenshot: No-Celebrity-Goggles-Goggles-Brave-Search 2]
[Screenshot: coolcelebs-Brave-Search 4]
[Screenshot: the-hollywood-Brave-Search 3]

remusao commented 2 years ago

> If the command $inhostname would be implemented into the Goggles codebase, do you know if anchoring would be possible, as I've demonstrated in my examples mentioned above (i.e. |hollywood$discard,inhostname, celebs|$discard,inhostname, etc)?

Yes, this would work. The idea of the $<context> modifiers such as $inurl is to allow you to match with the same syntax of instructions (e.g. | anchors, etc.) but on a different part of the result (i.e. whole URL, hostname, etc.).

Thanks for the last message, it clarifies what I understood you needed from Goggles. At the moment you are correct that the left and right "anchors" will match the beginning and end of the URL (including the scheme like https://). What you need is something like $inhostname that we have discussed above.

We did not yet get to implementing this but it's definitely on our TODO list of things to add to Goggles in the future. Thanks again for all the detailed feedback.
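The difference between anchoring against the whole URL and anchoring against the hostname can be sketched in Python. This is an illustration only: $inhostname is not implemented and its syntax is not final, the helper functions and domains below are hypothetical, and the anchor handling is a simplified reading of the | syntax discussed in this thread.

```python
from urllib.parse import urlsplit


def anchored_match(pattern: str, text: str) -> bool:
    """Match a keyword with optional '|' anchors against `text`.

    A leading '|' pins the keyword to the start of the text,
    a trailing '|' pins it to the end; otherwise match anywhere.
    """
    left = pattern.startswith("|")
    right = pattern.endswith("|") and len(pattern) > 1
    keyword = pattern[1:] if left else pattern
    keyword = keyword[:-1] if right else keyword
    if left and right:
        return text == keyword
    if left:
        return text.startswith(keyword)
    if right:
        return text.endswith(keyword)
    return keyword in text


def match_in_url(pattern: str, url: str) -> bool:
    """Current behavior as described: anchors apply to the full URL."""
    return anchored_match(pattern, url)


def match_in_hostname(pattern: str, url: str) -> bool:
    """Hypothetical $inhostname scope: anchors apply to the hostname only."""
    return anchored_match(pattern, urlsplit(url).hostname or "")


url = "https://hollywoodnews.abc/story"
print(match_in_url("|hollywood", url))       # the URL starts with "https://"
print(match_in_hostname("|hollywood", url))  # the hostname starts with the keyword
```

Even in this sketch a hostname scope does not isolate the second-level domain: a `www.` prefix would still defeat a left anchor, and a right anchor such as `celebs|` would be compared against the end of the hostname, which is the top-level domain. An unanchored keyword restricted to the hostname may therefore be closer to what was requested.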