BrightDV / BoxBox

Unofficial Android and web app for Formula 1 and Formula E fans!
https://codeberg.org/BrightDV/BoxBox
GNU General Public License v3.0

[FEATURE] Search articles #41

Closed BrightDV closed 1 year ago

BrightDV commented 1 year ago

Is your feature request related to a problem? Please describe. /

Describe the solution you'd like A search functionality, at least for articles.

Describe alternatives you've considered /

Additional context

26

BrightDV commented 1 year ago

For now, the app searches for articles using SearXNG instances. The selected instances can return results in JSON format, which avoids scraping. However, the requests are often blocked by rate limits. To restrict the results to articles, the app adds a filter to the query: it searches for "formula1.com/en/latest/article" $query, so the URL of each result must contain the string between the double quotes. Thus, it only returns articles.
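To make the query filter concrete, here is a minimal Python sketch (not the app's actual code). It assumes a SearXNG instance that exposes `/search` with `format=json` and returns a `results` list whose entries carry `url` and `title` fields; the helper names `build_search_url` and `filter_articles` and the instance URL are hypothetical.

```python
import json
from urllib.parse import urlencode

# String that must appear in a result URL for it to count as an article.
ARTICLE_PREFIX = "formula1.com/en/latest/article"


def build_search_url(instance: str, query: str) -> str:
    """Build a JSON search URL for one SearXNG instance (hypothetical helper)."""
    params = {"q": f'"{ARTICLE_PREFIX}" {query}', "format": "json"}
    return f"{instance.rstrip('/')}/search?{urlencode(params)}"


def filter_articles(payload: str) -> list[dict]:
    """Keep only results whose URL contains the article prefix."""
    data = json.loads(payload)
    return [
        {"title": r.get("title", ""), "url": r["url"]}
        for r in data.get("results", [])
        if ARTICLE_PREFIX in r.get("url", "")
    ]
```

The second filter on the client side is a safety net: a search engine may still return pages that only mention the quoted string, so the URL check is repeated on the results.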

One workaround is to use the Formula 1 RSS feed and search within it. So far, I haven't found a way to get more than 22 articles from it. Furthermore, I don't think that fetching 1000 articles and then searching among them is a good solution, as it would use a lot of bandwidth and be very slow.
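For illustration, searching an already-downloaded RSS feed could look like the sketch below. It assumes a standard RSS 2.0 layout (`item` elements with `title`, `description` and `link` children); the function name `search_feed` is hypothetical, and fetching the feed is left out.

```python
import xml.etree.ElementTree as ET


def search_feed(rss_xml: str, query: str) -> list[dict]:
    """Return feed items whose title or description contains the query."""
    root = ET.fromstring(rss_xml)
    needle = query.casefold()
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive substring match on the two text fields.
        if needle in title.casefold() or needle in description.casefold():
            hits.append({"title": title, "link": item.findtext("link", default="")})
    return hits
```

This only searches what the feed contains, so with a feed limited to 22 items the results cover just the most recent articles.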

sinfullad commented 1 year ago

For now, the app searches for articles using SearXNG instances. The selected instances can return results in JSON format, which avoids scraping.

Sorry for the dumb question, but what do you mean by avoid scraping in this context?

Also, in the worst-case scenario where all of the selected instances go down, are there any search engines you plan to use as a fallback, or will you use other instances? So far I have found Metager (a metasearch engine similar to SearX), Mojeek (UK, uses its own crawler), and Swisscows (data center in Switzerland, uses Bing Search and Bing Ads, though it uses its own indexes for Germany) to be viable options as well.

BrightDV commented 1 year ago

Sorry for the dumb question, but what do you mean by avoid scraping in this context?

I don't like fetching a page and then extracting the content from it, but I will check whether the rate limits still apply in that case. If they don't, I will add scraping as a fallback for when no results are found using the first method.

Also, in the worst-case scenario where all of the selected instances go down, are there any search engines you plan to use as a fallback, or will you use other instances? So far I have found Metager (a metasearch engine similar to SearX), Mojeek (UK, uses its own crawler), and Swisscows (data center in Switzerland, uses Bing Search and Bing Ads, though it uses its own indexes for Germany) to be viable options as well.

Thanks for these suggestions! However, I chose SearXNG because its backend is open-source, even though the engines you suggest are designed to be privacy-friendly. With scraping, there are up to 106 instances available, so I am going to try this approach.

BrightDV commented 1 year ago

The good news is that requesting the page in HTML format is not rate limited, so it will work better. I have implemented basic scraping as a fallback for when the five previous requests fail, but I will improve it later.
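The fallback described above might be sketched as follows. This is an illustration under assumptions, not the app's implementation: the five earlier requests are taken to be JSON requests to five instances, `fetch` is a caller-supplied function that returns the response body for a URL (so the network layer stays out of the sketch), and since the HTML structure of a SearXNG results page is not described here, the scraper simply collects every link whose `href` contains the article prefix.

```python
import json
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urlencode

ARTICLE_PREFIX = "formula1.com/en/latest/article"
MAX_JSON_ATTEMPTS = 5  # assumption: the "five previous requests"


class ArticleLinkParser(HTMLParser):
    """Collect unique hrefs that point at Formula 1 articles."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if ARTICLE_PREFIX in href and href not in self.links:
            self.links.append(href)


def scrape_links(html: str) -> list[str]:
    parser = ArticleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def search(query: str, instances: list[str], fetch: Callable[[str], str]) -> list[str]:
    """Try the JSON API first, then fall back to scraping the HTML page."""
    q = f'"{ARTICLE_PREFIX}" {query}'
    for instance in instances[:MAX_JSON_ATTEMPTS]:
        url = f"{instance}/search?{urlencode({'q': q, 'format': 'json'})}"
        try:
            data = json.loads(fetch(url))
        except (OSError, ValueError):  # blocked request or non-JSON body
            continue
        results = data.get("results", []) if isinstance(data, dict) else []
        links = [r["url"] for r in results if ARTICLE_PREFIX in r.get("url", "")]
        if links:
            return links
    for instance in instances:  # fallback: plain HTML results page
        url = f"{instance}/search?{urlencode({'q': q})}"
        try:
            links = scrape_links(fetch(url))
        except OSError:
            continue
        if links:
            return links
    return []
```

Passing `fetch` in as a parameter keeps the retry logic independent of any HTTP client and makes it easy to exercise with canned responses.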

BrightDV commented 1 year ago

Added in the latest release (v0.4.0).