YGGverse / YGGo

YGGo! Distributed Web Search Engine
MIT License

I think the search results should be divided into categories somehow... #1

Open ygguser opened 1 year ago

ygguser commented 1 year ago

I think it would make sense to add something like this panel:

[screenshot 1: proposed category panel]

If I want to find sites, then such search results don't make much sense to me:

[screenshot 2: current search results]

d47081 commented 1 year ago
  1. Yes, image scanning is basically implemented, but the implementation requires performance optimization, so I've switched my attention to the text index only and want to test it on huge data volumes.
  2. In this case, I'm just learning FTS5; I had thoughts of a Sphinx server implementation, but it does not support SQLite yet. I need to improve relevancy and do some optimization of the URL filter, as it currently skips only anchors and grabs everything else, including media and download links...

Anyway, thanks. I'm still testing a YaCy node on another server, but this solution seems to take fewer disk and RAM resources. Even so, I have just 10 GB for indexes on the VPS, and that's not enough for even 10M pages, because the crawler collects the raw (HTML-stripped) page text, not only the metadata (for this case I added an optional meta-only mode).

There is also a lot of work to do on content semantics. YaCy has the same issues: it's not possible to find anything really relevant, because of crawled navigation containers and other spam words.

ygguser commented 1 year ago

It seems to me that SQLite is not very suitable for storing this search engine's database. We have several mirrors of popular websites from the Internet (rutracker.org, for example); that is a very large volume of pages to index...

d47081 commented 1 year ago

Agreed. I don't know why I used SQLite instead of, say, MySQL; maybe the goal was simple deployment without server and database setup... Anyway, SQLite has FTS5, which allows full-text relevance search without external dependencies. Maybe PostgreSQL would work too, but I'm not familiar with it.
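
As a quick illustration of the FTS5 point, here is a minimal sketch using Python's built-in sqlite3 module (assuming the bundled SQLite was compiled with FTS5, which is the default in most modern builds; the table and column names are hypothetical, not YGGo's actual schema):

```python
import sqlite3

# In-memory database for the demo; the engine would use a file instead.
db = sqlite3.connect(":memory:")

# FTS5 virtual table: a full-text index over title and body, no external server.
db.execute("CREATE VIRTUAL TABLE page_index USING fts5(title, body)")
db.executemany(
    "INSERT INTO page_index (title, body) VALUES (?, ?)",
    [
        ("Yggdrasil overview", "an encrypted IPv6 mesh network"),
        ("Crawler notes", "collecting raw page text for the index"),
    ],
)

# MATCH runs the full-text query; bm25() provides a built-in relevance score.
rows = db.execute(
    "SELECT title FROM page_index WHERE page_index MATCH ? ORDER BY bm25(page_index)",
    ("network",),
).fetchall()
print(rows)  # [('Yggdrasil overview',)]
```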

By the way, the whole data model is implemented in a single driver file, so it's no problem to implement an alternative one and add a new settings row to the config file.

About web mirrors like rutracker: I think we need some blacklist in the configuration, because what is the sense of indexing a resource whose seeds/peers are tied to the clearnet?

P.S. At 2.5M links we have 2 GB of disk space usage, and most of them are not indexed yet. So maybe I'll stop the crawler soon, because we need a server with enough disk space. In that case, maybe SQLite is better, as we could host the database file on a separate static host; I don't know whether that is possible, but I suppose so.

d47081 commented 1 year ago

About the YaCy experiments as the alternative, I have started a few topics here: https://community.searchlab.eu/t/how-to-apply-regular-expression-to-scan-whitelist-domains-only/1405 https://community.searchlab.eu/t/how-to-configure-external-links-autocrawl-autofollow/1407

But I see that solution is not for a small VPS: running the Yggdrasil web directory scan takes 4 cores + 4 GB RAM, and the index takes 1.45 GB for 10k documents.

Maybe I will continue improving this engine, because it's much more lightweight.

ygguser commented 1 year ago
  1. Having a search engine in Yggdrasil would be useful and convenient. Such a system is needed.

  2. Of course, the simpler, the lighter this system is (not to the detriment of usability) the better. The hardware issue is a critical issue in this case. A search engine will require computing power and disk space much larger than some simple website. Hosting such a system will require material costs. It's worth considering. And perhaps it would be a good idea to initially design such a system for a decentralized network with the possibility of decentralized (and distributed) database storage. It is clear that this significantly complicates the design of the system... It's just something that's at least worth thinking about... In any case, the issue of choosing a DBMS is a serious issue when designing such a system...

  3. In my opinion, blacklists are undesirable for resources that are accessible via Yggdrasil (it doesn't matter if they are proxied or not). Because it is convenient to type in a search engine, for example, "Interstellar" and immediately get a link to a page with a description of the movie, directly in Yggdrasil. Well, or you can come up with some kind of system that will redirect search queries to the original server on the Internet and return the result, replacing the host in it. In addition, we have not only "proxied mirrors", but also full-fledged copies of databases and files of other popular resources (for example, flibusta). And it would be wrong to ignore these resources in the search index.

I am a little familiar with YaCy and I know that it is quite heavy and consumes a lot of resources...

Taking into account the above, I think it's worth abandoning SQLite in favor of at least MySQL, whose database can also be stored on a separate host. MySQL also supports full-text search.

And of course, this project is worth developing, especially if you are interested in it and you think it would be useful :) However, it is also worth considering that you are unlikely to receive material benefits from this in the foreseeable future. Rather, only losses. Although, if one day this system surpasses other popular analogues and there are sponsors / investors ... :)

P.S.: All of the above is the personal opinion of the user, not an expert in the development of search engines.

d47081 commented 1 year ago

> design such a system for a decentralized network with the possibility of decentralized (and distributed) database storage

IMHO, distributed solutions usually require more resources, but the idea of sharing disk storage via an API is cool (like the Mastodon example; it could be called a federative model). I have the same thoughts in the Roadmap draft presented in README.md. At least we need to understand how many people would be able to participate in this thing (in the ygg community context) before spending time on implementation, but I vote yes too.

> MySQL also supports full-text search.

And it also supports replication. Yes, it's clear I should rewrite it; SQLite is better suited for desktop/mobile apps, so maybe the concept was my mistake. Then again, the current implementation took just about 24 hours :) even though the idea had been bothering me for a long time before the YaCy node went down.

> However, it is also worth considering that you are unlikely to receive material benefits from this in the foreseeable future.

I do this project just for fun in my free time. Of course some donations could motivate me, but that is not my goal.

For right now I'm thinking about the VPS server issue; I'm not sure my 10 GB one is enough for these ambitions. Maybe run it from home (as the node is accessible behind NAT), or maybe run it on a desktop with low uptime but sometimes available (the distributed model again).

> P.S.: All of the above is the personal opinion of the user, not an expert in the development of search engines.

Thank you for the interest in this project, it is <3 for me. I'm not a search engine expert either; something just moves me to explore the alt web.

ygguser commented 1 year ago

Well, since the project is just for fun, there is no need to hurry. And it's definitely not worth burning out. Just for fun means you have to have fun :) In the beginning, you can really give up indexing proxied sites and buying a dedicated server :) Start small...

BTW, you definitely need to provide protection from web crawler traps. I'm sure there will be people who will use them in Yggdrasil just for fun. I have already encountered them in Yggdrasil...

It is quite possible to start using your desktop PC or an inexpensive VPS. So over time, gradually (perhaps by deleting sites from the blacklist), you will be able to calculate how much disk space you need... In the meantime, you can focus on all sorts of features and usability.

Then you can try to advertise this service more actively in Yggdrasil, in Matrix communities, IRC etc. And, perhaps, there will be people who will be ready to participate in the project, provide their computing power or something similar...

I think it is better not to crawl these sites at the start:

- http://[301:f69c:2017:b6b8::1]/
- http://[301:3559:2828:2843::1]/
- http://[321:c99a:91a1:cd2c::7]/
- http://[301:3559:2828:2843::4]/
- http://[321:c99a:91a1:cd2c::18]/
- http://[200:1684:3286:e6d8:29e6:4d0c:2428:c2eb]/
- http://[300:dada:feda:f443::3]/
- http://[301:f886:4ddc:dfdd::1]/
- http://[301:f69c:2017:b6b8::5]/
- http://[321:c99a:91a1:cd2c::16]/
- http://[200:ec3c:c9d8:b529:14f5:6f82:192e:3d76]/
- http://[301:f69c:2017:b6b8::9]/
- http://[30a:c3d2:8cf8:f8e5:e71e:8c63:1a03:2cbc]/
- http://[301:f69c:2017:b6b8::8]/
- http://[301:f69c:2017:b6b8::6]/

d47081 commented 1 year ago

> I think it is better not to crawl these sites at the start

and most of them are shortened ;)

ygguser commented 1 year ago

Do you mean addresses from the 300::/64 subnets? It's just that a proxied site currently does not consume a lot of resources, and it can be run on an additional IP on the same hardware where some other service or site is located...

d47081 commented 1 year ago

> MySQL also supports full-text search.

wait a minute please

d47081 commented 1 year ago

When I get some rest in the mental hospital, I'll learn the IPv6 protocol there.

But I still can't write a regular expression for net filtering like 200::/7 or 300::/64 yet.

https://forum.yggdrasil.link/index.php?topic=138.msg243#msg243

ygguser commented 1 year ago

You can try to do something like this:

// Accept only valid IPv6 addresses whose first hextet starts with 2xx or 3xx,
// i.e. a rough match for the Yggdrasil 200::/7 range:
if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false
    && preg_match('/^0?[2-3][a-f0-9]{0,2}:/', $ip)) {
    $Ygg_addr_OK = true;
}
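
For comparison, the same membership test can be written without a regex. A sketch using Python's stdlib ipaddress module (the function name is mine, and YGGo itself is PHP; this only illustrates the 200::/7 check):

```python
import ipaddress

# Yggdrasil allocates its addresses from 200::/7.
YGG_NET = ipaddress.IPv6Network("200::/7")

def in_yggdrasil_range(addr: str) -> bool:
    """Return True if addr is a valid IPv6 address inside 200::/7."""
    try:
        ip = ipaddress.IPv6Address(addr)
    except ValueError:
        return False
    return ip in YGG_NET

print(in_yggdrasil_range("301:f69c:2017:b6b8::1"))  # True
print(in_yggdrasil_range("2001:db8::1"))            # False
```

Unlike the first-hextet regex, this rejects edge cases such as `2::`, which starts with a "2" but expands to `0002::`, outside 200::/7.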
ygguser commented 1 year ago

And the forum, unfortunately, is almost dead...

I recommend these channels:

- https://matrix.to/#/#yggdrasil:matrix.org (EN)
- https://matrix.to/#/#yggdrasilRu:matrix.org (RU)
- https://app.element.io/#/room/#yggru:matrix.org (RU)
- https://t.me/Yggdrasil_ru (RU)
- http://[324:71e:281a:9ed3::41]/web/ (#howtoygg (RU))
- https://matrix.to/#/#howto.ygg:matrix.org (RU, read-only (useful notifications))

d47081 commented 1 year ago

Sad to hear. The Alfis one?

d47081 commented 1 year ago

It returns https://forum.yggdrasil.link/index.php?action=profile;u=1

ygguser commented 1 year ago

Alfis is alive and quite popular. This is its developer: https://t.me/Revertron (RU, EN)

d47081 commented 1 year ago

> http://[324:71e:281a:9ed3::41]/web/

:D

ygguser commented 1 year ago

IRC web-front-end )

ygguser commented 1 year ago

Btw, I don't mind cleaning this thread from spam )

d47081 commented 1 year ago

By the way, I can't even wait a few minutes;

plus, it was my mistake to duplicate hostnames in the URL rows.

We need a hostname table plus a URI one to prevent the data from expanding, in the #2 context. Thanks.

https://github.com/YGGverse/YGGo/tree/sqliteway
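
The hostname/URI split could look roughly like this. A sketch of the normalized schema using Python's sqlite3 (table and column names are illustrative, not the actual YGGo schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Each hostname is stored exactly once...
    CREATE TABLE host (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- ...and every crawled page references its host by id,
    -- so the hostname string is no longer repeated per URL.
    CREATE TABLE page (
        id      INTEGER PRIMARY KEY,
        host_id INTEGER NOT NULL REFERENCES host(id),
        uri     TEXT NOT NULL,
        UNIQUE (host_id, uri)
    );
""")

host_id = db.execute(
    "INSERT INTO host (name) VALUES (?)", ("[301:f69c:2017:b6b8::1]",)
).lastrowid
db.executemany(
    "INSERT INTO page (host_id, uri) VALUES (?, ?)",
    [(host_id, "/"), (host_id, "/about"), (host_id, "/search?q=ygg")],
)

# Full URLs are reassembled with a join when needed.
urls = [
    name + uri
    for name, uri in db.execute(
        "SELECT host.name, page.uri FROM page JOIN host ON host.id = page.host_id"
    )
]
print(len(urls))  # 3
```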

d47081 commented 1 year ago

Plus, some thoughts from last night about page rank columns for the host table, etc. :)

d47081 commented 1 year ago

> I think it would make sense to add something like this panel

We have some updates per #3. Now I have an idea to add semantic markers to the README, or to another separate file, aka yggo.txt, where the owner can declare the website's rubric.

Of course, images and videos require another interface, but for right now my idea is just to add a thematic websites tab at the top, beside the media search interface.
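
A yggo.txt along those lines might look like the following. This is entirely speculative; no such format is specified yet, and every field name here is hypothetical:

```
# yggo.txt — site metadata for the YGGo crawler (hypothetical format)
rubric: software
description: My cool site
keywords: blog, programming, linux
```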

ygguser commented 1 year ago

> We have some updates by #3, now I have an idea to make semantic markers to the readme or another separated file, aka yggo.txt where owner can provide the website rubric.

Meta attributes can be inserted into HTML pages:

<META name="description" content="My cool site">
<META name="keywords" content="blog, programming, linux">

This can also be used to sort sites by category.
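
A crawler-side sketch of that idea, pulling those attributes out with Python's stdlib html.parser (YGGo itself is PHP; this only illustrates the extraction step, and the class name is mine):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META ...> matches too.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

html = """
<html><head>
<META name="description" content="My cool site">
<META name="keywords" content="blog, programming, linux">
</head><body>...</body></html>
"""

parser = MetaExtractor()
parser.feed(html)

# Split the keywords into candidate categories for the panel.
categories = [k.strip() for k in parser.meta.get("keywords", "").split(",")]
print(categories)  # ['blog', 'programming', 'linux']
```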

d47081 commented 1 year ago

Hm, thanks, but we have a limited tray area...

I was supposing something like ygg.txt / user-agent: ygg*,

where we can provide super-live results...

P.S. I don't want to deal with ChatGPT API experiments, because earlier I tried to simulate isotopes before understanding that MIT has a supercomputer

:)

d47081 commented 1 year ago

I need to add that we have about 92 hosts crawled in the network right now (by the new model from #3); re-indexing is beginning.

I'm playing with CRAWL_HOST_DEFAULT_PAGES_LIMIT by increasing its value to 1k; maybe more hosts will become available there.

I just keep in mind that it could be awesome to add some extra semantic rules to our screwed-up project, where ygg people are able to build a new paradoxical web :p