alphadevx / alpha

Full-stack MVC framework for PHP.
http://www.alphaframework.org/
BSD 3-Clause "New" or "Revised" License

Add a web search indexer #387

Open alphadevx opened 6 months ago

alphadevx commented 6 months ago

The aim is to add the following components:

  1. A web crawler: I suggest using https://packagist.org/packages/spatie/crawler or https://packagist.org/packages/crwlr/crawler
  2. A persistence layer to record the URLs that have already been indexed.
  3. A job to write the results to Solr via https://packagist.org/packages/solarium/solarium
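
As a rough illustration of how the three components could fit together, here is a minimal, language-neutral sketch in Python (the real implementation would be PHP inside the framework). Every name in it (`SeenUrlStore`, `index_pages`, `write_to_solr`) is hypothetical, the in-memory set merely stands in for the persistence layer, and the callback stands in for the Solarium-backed job; none of this is the API of the crawler packages or of Solarium.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


class SeenUrlStore:
    """Hypothetical persistence layer; an in-memory set stands in for a database table."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_indexed(self, url: str) -> bool:
        return url in self._seen

    def mark_indexed(self, url: str) -> None:
        self._seen.add(url)


def index_pages(
    pages: Iterable[Tuple[str, str]],
    store: SeenUrlStore,
    write_to_solr: Callable[[Dict[str, str]], None],
) -> int:
    """Send each not-yet-indexed (url, content) pair to Solr and record its URL.

    `pages` stands in for the crawler's output and `write_to_solr` for the
    indexing job; both are assumptions made for this sketch.
    Returns the number of documents written.
    """
    written = 0
    for url, content in pages:
        if store.is_indexed(url):
            continue  # already indexed on this or an earlier run
        write_to_solr({"id": url, "url": url, "content": content})
        # Mark the URL only after the write succeeded, so a failed write is retried.
        store.mark_indexed(url)
        written += 1
    return written
```

Marking a URL only after the write returns means a failed Solr write is retried on the next crawl instead of being silently skipped.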

alphadevx commented 6 months ago

In a Solr index built using Nutch, each document in the index has these fields:

      {
        "tstamp":["2023-04-29T18:43:39.767Z"],
        "digest":["4a297ca583c890a577be6b8f50b3e6f1"],
        "host":["alphaframework.org"],
        "boost":[1.9125815E-5],
        "id":"https://www.alphaframework.org/a/Documentation",
        "title":["Alpha Framework (4.0.0)"],
        "lang":["en"],
        "url":["https://www.alphaframework.org/a/Documentation"],
        "content":["...."],
        "_version_":1764539126025027584}
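
To make the field mapping concrete, here is a hedged sketch in Python (again only illustrative, not the framework's PHP) of how a crawled page could be turned into a document shaped like the Nutch sample. The function name `build_document` and its parameters are hypothetical. The digest is assumed to be an MD5 hex string, which matches the 32-character shape of the sample, but the exact bytes Nutch hashes may differ. `boost` is left out, and `_version_` is omitted because Solr assigns it.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List, Optional, Union
from urllib.parse import urlsplit


def build_document(
    url: str,
    title: str,
    content: str,
    lang: str = "en",
    fetched_at: Optional[datetime] = None,
) -> Dict[str, Union[str, List[str]]]:
    """Build a Solr document resembling the Nutch-generated sample.

    `fetched_at` should be a timezone-aware datetime; it defaults to now (UTC).
    """
    when = (fetched_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # ISO 8601 with millisecond precision and a trailing Z, as in the sample.
    tstamp = when.strftime("%Y-%m-%dT%H:%M:%S") + f".{when.microsecond // 1000:03d}Z"
    # Assumption: MD5 over the UTF-8 text; Nutch's real input bytes may differ.
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    # The sample shows a host without "www."; how Nutch derived it is not
    # stated, so this sketch uses the URL's hostname unchanged.
    host = urlsplit(url).hostname or ""
    return {
        "id": url,
        "url": [url],
        "host": [host],
        "title": [title],
        "lang": [lang],
        "content": [content],
        "digest": [digest],
        "tstamp": [tstamp],
    }
```

The digest is useful beyond Solr: comparing it with the value from the previous crawl shows whether a page's content has changed and needs re-indexing.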

Based on that schema, I will add a new active record, IndexedPage, with the following attributes:

I will not store the content itself (including the title), as that will be stored inside Solr.