forgeflux-org / starchart

Software forge spider
GNU Affero General Public License v3.0
13 stars, 2 forks

Auto-deleting repositories #6

Closed realaravinth closed 2 years ago

realaravinth commented 2 years ago

Repositories get moved, made private or simply deleted. The spidered data should be adapted to keep up with these changes.

We need to log each forge's last_crawl in a separate table and associate this value with starchart_repositories. We could then set a maximum threshold of accepted staleness and delete repositories that exceed it.

For instance, if repoA is crawled at c=1, that forge instance gets crawled three more times, and repoA hasn't seen any updates (data about it wasn't returned by the forge), then it can be deleted if max_acceptance = 3.
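
A minimal sketch of the staleness rule described above, assuming the forge keeps a crawl counter and each repository stores the counter value at which the forge last returned it. The names `Repo`, `last_seen_crawl` and `stale_repos` are hypothetical and not part of Starchart; the `>=` comparison is chosen to match the repoA example.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    name: str
    # Value of the forge's crawl counter when the forge last returned this repo
    # (hypothetical field, standing in for the proposed last_crawl association).
    last_seen_crawl: int


def stale_repos(repos, forge_last_crawl, max_acceptance):
    """Return repos the forge has not reported for at least max_acceptance crawls."""
    return [r for r in repos if forge_last_crawl - r.last_seen_crawl >= max_acceptance]


repos = [Repo("repoA", last_seen_crawl=1), Repo("repoB", last_seen_crawl=4)]

# The forge was crawled three more times after c=1, so its counter is now 4.
to_delete = stale_repos(repos, forge_last_crawl=4, max_acceptance=3)
print([r.name for r in to_delete])  # prints ['repoA']
```

Counting missed crawls rather than elapsed time keeps the rule independent of how often a given forge is crawled.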

MoralCode commented 2 years ago

i wonder if instead you could simply store the timestamp of the last crawl, then have the ranking algorithm just slowly start de-prioritizing repos that haven't seen a crawl result in a while, essentially making them drop off the first page of results. it might cost more storage but it could also be useful for trend analysis, like if all repos from a particular host stop getting crawl results at the same time, that may mean a server went down or something (in which case you may not want to penalize them)

I also imagine different repositories need different crawl frequencies, sort of like what Google does with Googlebot, where high-traffic, frequently updated sites like news sites get crawled way more often than some static portfolio page that doesn't change as often
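
The timestamp-based de-prioritization suggested above could look roughly like this. It is only a sketch: the exponential half-life decay, the `base_score` input and the function names are assumptions made for illustration, not anything Starchart implements.

```python
from datetime import datetime, timedelta, timezone


def freshness_weight(last_seen, now, half_life_days=30.0):
    """Exponential decay: 1.0 when just seen, 0.5 after one half-life."""
    age_days = max((now - last_seen).total_seconds(), 0.0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


def rank(repos, now, half_life_days=30.0):
    """Sort (name, base_score, last_seen) tuples by decayed score, best first."""
    return sorted(
        repos,
        key=lambda r: r[1] * freshness_weight(r[2], now, half_life_days),
        reverse=True,
    )


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
repos = [
    ("stale", 10.0, now - timedelta(days=90)),  # strong score, not seen for 90 days
    ("fresh", 5.0, now - timedelta(days=1)),    # weaker score, seen yesterday
]
print([name for name, _, _ in rank(repos, now)])  # prints ['fresh', 'stale']
```

Nothing is deleted here; a repository that stops appearing in crawl results just sinks in the ordering, and it recovers as soon as a crawl returns it again.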

realaravinth commented 2 years ago

i wonder if instead you could simply store the timestamp of the last crawl, then have the ranking algorithm just slowly start de-prioritizing repos that haven't seen a crawl result in a while, essentially making them drop off the first page of results

Good idea! I'm already storing a last crawl timestamp, but I haven't thought about using a ranking algorithm.

I was going to write simple filters that show newly added forges, recently crawled repositories and newly added repositories, because these are simple metrics that are fully in control of Starchart and can't be manipulated by external parties (like repository/forge owners).

But are there use cases where retaining moved/deleted repository metadata is useful?

I also imagine different repositories need different crawl frequencies

I'm implementing a mechanism to allow Forge admins to set the crawl frequencies, so boosting frequency might not be possible. Starchart will federate soon (the primitives already exist, but the pipeline is yet to be created). If there are multiple Starchart instances federating with each other, the combined crawling activity will closely resemble a DDoS attack. So it is important that the Forge admin has control over which Starchart instances can crawl their forge and over the rate of crawling.

Besides, the data collected by Starchart will not change often, so we could make do with periodic, same-interval-for-every-repository-on-the-forge crawls.
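
As a rough illustration of the admin-controlled crawling described above, a crawler could check a per-forge policy before each crawl. `CrawlPolicy`, its fields and `may_crawl` are hypothetical names used only for this sketch; how a forge would actually publish such a policy is not specified in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlPolicy:
    """Hypothetical per-forge policy set by the forge admin."""
    min_interval_secs: int
    allowed_instances: set = field(default_factory=set)


def may_crawl(policy, instance, last_crawl_ts, now_ts):
    """True if this Starchart instance is allowed and the interval has elapsed."""
    if instance not in policy.allowed_instances:
        return False
    if last_crawl_ts is None:
        return True  # this instance has never crawled the forge
    return now_ts - last_crawl_ts >= policy.min_interval_secs


policy = CrawlPolicy(
    min_interval_secs=86400,  # at most one crawl per day
    allowed_instances={"starchart.example.org"},
)

print(may_crawl(policy, "starchart.example.org", None, 0))    # prints True
print(may_crawl(policy, "starchart.example.org", 0, 3600))    # prints False
print(may_crawl(policy, "other.example.net", None, 0))        # prints False
```

A single interval per forge matches the same-interval-for-every-repository approach: the check never looks at individual repositories.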

MoralCode commented 2 years ago

when you say "starchart will federate soon", do you mean that starchart itself will be federated? or that it will be able to start crawling federated forges (i.e. forges that implement forgefed)?

MoralCode commented 2 years ago

But are there use cases where retaining moved/deleted repository metadata is useful?

I can't think of any compelling ones right now (maybe repository age is a factor in the ranking system?), but it might be useful to have it be configurable so that if an instance/fork of starchart wants to use it for something cool, then the ability is there

realaravinth commented 2 years ago

when you say "starchart will federate soon", do you mean that starchart itself will be federated? or that it will be able to start crawling federated forges (i.e. forges that implement forgefed)?

Starchart itself will federate. Bootstrapping a crawler from scratch is a lot of work; federating will allow Starchart instances to benefit from each other's work. I honestly can't picture why someone would need a personal Starchart instance, but Starchart will require mechanisms to authenticate data sources and to bootstrap off of another instance. Implementing them both will get us federation for free, as federation w.r.t. Starchart is just bootstrapping continuously.

I can't think of any compelling ones right now (maybe repository age is a factor in the ranking system?), but it might be useful to have it be configurable so that if an instance/fork of starchart wants to use it for something cool, then the ability is there

I'm sorry, I don't follow: are you saying having an optional ranking system is useful or optionally retaining moved and deleted repositories is useful?

MoralCode commented 2 years ago

I'm sorry, I don't follow: are you saying having an optional ranking system is useful or optionally retaining moved and deleted repositories is useful?

this was mainly referring to moved repositories. If "repository age" is something that starchart wants to make available for search engines to use for ranking, it might make sense to maintain this number even if a repo is moved. That's probably not a great use case but it's something I came up with relatively quickly

realaravinth commented 2 years ago

Apologies for the delayed response, I was taking some time off :)

"repository age" is something that starchart wants to make available for search engines to use for ranking, it might make sense to maintain this number even if a repo is moved.

Interesting use case!

I found another use case: there are ongoing discussions on the ForgeFed repository to add a “mirrors” property to repositories. If added, then Starchart could generate a graph of mirrors, which can help folks find a repository's alternate sources should the main repository disappear one day.
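
To make the mirror-graph idea concrete, here is a small sketch. It assumes each repository record carries a list of mirror URLs; since the “mirrors” property is still under discussion, the data shape, the example URLs and the function names are all hypothetical.

```python
from collections import defaultdict


def build_mirror_graph(repos):
    """Build an undirected adjacency map from {repo_url: [mirror_urls]}."""
    graph = defaultdict(set)
    for url, mirrors in repos.items():
        graph[url]  # make sure repos without mirrors still appear as nodes
        for mirror in mirrors:
            graph[url].add(mirror)
            graph[mirror].add(url)
    return graph


def alternate_sources(graph, url):
    """Return every repository reachable from url through mirror links."""
    seen = {url}
    stack = [url]
    while stack:
        current = stack.pop()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    seen.discard(url)
    return sorted(seen)


# Hypothetical crawl data: A lists B as a mirror, B lists C.
repos = {
    "https://forge-a.example/alice/tool": ["https://forge-b.example/alice/tool"],
    "https://forge-b.example/alice/tool": ["https://forge-c.example/mirror/tool"],
}
graph = build_mirror_graph(repos)
print(alternate_sources(graph, "https://forge-a.example/alice/tool"))
```

The lookup only works if metadata for a vanished repository is still in the graph.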

So deleting repositories is probably not a good idea, closing issue :)