chaoss / augur

Python library and web service for Open Source Software Health and Sustainability metrics & data collection. You can find our documentation and new contributor information easily here: https://oss-augur.readthedocs.io/en/main/ and learn more about Augur at our website https://augurlabs.io
MIT License

Ability to stop indexing a repo #2923

Open GregSutcliffe opened 1 month ago

GregSutcliffe commented 1 month ago

Is your feature request related to a problem? If so, please describe the problem: I have a specific but changing list of repos I need to track data on, and a limited number of API keys. Accordingly, I would like to be able to stop indexing repos in some manner.

I've had a look through the docs, and I can't see anything in the CLI or UI that allows me to do this, or to delete the data for a repo that may no longer be relevant to me.

Potential solutions: The obvious solution would be to add a command to the augur db command group, since that already has commands for add-repos and for listing groups. A UI solution would be nice but is not essential, I think.

Additional context: This is actually relevant even for a day-1 fresh install, as Augur starts up with a pre-seeded set of repos to index. While this makes sense to give it something to operate on, at some point in time the user is likely to want to stop indexing those example repos.

sgoggins commented 3 weeks ago

Hi @GregSutcliffe: This is the logic for deleting repositories. It's a somewhat dangerous operation, so I have never automated it. A better strategy would be to add logic that removes a repository from collection circulation, but there are also reasons to delete one entirely. One case is when a repository has been added twice. That cannot occur anymore, but it could occur through a confluence of events in the past: the repo moved, but was added at its new location before the existing move logic had executed. This loophole is closed (we think).


ALTER TABLE "augur_data"."pull_request_message_ref" 
  DROP CONSTRAINT "fk_pull_request_message_ref_message_1",
  ADD CONSTRAINT "fk_pull_request_message_ref_message_1" FOREIGN KEY ("msg_id") REFERENCES "augur_data"."message" ("msg_id") ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED;

select * from repo where repo_id in 
(
235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

select * from augur_operations.collection_status where repo_id in 
(235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from issue_message_ref where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_review_message_ref where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_message_ref  where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

COMMIT; 

delete from repo_info where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from augur_operations.collection_status where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from augur_operations.user_repos where repo_id in 
(235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from issue_assignees where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from releases where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_reviews where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_files where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_commits where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_requests where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo_badging where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from issues  where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo_deps_libyear  where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo_deps_scorecard  where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo_dependencies where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from augur_operations.collection_status where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

commit; 
delete from commits where repo_id in  (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo_labor where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

select * from pull_request_message_ref  where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from pull_request_message_ref where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

COMMIT; 

ALTER TABLE "augur_data"."pull_request_review_message_ref" 
  DROP CONSTRAINT "fk_pull_request_review_message_ref_message_1",
  ADD CONSTRAINT "fk_pull_request_review_message_ref_message_1" FOREIGN KEY ("msg_id") REFERENCES "augur_data"."message" ("msg_id") ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED;

delete from message where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697) ; 

commit;                                                            

delete from pull_request_review_message_ref where repo_id in 
 (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

delete from repo where repo_id in (235219,196224,195980,196007,195987,196185,196178,196186,196099,196110,196111,196165,196101,196100,196102,196103,196104,196105,196107,196108,196109,196145,196147,194697); 

COMMIT; 
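
The script above repeats the same ID list in every statement and commits part-way through, so a failure can leave a repo half deleted. A minimal sketch of the same idea follows, with the ID list passed once and every delete inside a single transaction. It is an illustration only: it runs against an in-memory SQLite stand-in rather than Augur's PostgreSQL database, the table list is a reduced subset of the tables named above, and `delete_repos` and `make_demo_db` are hypothetical helpers, not Augur functions.

```python
import sqlite3

# Child tables first, parent `repo` last, following the order of the script above.
# Reduced, illustrative subset; this is NOT Augur's full schema.
DELETE_ORDER = [
    "pull_request_message_ref",
    "issue_message_ref",
    "pull_requests",
    "issues",
    "commits",
    "message",
    "repo",
]


def delete_repos(conn, repo_ids):
    """Delete every row belonging to repo_ids in one transaction.

    Returns a dict mapping table name to the number of rows deleted.
    """
    ids = list(repo_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    deleted = {}
    # `with conn` commits if the block succeeds and rolls back on an exception.
    with conn:
        for table in DELETE_ORDER:
            # Table names come from the constant above, never from user input.
            cur = conn.execute(
                f"DELETE FROM {table} WHERE repo_id IN ({placeholders})", ids
            )
            deleted[table] = cur.rowcount
    return deleted


def make_demo_db():
    """Build a tiny stand-in database (hypothetical, one column per table)."""
    conn = sqlite3.connect(":memory:")
    for table in DELETE_ORDER:
        conn.execute(f"CREATE TABLE {table} (repo_id INTEGER)")
    conn.executemany("INSERT INTO repo VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO commits VALUES (?)", [(1,), (1,), (2,), (3,)])
    conn.executemany("INSERT INTO issues VALUES (?)", [(2,), (3,)])
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(delete_repos(demo, [1, 2]))
```

Keeping everything in one transaction means an error in any statement leaves the database as it was, which matters for an operation this destructive.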

GregSutcliffe commented 3 weeks ago

@sgoggins 100% agree with the danger, and I agree with the idea of stopping the indexer as an intermediate step. Do you have notes for that as well? Or is it just removing it from the repo table?

cdolfi commented 3 weeks ago

I am a fan of the idea of stopping the indexer (with some way to note that it is stopped). There might be a scenario where you need to pause one or more repos because you don't have enough keys to support collecting them all.

GregSutcliffe commented 3 weeks ago

Good point @cdolfi. That's probably more important than deletion anyway - disk is cheap for storing old data, but keys are limited.

@sgoggins would you agree? Happy to work on this

sgoggins commented 3 weeks ago

@GregSutcliffe: There are two dimensions to leaving the old repository in place and simply no longer keeping it visible.

The first, which is what we are mostly discussing, is stopping collection for a repo we no longer have interest in. The second is whether this "deleted repo" should also be hidden from APIs and other front-end indicators. I presume yes on both, but I am checking our shared understanding.

For the display there are two considerations then as well:

Most of the cases on a shared instance would be, in all likelihood, "Delete from my list". This is certainly the easiest use case to implement because it actually does not change collection behavior at all.

Of course, we also know from our discussions that we do want a user with elevated privileges of some kind to be able to remove the repository from collection entirely, in order to preserve API key usage. Is this then a third case, where we want to keep the repository in a "visible" state but no longer collect on it?

Any of these conditions can, I think, be handled with "state bits" on the repository record itself (augur_data.repo) and, in the case of user-scoped repos only, on augur_operations.user_repos.
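
To make the three cases concrete, here is a rough sketch of how such state bits might be modeled. Everything in it is an assumption for illustration: the `RepoState` flag names, the `Repo` class, and both helper functions are hypothetical and do not exist in Augur, and a plain set of repo IDs stands in for a user's rows in augur_operations.user_repos.

```python
from dataclasses import dataclass
from enum import IntFlag


class RepoState(IntFlag):
    """Hypothetical state bits; these names are not part of Augur's schema."""

    COLLECT = 1  # the scheduler may spend API keys on this repo
    VISIBLE = 2  # the repo appears in APIs and front ends


@dataclass
class Repo:
    repo_id: int
    # New repos are collected and visible by default.
    state: RepoState = RepoState.COLLECT | RepoState.VISIBLE


def should_collect(repo):
    """True if the instance-wide COLLECT bit is set."""
    return bool(repo.state & RepoState.COLLECT)


def visible_to(repo, user_repo_ids):
    """True if the repo is visible instance-wide AND still in the user's list.

    user_repo_ids stands in for the user's user_repos rows.
    """
    return bool(repo.state & RepoState.VISIBLE) and repo.repo_id in user_repo_ids


if __name__ == "__main__":
    # Case 1, "delete from my list": only the user's list changes.
    shared = Repo(1)
    print(should_collect(shared), visible_to(shared, set()))

    # Case 2, removed from collection and hidden: both bits cleared.
    retired = Repo(2, RepoState(0))
    print(should_collect(retired), visible_to(retired, {2}))

    # Case 3, still visible but no longer collected.
    paused = Repo(3, RepoState.VISIBLE)
    print(should_collect(paused), visible_to(paused, {3}))
```

The point of separating the two bits is that case 1 never touches the repo record at all, while cases 2 and 3 differ only in whether `VISIBLE` stays set.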

GregSutcliffe commented 6 days ago

@sgoggins so I think...

So, let's focus on the latter - how would we do this? I assume removing it from the repo table would mean we can't show the old data, so do we need a new "collect? BOOL" state bit on the repo table?