As a Renku user, I'd like search results not to include projects that are very old or that were created only for trying things out or for testing purposes.
## The problem
The amount of data in our deployments (renkulab.io specifically) grows quite fast. There's no doubt that's a great thing; however, it has consequences. One of them is the general responsiveness of the API; the other is the quality of the data and the amount of noise. There are a good number of projects that are old and no longer even working (for instance, created with some very old versions of the Renku CLI), test projects created just to get a feel for what Renku is, and test projects created by us to verify certain features. All of them are discoverable through the search API, which makes it both slow and polluted with low-quality results.
## The solution
It seems doable to denoise our data by simply removing abandoned projects' data from the KG (while keeping them in GitLab so users can bring them back if needed).
## Acceptance criteria
- [ ] ~create a new table in the event-log DB to store information about removed projects (PR)~
  - [ ] ~create a new DB migration that creates a `removed_project` table defined as follows:~

    | `removed_project` |
    |-------------------------------------|
    | `project_id INT4 NOT NULL` |
    | `project_path VARCHAR NOT NULL` |
    | `removal_date TIMESTAMPTZ NOT NULL` |
    | `scope VARCHAR NOT NULL` |

    ~where the `scope` can be `KG` only for now~
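For illustration, the table above would correspond to DDL along these lines (a sketch only; the actual migration follows the event-log's migration tooling and naming conventions):

```sql
-- Sketch: removed_project table as specified above
CREATE TABLE IF NOT EXISTS removed_project(
  project_id   INT4        NOT NULL,
  project_path VARCHAR     NOT NULL,
  removal_date TIMESTAMPTZ NOT NULL,
  scope        VARCHAR     NOT NULL  -- 'KG' is the only scope for now
);
```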
- [ ] ~create a DB migration that adds a new `activated TIMESTAMPTZ NOT NULL` column to the `project` table (PR)~
  - [ ] ~initially, create the column as nullable~
  - [ ] ~populate it with the min `event_date` for each project~
  - [ ] ~make the column NOT NULL~
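The three-step pattern above (add nullable, backfill, tighten) could look roughly like this in SQL (a sketch, assuming the event-log schema's `project` and `event` tables are keyed by `project_id`):

```sql
-- 1. create the column as nullable first, so existing rows don't fail the constraint
ALTER TABLE project ADD COLUMN activated TIMESTAMPTZ;

-- 2. backfill it with the earliest event_date per project
UPDATE project p
SET    activated = (SELECT MIN(e.event_date)
                    FROM   event e
                    WHERE  e.project_id = p.project_id);

-- 3. once every row has a value, make the column NOT NULL
ALTER TABLE project ALTER COLUMN activated SET NOT NULL;
```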
- [ ] EL to handle a new `DELETE_PROJECT` event
  - [ ] EL to use the same or similar logic as `ProjectEventsToNew` in the case of deletion (including sending the `PROJECT_VIEWING_DELETION` event)
  - [ ] it may be sensible to move the logic from `ProjectEventsToNew` to this new event handler and send a `DELETE_PROJECT` event from `ProjectEventsToNew`
- [ ] EL to issue a `CLEAN_UP` event to TG
- [ ] find out what to do with all the Project Viewed events data that was created for all the projects during the V10 migration (`PROJECT_ACTIVATED` events)
  - [ ] either treat the dates as not set if they fall into the V10 migration period (which started at 27/3/2023 12:00)
  - [ ] or create another migration that explicitly changes the dates in the `ProjectViewedTime` graph by setting the latest event date (the `event_date` from the `event` table) for all the projects whose date in `ProjectViewedTime` falls into the V10 migration period (to be found from the `migrations` ds in the TS)
- [ ] TG to subscribe to and handle `CLEAN_UP_ABANDONED` events
  - [ ] TG to take one event at a time
  - [ ] TG to check, for the project specified in the event, whether:
    - [ ] the `ProjectViewedTime` registered for it is more than a year old
    - [ ] the project has no datasets, or its datasets are not imported into other projects
      - [ ] for datasets with `topmostSameAs` equal to their resource ids, check in the `Datasets` graph whether these `topmostSameAs` are linked only to this project
      - [ ] no checks need to be done for datasets with an external `topmostSameAs`
    - [ ] the project has no forks
  - [ ] if all the checks are positive, the process is to send a `DELETE_PROJECT` event to EL
- [ ] EL to issue `CLEAN_UP_ABANDONED` events to subscribed parties
  - [ ] an event of this type to be sent for each project found in the `project` table on a weekly basis
  - [ ] it should be a fire-and-forget type of communication, so no feedback is expected from the subscriber except the `accepted|busy` response during event delivery and the subscription renewal signalling that a subscriber is ready for new work
  - [ ] EL not to try re-sending the event for a project once it's been delivered to a subscriber and the period of 1 week since the last event hasn't elapsed yet (it looks like the `subscription_category_sync_time` table might be the place where this info is stored, similarly to what's done for the `MEMBER_SYNC` flow)
  - [ ] the event distributor should ensure an event is not lost when a subscriber responds with `busy`
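The delivery rules above (fire-and-forget, but never losing an event on `busy`) could be sketched like this; `send` and the response strings are assumptions modelled on the `accepted|busy` protocol described here, not the actual EL API:

```python
import time
from collections import deque

def distribute(events, send, backoff_seconds=1.0, max_attempts=5):
    """Deliver each event once; re-queue rather than drop it on 'busy'.

    `send(event)` is a hypothetical delivery call returning 'accepted'
    or 'busy', mirroring the protocol sketched above.
    """
    pending = deque(events)
    attempts: dict = {}
    while pending:
        event = pending.popleft()
        if send(event) == "accepted":
            continue                      # fire-and-forget: no further feedback expected
        attempts[event] = attempts.get(event, 0) + 1
        if attempts[event] < max_attempts:
            time.sleep(backoff_seconds)   # give the busy subscriber some breathing room
            pending.append(event)         # the event is not lost
```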