Riverscapes / riverscapes-tools

Open-source Python 3.0 tools for the Riverscapes organization
https://tools.riverscapes.net/
GNU General Public License v3.0
11 stars 11 forks source link

Delete Duplicate Exchange Projects #790

Open MattReimer opened 1 year ago

MattReimer commented 1 year ago

Let me preface this ticket by saying "BE VERY VERY CAREFUL!" Measure twice, cut once!

Sometimes through weird queueing mistakes we end up with projects that are duplicates of one another

A duplicate is defined as:

Example

https://warehouse.riverscapes.net/s?type=Project&geo=-91.9624074265721%2C30.40674608731412%2C8.766197736186141

You can use the new Cybercastor Sqlite3 dump script to pull all the project metadata from the warehouse into a bitesize package

Here's what ChatGPT gave me for finding duplicates while leaving the most current project out of the list

@philipbaileynar maybe you could check this carefully before we use it:

SELECT
    p.id,
    strftime('%Y-%m-%d %H:%M:%S', datetime(p.created_on / 1000, 'unixepoch')) AS createdDate,
    p.owner_by_name,
    p.project_type_id,
    m.value AS HUC,
    m2.value AS "Model Version",
    CASE
        WHEN (p.created_on, p.pid) IN (
            SELECT MAX(created_on), pid
            FROM riverscapes_projects
            GROUP BY owner_by_name, project_type_id, (
                SELECT value FROM riverscapes_project_meta
                WHERE project_id = riverscapes_projects.pid AND key = 'HUC'
            ),
            (
                SELECT value FROM riverscapes_project_meta
                WHERE project_id = riverscapes_projects.pid AND key = 'Model Version'
            )
        ) THEN 'KEEP'
        ELSE 'DELETE'
    END AS status
FROM riverscapes_projects p
JOIN riverscapes_project_meta m ON p.pid = m.project_id AND m.key = 'HUC'
JOIN riverscapes_project_meta m2 ON p.pid = m2.project_id AND m2.key = 'Model Version'
ORDER BY HUC, project_type_id, created_on DESC;

Concerns

Deleting projects graphql

mutation deleteProject_mutation(
  $projectId: ID!
  $options: EntityDeletionOptions
) {
  deleteProject(projectId: $projectId, options: $options) {
    error
    ids
    message
    success
  }
}

INPUTS

{
    "projectId": "a1f28458-7934-40c6-a1c9-bb781071e2bf",
    "options": {
        "totalDelete": true
    }
}

Sample SQL output

id createdDate owner_by_name project_type_id HUC Model Version status
1ca33d00-9d0d-4a51-9802-6cfba14a1c4d 2023-04-20 02:49:56 USU ETAL anthro 0401020110 0.0.2 KEEP
a21617fe-94ce-4672-9f13-68c33b4b789f 2023-04-17 21:32:36 USU ETAL channelarea 0401020110 1.3.1 KEEP
156d4d8a-5df1-4358-906a-3c5f04542a6d 2023-04-21 03:19:43 USU ETAL rcat 0401020110 3.0.1 KEEP
1959a3ae-6293-4cdb-8bde-4e175e1a2b74 2023-04-21 23:54:06 USU ETAL rs_metric_engine 0401020110 0.4.0 KEEP
8d3e2876-abe8-45b5-8277-84cbb68487c5 2023-04-17 21:32:02 USU ETAL rscontext 0401020110 1.4.1 KEEP
484e85bc-43d0-4e65-89f7-fdf913b6c076 2023-04-14 03:01:41 USU ETAL rscontext 0401020110 1.4.1 DELETE
b0d4d641-dcbe-457d-9d7b-a2607781e476 2023-04-14 02:32:54 USU ETAL rscontext 0401020110 1.4.1 DELETE
35fdae9d-084e-40aa-94b2-56f0627049cd 2023-04-17 21:39:47 USU ETAL taudem 0401020110 1.2.2 KEEP
e24c0eef-cb0d-432b-9e83-e3500089ab29 2023-05-10 01:02:17 USU ETAL vbet 0401020110 3.0.7 KEEP
24a370fb-4534-4f53-bde6-fbd5683ed11f 2023-05-03 04:31:13 USU ETAL vbet 0401020110 3.0.5 KEEP
176572da-7677-4f05-aece-bc1042d88fc7 2023-04-19 21:42:24 USU ETAL vbet 0401020110 3.0.2 KEEP
69fb05b3-69fe-4ac8-bd19-37b574b7941b 2023-04-19 02:29:32 USU ETAL vbet 0401020110 3.0.1 KEEP
ff06b05f-29b4-4031-8d20-6e4cdba0e0e4 2023-05-13 04:39:42 Riverscapes channelarea 0501000101 1.3.2 KEEP
22605032-32b8-438e-982d-5bc8e17bdd4b 2023-05-13 04:39:15 Riverscapes rscontext 0501000101 1.4.2 KEEP
ac452cc4-66d8-4598-a31a-9f3f66f7fb87 2023-05-13 04:40:42 Riverscapes taudem 0501000101 1.2.3 KEEP
bc3557f8-330a-4759-b9cb-670432274a7a 2023-05-13 13:32:45 Riverscapes channelarea 0501000102 1.3.2 KEEP
joewheaton commented 1 year ago

@MattReimer , many of our duplicates I think we need to delete don't meet all your criteria. Specifically they are :

That is because they represent us figuring shit out with our iteration on things like VBET, RCAT, etc. Everything else you are saying is relevant, but we want to initially populate the warehouse with one (not multiple) versions of our production grade tools. Once those BLM, NRCS deliverables are complete, then your definition of a duplicate is much more relevant.

MattReimer commented 1 year ago

Understood. These criteria can be loosened and tightened but the base query remains relevant.

The ones that have the same version are an easy delete though so that's why I started there.