downforacross / downforacross.com

Web frontend for downforacross.com -- continuation of stevenhao/crosswordsio
https://downforacrosscom.downforacross1.now.sh
MIT License
220 stars 92 forks source link

Prevent duplicate puzzles #276

Open dlakata opened 1 year ago

dlakata commented 1 year ago

For #139 and #15

You'll need to create the index for this to take effect. I chose 30000 as the minimum pid to avoid enforcing unique constraints on existing data, and it appears to be an ID that'll be reached in the next week or two.

duplicate_screenshot

I created another table locally called puzzles_test and verified that the index seems to be used:

# select count(*) from puzzles_test;
 count  
--------
 424649
(1 row)

# \d puzzles_test
                         Table "public.puzzles_test"
    Column    |            Type             | Collation | Nullable | Default 
--------------+-----------------------------+-----------+----------+---------
 uid          | text                        |           |          | 
 pid          | text                        |           | not null | 
 pid_numeric  | numeric                     |           |          | 
 is_public    | boolean                     |           |          | 
 uploaded_at  | timestamp without time zone |           |          | 
 times_solved | numeric                     |           |          | 0
 content      | jsonb                       |           |          | 
Indexes:
    "puzzles_test_pkey" PRIMARY KEY, btree (pid)
    "puzzle_test_pid_numeric_desc" btree (pid_numeric DESC NULLS LAST)
    "unique_puzzle_test_content" UNIQUE, btree (md5(content ->> 'grid'::text), md5((content -> 'clues'::text) ->> 'down'::text), md5((content -> 'clues'::text) ->> 'across'::text)) WHERE pid > '30000'::text AND is_public
Check constraints:
    "puzzles_test_times_solved_check" CHECK (times_solved >= 0::numeric)

# explain analyze SELECT
pid
FROM puzzles_test
WHERE pid > '5'
AND is_public
AND md5(content->>'grid'::text) = md5(('[["A","B"],["C","D"]]'::jsonb)::text)
AND md5(content->'clues'->>'down'::text) = md5(('{"down": ["z","y","x","w","v"]}'::jsonb)->>'down')
AND md5(content->'clues'->>'across'::text) = md5(('{"across": ["a","b","c","d","3"]}'::jsonb)->>'across');
-------------------
Index Scan using unique_puzzle_test_content on puzzles_test  (cost=0.42..8.44 rows=1 width=6) (actual time=0.021..0.022 rows=0 loops=1)
   Index Cond: ((md5((content ->> 'grid'::text)) = 'c56da16f5cfea23f45f4dad64b283dc4'::text) AND (md5(((content -> 'clues'::text) ->> 'down'::text)) = '55f362a61d73771fc88b5987a990c553'::text) AND (md5(((content -> 'clues'::text) ->> 'across'::text)) = '1b4d1966f41032490216c0f73ec9c4bf'::text))
   Filter: (pid > '5'::text)
 Planning Time: 0.280 ms
 Execution Time: 0.049 ms
(5 rows)
vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
downforacross.com ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 2, 2023 11:04pm
stevenhao commented 1 year ago

Hey thanks for helping out with this! A few comments:

  1. Could we make the constraint only apply to public puzzles? Unlisted puzzles should allow duplicates
  2. Also, do we need to materialize the md5 hashes in db (and add an index on it) as an optimization? I'm not too familiar w/ using md5 but this looks like an unindexed query that could get pretty slow...

Otherwise, the code change looks great!

dlakata commented 1 year ago

Could we make the constraint only apply to public puzzles? Unlisted puzzles should allow duplicates

I think the WHERE pid > '30000' AND is_public clause should handle this: https://www.postgresql.org/docs/current/indexes-partial.html

do we need to materialize the md5 hashes in db (and add an index on it) as an optimization? I'm not too familiar w/ using md5 but this looks like an unindexed query that could get pretty slow...

I think CREATE UNIQUE INDEX unique_puzzle_content should add the index, which the Index Scan using unique_puzzle_test_content test in the description used. Based on https://www.postgresql.org/docs/7.3/indexes-functional.html, I think we don't need to materialize, and postgres will store the index results. But I'm new to functional indexes, so you should verify my understanding!