calpaterson / csvbase

a simple website for sharing table data - with an API
https://csvbase.com
GNU Affero General Public License v3.0
376 stars 13 forks source link

Filter on user primary key in tables_for_user #124

Closed equirke closed 6 months ago

equirke commented 6 months ago

Hello, Only noticed this after merging the last PR.

I noticed we are filtering the tables by the user's uuid foreign key on the table but since we are already inner joining the user table I thought it would be better to filter on the user's primary key which would be indexed.

SQL Generated ``` SELECT metadata.tables.table_uuid AS metadata_tables_table_uuid, metadata.tables.user_uuid AS metadata_tables_user_uuid, metadata.tables.public AS metadata_tables_public, metadata.tables.created AS metadata_tables_created, metadata.tables.licence_id AS metadata_tables_licence_id, metadata.tables.table_name AS metadata_tables_table_name, metadata.tables.caption AS metadata_tables_caption, metadata.tables.last_changed AS metadata_tables_last_changed, metadata.tables.backend_id AS metadata_tables_backend_id, metadata.users.username AS metadata_users_username, metadata.github_follows.table_uuid AS metadata_github_follows_table_uuid, metadata.github_follows.last_sha AS metadata_github_follows_last_sha, metadata.github_follows.last_modified AS metadata_github_follows_last_modified, metadata.github_follows.https_repo_url AS metadata_github_follows_https_repo_url, metadata.github_follows.branch AS metadata_github_follows_branch, metadata.github_follows.path AS metadata_github_follows_path FROM metadata.tables JOIN metadata.users ON metadata.users.user_uuid = metadata.tables.user_uuid LEFT OUTER JOIN metadata.github_follows ON metadata.tables.table_uuid = metadata.github_follows.table_uuid WHERE metadata.users.user_uuid = %(user_uuid_1)s ORDER BY metadata.tables.created DESC ```

Thanks

(Good excuse to see if actions run on pull requests.)

equirke commented 6 months ago

It seems that the user_uuid on tables is indexed too because of the unique constraint so it doesn't make much of a difference.

equirke commented 6 months ago

it would have to scan the tables table to perform the join anyway brainfart moment, but we know actions will run on pull requests now

calpaterson commented 6 months ago

I've had similar confusions in the past :) Good to see that others have them too!

You are probably going to spot many subpar queries in the code at the moment though! eg in this one we pointlessly query again for the unique columns inside the for loop even though clearly that could have been done in the original query....the classic n+1.

We should sort that one out though in general my personal policy on queries has been that I have not heavily optimised them but I have I've tried to lay things out so that eventually optimisation will be possible.

The only code that has a performance problem in practice at the moment is the csv upload code. Occasionally that does time out. My plans there (in order):

  1. Adopt Python3.11 or above, which has lots of speedups
  2. Split that work over two page loads (has to be done anyway - see the git upload flow)
  3. Upgrade to psycopg3, it has some new copy stuff, which may speed it up (need SQLAlchemy v2 first)
  4. (Only if all else fails) Move it to a background worker. Keen to avoid this if possible as it's quite complicated to deploy

It is nice to know that github actions is working now though :)