
Materialized View Gives PrivilegedDatabase Error #8530

Closed: zubeyirgenc closed this issue 2 days ago

zubeyirgenc commented 2 days ago

I need to read the full result of the dolt_diff function, which returns about 3 million rows, and the joining and ordering operations on it take very long. I tried writing the whole result to a temporary table first and reading from that, but that also takes too long. Then I found Dolt's VIEW support, but I cannot benefit from this feature: whenever I try to query the created VIEW, it returns a mysql_db.PrivilegedDatabase error. I thought maybe indexing is simply not supported on views, but it does not let me read any part of the data at all; it always returns the same error.

Here is a minimal reproduction of the error:

CREATE VIEW view_name AS SELECT * FROM DOLT_DIFF('base_branch', 'compare_branch', 'table_name');
CALL DOLT_ADD('-A');
CALL DOLT_COMMIT('--skip-empty', '--author', 'Admin <admin@admin.com>', '-m', 'message');
SELECT COUNT(*) FROM view_name;

This is my setup:

  1. dolthub/dolt:1.43.11 via Docker Compose
  2. FastAPI-based Python application
  3. mysql-connector==2.2.9
  4. mysql-connector-python==8.4.0
  5. sqlalchemy==2.0.35
  6. sqlalchemy_utils==0.41.2

timsehn commented 2 days ago

We can fix the error and make views work with Dolt functions, but computing that diff is going to take a while. Diff speed scales with the size of the diff, not the size of the table, so big diffs are inherently slow to compute.

jycor commented 2 days ago

Hey @mgedigian, thanks for reporting this issue! We have a fix out for review for views over Dolt functions, and we'll keep you updated on the progress.

I'd expect a 3-million-row diff to take a long time. However, if you provide more info, we might be able to help: things like the schema of the table you're diffing, the number of commits, the number of rows, etc.

zubeyirgenc commented 2 days ago

Thanks for your response. I can't wait to use this fix, @timsehn.

In my setup, each branch has approximately 200k rows in the main table and at most 1.5k rows of relationship records in other tables. Each branch also contains an average of 10-15k commits. There is no problem at this scale, but after merging 15 branches into main, calculating the diff between t_0 and t_15 is painfully slow. Actually, calculating the diff on its own is fine, but the ORDER BY and GROUP BY operations drag on. @jycor
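
For concreteness, the diff I'm computing looks roughly like this (the hashes are placeholders; as far as I can tell, DOLT_DIFF accepts commit hashes as well as branch names):

SELECT COUNT(*) FROM DOLT_DIFF('<t_0_commit_hash>', '<t_15_commit_hash>', 'table_name');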

Yes, diffing 3 million rows will take a long time, and I don't mind waiting as long as it needs to. But since the server runs on a remote machine, I can't get healthy results through either the workbench or my app. In local tests, writing the result to a new table first reduces the read time, but in the remote scenario I keep hitting connection problems. My expectation from a MATERIALIZED VIEW was to be able to create indexes and thus avoid the locks in JOIN, ORDER BY, and GROUP BY operations; the sketch below shows roughly what I mean.
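
The manual workaround I had in mind is roughly the following sketch (diff_snapshot and to_id are illustrative names, and it assumes CREATE TABLE ... AS SELECT works over a table function):

-- Materialize the diff once into a real table (illustrative names).
CREATE TABLE diff_snapshot AS
SELECT * FROM DOLT_DIFF('base_branch', 'compare_branch', 'table_name');

-- Indexes can only be created on real tables, not views, so JOIN,
-- ORDER BY, and GROUP BY over the snapshot can use them.
CREATE INDEX idx_diff_to_id ON diff_snapshot (to_id);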

If this does not meet that expectation, I may consider getting rid of the JOIN and GROUP BY. But then another question comes to mind: how stable are reads with LIMIT and OFFSET if I don't use ORDER BY? Since I cannot read 3 million rows at once, I will have to paginate, so how can I be sure that each page returns different IDs? (Assume the diff result does not change and no new commits arrive.) One idea is sketched below.
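
My understanding is that without an ORDER BY, SQL gives no guarantee that LIMIT/OFFSET pages are disjoint, so I'm considering keyset pagination over the snapshot table instead (again assuming a unique, indexed to_id):

-- Keyset pagination sketch: order by an indexed key and seek past the
-- last value seen, rather than relying on a bare LIMIT/OFFSET.
SELECT * FROM diff_snapshot
WHERE to_id > :last_seen_id  -- bind the largest to_id from the previous page
ORDER BY to_id
LIMIT 10000;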