Segfault-Inc / Multicorn

Data Access Library
https://multicorn.org/
PostgreSQL License
700 stars 145 forks source link

Update method may be conceptually wrong #143

Open Bengreen opened 8 years ago

Bengreen commented 8 years ago

I am trying to write an update function.

When I read the documentation in the python code it tells me that the update function provides two argument oldvalues, and newvalues

def update(self, oldvalues, newvalues):

But when I inspect the oldvalues variable it does not hold the old values dict, it only holds the value of the rowid. I checked back into he C code and it appears to confirm this on line 762 of multicorn.c

It seems that the newvalues is not correct either as it does not contain the new values, but actually all the values that are non Null after reading the original values and adding the updates requested.

Whilst this works for my basic tests, I think it will cause some problems when run concurrently with other operations as it depends upon read/modify/write type operations which would need to be guaranteed to run consecutively to be correct (which i don't think can be guaranteed).

I see two options for fixing this.

  1. if the oldvalues is renamed to rowid and the newvalues is limited to ONLY the values to be updated then this would be correct semantics for an update function. I could then execute the update operations against the remote server as a single operation.
  2. If the oldvalues provides a dict of old values, not just rowid then i could calculate the diff and then propagate just the diff to the remote data store. BUT this is depends upon doing a read of the record to calculate the diff (outside of the remote system) so could be very large even for a small field change. SO this would not be an ideal option, but I think would be correct with respect to write operations applied.
Bengreen commented 8 years ago

My worry is that since you cannot guarantee to have row level locking on the remote data system then you can initiate a RMW operation (i.e. update does read, modifys then writes all columns, not just those changed) and after the initial READ you identify you MODIFY your record in memory and before the WRITE is completed another transaction occurs which also does a RMW operation in completion. BUT .... if the 2nd RMW is for the same columns that the 1st is changing then this is an application tier problem and bad programming practice as there should have been some form of locking here. HOWEVER.... if the 2nd RMW is for different columns than the 1st is changing then this is perfectly reasonable, but would break since the 1st update would overwrite the changes from the 2nd write.

e.g. initial state : record = {id: 1, val1: 1, val2: 2}

Update 1: UPDATE val1=3 where id =1 Update 2: UPDATE val2=4 where id=1

You would expect the result to be record {id:1, val1:3, val2=4}

BUT it if both reads occur before either writes complete then you can get either: record {id:1, val1:1, val2=4} record {id:1, val1:3, val2=2}

FROM my crude searching it appears that the value of p_value in the function multicorn.c/multicornExecForeignUpdate contains all of the fields of the record (i.e. a merge of the read and updated fields).

Bengreen commented 8 years ago

I have been working through this challenge and have come to the conclusion that it needs to be solved using 1 of three methods.

  1. take out and maintain a row lock on the FOREIGN server at the point in time when the record scan is performed during the update. then hold the lock for the duration of the RMW cycle. Release the row lock at the end. This could work well and potentially has support in the FDW API from PG.
  2. When doing the table scan also collect the columns that are used in the WHERE clause. When processing the UPDATE operation re-use the WHERE clause items to re-confirm the correct record is updated (i.e. re-check the where clause along with the primary key on the write operation).... BUT I am having some trouble convincing myself this is reasonable.
  3. perform the UPDATE function as a direct operation. There is provision for this within the FDW methods provided by PG. I don't think they have been translated to Multicorn yet. The PG method where this can ben done is : PlanDirectModify. This seems the most feasible since you are likely using a DB where this is possible if you are considering this type of issue.