Open wrongu opened 1 year ago
need to pull RIGHT BEFORE we push to minimize merge errors (esp with overleaf since it rewrites git history all the time, but also generally true of all git)...
I understand wanting to minimize merge errors, but that's also a race condition!
Imagine:
now the lines being edited in step 5 != the same lines that were triggered in step 2
basically we're rediscovering the old problem that "simultaneous editing is hard" and this is why OT was invented. And one of the core ideas of OT is that every edit needs to be relative to some timestamped/hashed version of the document. If you try to change the hash/timestamp of an edit before/after the edit is done, you're gonna have a bad time.
Right now we're using git merge as a hacky version of OT. git already does timestamping/versioning for us using commits. We just need to lean into that for now, and follow the rule that 1 edit = 1 commit
yep — failure mode here is that if you don't do the race-conditiony thing, your push fails and you throw away all your work anyway. if you do the second pull, you have a last-ditch effort for it to be valid (and will still fail if the doc has been updated during the AI round trip in a way that invalidates the edit).
I'm cool with either answer here provided it actually runs; the reason I have the duplicate pull right now is because it invariably failed without it due to overleaf's weird treatment of histories
On Thu, Jun 8, 2023 at 11:11 AM wrongu @.***> wrote:
I understand wanting to minimize merge errors, but that's also a race condition!
Imagine:
- pull
- trigger an edit
- check for debouncing --> ok! no edits in the last few moments
- simultaneously (LLM starts working) and (user deletes a line earlier in the paper)
- pull again and edit lines
now the lines being edited in step 5 != the same lines that were triggered in step 2
basically we're rediscovering the old problem that "simultaneous editing is hard" and this is why OT was invented. And one of the core ideas of OT is that every edit needs to be relative to some timestamped/hashed version of the document. If you try to change the hash/timestamp of an edit before/after the edit is done, you're gonna have a bad time.
Right now we're using git merge as a hacky version of OT. git already does timestamping/versioning for us using commits. We just need to lean into that for now, and follow the rule that 1 edit = 1 commit
— Reply to this email directly, view it on GitHub https://github.com/KordingLab/llm4papers/issues/19#issuecomment-1582777863, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFJKB7T5HNL6BYWGID72B3XKHTSPANCNFSM6AAAAAAY7NGIFU . You are receiving this because you commented.Message ID: @.***>
In that case we really need our own mini-OT system. This would be a function with a signature like
def apply_ot_using_git(commit1, edit1, commit2) -> edit2:
....
And the simplest version of this function would be just to update the line numbers in the edit to account for any changes that happened between commit1 and commit2 OR raise an error if commit1->commit2 involved changes to the same lines
...follow up thought: as long as we're happy to lean heavily on git, and we're happy to just error-out when it fails, the ot
function here can be a git rebase
. We would apply edit1
to commit1
, commit it as commit3
, and then rebase commit3
onto commit2
...follow up Part II: refactoring the trigger lifecycle further, we want to allow 1 Trigger to give rise to any number of Edits. So I think we may want PaperRemote
to support perform_batch_of_edits()
not just perform_edit()
, which is another reason where we want our own OT, but this one might be ugly if done using git 😢
sigh —
EditResponse should probably have some signature like,
class EditResponse:
def as_diff(self, repo):
...
def as_ot(self, original_text):
...
def as_totally_new_string(self, original_text):
...
so that we can apply an edit to any doc type....
or maybe EditResponse is super simple in structure but each paperremote needs to know how to apply it themselves.... have NO idea what that API would look like.
See #21 . I made EditResult
a data class to parallel EditTrigger
. I figure we want the Trigger/Result semantics to be independent of a particular remote, but we can hold the remotes to a high-ish standard where they must provide some basic OT
current trigger/edit lifecycle is
but the extra pull in step 3 can invalidate the triggers in step 2
should be able to just remove the extra pull