codekitchen / dinghy

faster, friendlier Docker on OS X. Deprecated.
MIT License
2.12k stars 109 forks source link

On High Sierra, dinghy appears to mark files as modified #269

Closed ms-ati closed 6 years ago

ms-ati commented 6 years ago

Hi all, I can't claim to know exactly what is causing it, but our team has identified a major issue impeding development on Dinghy since we upgraded to Mac OS X "High Sierra" recently.

When Dinghy is running, and a file in the shared folder is edited externally (e.g. in Emacs) and saved to disk. When user attempts to edit the file again, something unexpected happens:

It should be noted that the expected thing also happens: the app inside Dinghy observes the change.

We've verified this behavior occurring in all of: Emacs, Sublime Text, and Rubymine, for example. However, for Rubymine, there is a workaround by changing to nio2 file api.

Hypothesis: Is this due to an interaction between the Dinghy NFS server and the new Apple APFS file system, which has nanosecond modification times?

We've verified that dinghy stop allows the editors to save files unimpeded by this issue.

Any ideas? Anyone else seeing this?

ms-ati commented 6 years ago

Forthcoming: we are going to experiment with running on regular Docker on Vagrant and report back if that suffers the same problem.

ms-ati commented 6 years ago

Is it possible that it has anything to do with this aspect of how APFS behaves, or any other aspect of APFS in particular?

ms-ati commented 6 years ago

Related at all? https://arstechnica.com/gadgets/2017/10/security-and-compatibility-problems-make-for-a-rough-apfs-transition/

ryansch commented 6 years ago

@ms-ati We (myself and coworkers) are all running High Sierra with dinghy and we haven't seen this issue across vim, macvim, neovim, atom, or vs code. Note: we all did a clean install instead of an upgrade.

ms-ati commented 6 years ago

@ryansch We are seeing this universally -- every developer. We all did the upgrades of existing fairly late-model Macbook Pros, rather than clean installs. We all had our drives converted to APFS.

Can you specify: In Disk Utility, is the disk where your code lives an APFS volume?

ryansch commented 6 years ago

@ms-ati Yes all of my macos volumes are inside of an APFS container.

mizhi commented 6 years ago

@ryansch I'm in the same org as @ms-ati , one thing I just thought of, are your APFS volumes encrypted?

ryansch commented 6 years ago

@mizhi They are!

mizhi commented 6 years ago

To add a bit more mystery to this, I setup a vagrant machine with docker and ran our setup in that, and did not experience the issue described above. To make sure I wasn't going crazy, I shut that down and started up everything using dinghy and immediately started getting the file modification issue.

Going to dig some more on the dinghy side to see if I can suss out more answers (or better observations).

codekitchen commented 6 years ago

I haven't seen this myself, but it's almost surely something to do with https://github.com/codekitchen/fsevents_to_vm which Dinghy uses to make filesystem events work across the NFS mount. So I'm not surprised that it doesn't happen with vagrant-flavored docker, since vagrant doesn't make any attempt to make fsevents work.

ms-ati commented 6 years ago

...to make the above even more specific. Is it reasonable @codekitchen to hypothesize the issue is with

https://github.com/codekitchen/fsevents_to_vm/blob/master/lib/fsevents_to_vm/ssh_emit.rb#L14

...specifically, that then unfsd is observing the touch inside the vm, and propagating back to the APFS file system a new value? Implying a hypothesis that the change to APFS in High Sierra, in some cases, is causing an unexpected mtime to propagate back out via NFS?

codekitchen commented 6 years ago

That does seem the most likely culprit, though I'm not up to speed on APFS and I haven't seen this myself, yet.

A quick glance through the unfs3 source code shows it uses utime for setting timestamps, which only has second-level precision, so it doesn't look likely that fsevents_to_vm can just switch to replicating over higher-resolution timestamps as a fix.

ms-ati commented 6 years ago

@codekitchen I've investigated a bit and this definitely makes sense -- but it doesn't compute how this could affect some, but not all users. How is it possible that not everyone using Dinghy on APFS is experiencing this?

ms-ati commented 6 years ago

Also, in principle a solution to this issue should be possible and in fact not too hard, right? I ask because, I observe that the filesystem inside dinghy is ext4, which DOES itself support nanosecond times, ultimately.

Strawman of the steps one could take here:

  1. Ensure Dinghy's base Linux vm image is configured exactly correct so that nanosecond timestamps in the ext4 filesystem are fully working
  2. Test-drive fsevents_to_vm, if host is on APFS, to propagate the full sub-second timestamp (this is probably just modifying the ruby code to read the fractional timestamp, and the touch command that we run inside the VM to write the fractional timestamp
  3. [HARD?] Patch unfsd to stat and to write nanosecond timestamps where supported

Of the above 3 steps, I feel least confident about (3).

Thoughts?

ms-ati commented 6 years ago

cc @michaeldiscala

codekitchen commented 6 years ago

Yeah that's why I dug into the unfs3 source first -- patching that would be the only difficult part, though hopefully not more than a few hours work. I don't know how much work on top of that it'd be to get the patch accepted upstream, or whether they'd require parity in the Windows back-end. It does look like NFSv3 supports nanosecond-precision timestamps, so it should hopefully be possible. It's not something I'll be able to tackle myself in the near future unfortunately.

ms-ati commented 6 years ago

@codekitchen Ok, lines up with my thinking as well. Our first step is to see if we can isolate and identify what is different in the setups of developers who are seeing this impact vs those who are not. You and I are discussing a reasonable hypothesis for how to fix a problem, but which doesn't seem to affect everyone.

Do you have any hypothesis on how the above could possibly not impact all users?

paolomainardi commented 6 years ago

@ms-ati @codekitchen i suggest you to just stick with standard nfs server implementation (es: https://github.com/adlogix/docker-machine-nfs) we are using that with Dinghy and it works fairly well (https://github.com/sparkfabrik/sparkdock)

ms-ati commented 6 years ago

@paolomainardi Thanks for those links, it looks like similar issues on High Sierra were faced in docker-machine-nfs in https://github.com/adlogix/docker-machine-nfs/issues/79 , and they fixed it with parameters to the NFS daemon?

paolomainardi commented 6 years ago

It was another kind of problem, mainly related to an operating system bug then fixed

ryansch commented 6 years ago

One thing that still confuses me: why aren't we all seeing this bug? What's different about your workflow/environment (etc)?

ms-ati commented 6 years ago

@ryansch Yes! We are furiously invalidating hypotheses as quickly as we think of them. We are (un-)lucky enough to have multiple developers who see, and who are not seeing, this issue. So we are working hard to prove any hypothesis about what is different between their setups.

paolomainardi commented 6 years ago

I am using APFS without the encryption + native nfs server and i can't face any issues here.

ryansch commented 6 years ago

@ms-ati That's quite interesting that you have devs on both sides of the fence there. Feel free to ask for help testing any hypotheses that pan out!

seckenrode commented 6 years ago

Hey all! I wanted to chime in here with some info and reproduction as well. I similarly am on an APFS volume, with encryption resulting from an in-place upgrade to High Sierra.

When dinghy is not running, I do not see any file modified errors while saving files.

no-error

When dinghy is running, I do see file modified errors, if I am saving while typing.

error

This seems to lend credence to the fact that it's related to the timestamp differences between modified timestamps mentioned above. If I save a file while I am not actively typing, my editor does not prompt with any errors:

no-error-ii

Hope that helps, happy to provide more info as needed!

ryansch commented 6 years ago

Wow @seckenrode that's excellent. There's no bug quite like a race condition!

seckenrode commented 6 years ago

To clarify an additional thing that we're finding interesting:

The test_file.rb is in my /Users/seckenrode/Desktop directory. I do not have any docker containers actively running at the time of this test, and I do not have any docker volumes mounted AFAIK. The only difference between the first and second/third GIFs here is that I ran dinghy up between the two.

I'm not sure what that means, w.r.t. the fsevents_to_vm and the unsfd server here.

Thanks!

ms-ati commented 6 years ago

Another clarification: I thought that I wasn't seeing the same issue as @seckenrode. But in fact I am! It's just that I wasn't typing and saving as quickly together as he was.

Steps to reproduce on my machine:

  1. dinghy up
  2. Open a file anywhere under my home directory in Sublime Text 3
  3. While typing quickly, his Cmd-S to save, like "Hello World"

What I observe is, if I type quickly the Cmd-S, I see the same as @seckenrode gif above

ms-ati commented 6 years ago

It appears that the crucial reproduction step is: typing in the editor within 1 second after saving.

This appears to trigger the error in Sublime Text 3 every time. It appears the difference between our developers with and without the issue was more in their typing and saving speed and habits than in their machine configurations?

ms-ati commented 6 years ago

...it appears that Emacs has this worse, as by default it asks the user every time a file modification time is changed, not just when the buffer has changes that differ from disk.

This behavior can be changed by the user to do what Sublime Text 3 does (compare the buffer contents to the disk contents to avoid warning when no changes) through the use of Emacs lisp, I think: https://emacs.stackexchange.com/questions/21137/fix-changed-on-disk-reread-from-disk-when-file-has-not-changed

codekitchen commented 6 years ago

Nice. Right now Dinghy replicates all FS events in your home dir to the docker-machine VM, it doesn't try and work out which folders are currently mounted as volumes and only replicate those. So it makes sense that all you have to do is dinghy up to see the issue.

I think I've covered in the FAQ why using the native NFS server isn't possible for our use cases, but I bet it works well for those who don't also need NFS for other purposes such as Vagrant shares. How much did you have to modify Dinghy to make that setup work?

Using the native NFS server might be a good way to confirm that modifying fsevents_to_vm to touch with sub-second timestamps does fix the issue, before trying to modify unfs3, assuming the native NFS server does do sub-second timestamps.

ms-ati commented 6 years ago

Makes a lot of sense @codekitchen. We are evaluating on our side whether we have the time and appetite to push this further, will let you know.

thierrymichel commented 6 years ago

It seems that you are aware of the problem. Just to say that we encounter the same issue in applications like Paw or Coderunner. But everything works fine with Visual Studio Code…

If it can help we also have strange behaviour with Dropbox. Some files, like *.key (Keynote), are modified without any user actions, can not be opened by others or generates conflicted versions…

If you want more informations, just tell me.

And, last but not least, thanks for the hard work! :+1:

ms-ati commented 6 years ago

Regarding a path forward for fixing UNFS3 on High Sierra to support the nanosecond timestamps: I've filed an "Is this project alive?" bug with UNFS3 on SourceForge: https://sourceforge.net/p/unfs3/bugs/10/

If no response in one week, what would that mean? Well, it would mean that in addition to making and testing a High Sierra fix for UNFS3, we'd also have to become the de-facto upstream for the version published for Mac OS via Homebrew.

Please let me know ^ if you have thoughts on the above @codekitchen ?

ms-ati commented 6 years ago

Unfortunately, as you can see in the discussion with UNFS3 authors on SourceForge, that project appears to be effectively dead with no upstream authors to collaborate with.

Unless someone else emerges who wants to take on development of UNFS3 to the extent that they can collaborate with us in adding support for nanosecond timestamps on APFS on High Sierra, and take on responsibility for an upstream repository and Homebrew recipe, then we are probably not going to pursue this any further.

We are investigating whether Docker for Mac has improved sufficiently in recent months to be a viable replacement for Dinghy for our use cases.

derfian commented 6 years ago

Despite @ms-ati's claims that there are no upstream developers to collaborate with, I spent some time this weekend digging into this issue and came up with something. I've got a git branch of unfs3 with some code that should allow for microsecond or nanosecond resolution access/modify timestamps, on platforms where this is supported. I haven't really tested the code much yet, so beware of bugs.

ms-ati commented 6 years ago

Hi @derfian this is fantastic news! We're so happy to hear about this. What sort of collaboration might be most helpful to you to get your branch towards something ready for production? Is there a test suite we could adapt for it, for example?

ms-ati commented 6 years ago

During my lunch break, what I am trying today: Following the Homebrew instructions to make a proper pull request against Dinghy in order to use @derfian 's new branch for unfs3.

I'll report back here with

  1. Does it actually work on my High Sierra system
  2. A link to the PR in question to discuss further

Any thoughts?

ms-ati commented 6 years ago

Ok. To see my progress on testing @derfian UNFS3 changes in Dinghy thus far, please see this gist: https://gist.github.com/ms-ati/aaafdf091e1b04c4621aeec18cbfab4c

I will update as I progress.

Based on dtruss -n unfsd output, I am seeing calls to lstat64, fstat64, and setattrlistat. I am not seeing any calls to utime, utimes, utimesat, nor clock_gettime, when I edit a file and watch the dtruss output.

In order to view the local MacOS nanosecond timestamps, I did brew install coreutils and gls --full-time.

ms-ati commented 6 years ago

Here's my branch of the Dinghy Homebrew Tap where I am developing, pulling in @derfian 's branch of UNFS3, and soon to pull in my own fork of Dinghy to get the necessary support on the fsevents_to_vm side:

https://github.com/ms-ati/homebrew-dinghy/tree/use-unfs3-branch-higher-resolution-setattr

Thoughts? Comments?

ms-ati commented 6 years ago

Update: It appears that the above-referenced branch of unfs3 by @derfian (Thanks! 👍) is

a. Correctly setting the nanosecond timestamps to the Mac OS server host, but b. Incorrectly reading the nanosecond timestamps from the Mac OS server host -- they appear as zeros to the client Ubuntu Linux system.

Details reported on SourceForge for UNFS3 here: https://sourceforge.net/p/unfs3/bugs/10/#f004

ms-ati commented 6 years ago

Hi @derfian -- just wondering if you thought you might have a chance to look at the above comments on your nanosecond-timestamps branch of UNFS3?

Have a great weekend and thanks again!

ms-ati commented 6 years ago

Great news everyone!

I've verified that @derfian latest code on his branch of unfs3 works, please see my latest comment on sourceforge for details and steps to reproduce correct nanosecond behavior with Dinghy.

What is left to do? Well, the correct behavior is only observed when fsevents_to_vm is disabled. When it is running, the timestamps are set to zero nanoseconds again.

So, the next step is to get fixes into the fsevents_to_vm stack so that nanosecond timestamps are preserved.

HUGE THANK YOU to @derfian -- do you have plans to cut a new official release of unfs3 from your branch code?

ms-ati commented 6 years ago

@codekitchen Is there anyone who can help investigate the fsevents_to_vm side with me?

codekitchen commented 6 years ago

That's awesome. I should have time to poke at fsevents_to_vm later next week. If anybody else wants to look before then, I think all that may need to change is the code that reads the mtime on this line. Unless I'm forgetting something, which is possible as it's been a long time.

ms-ati commented 6 years ago

Hi @codekitchen there's an additional wrinkle. I have a branch of fsevents_to_vm which correctly formats the mtime using Time#iso8601(9) which is the right way to get nanoseconds, and on a Ubuntu container setting using touch -d works correctly.

However, the boot2docker host that fsevents_to_vm is actually ssh'ing into is actually based on BusyBox, and so the touch command is just BusyBox rather than Gnu Coreutils touch. The point of this digression is that the touch command there does NOT support nanoseconds.

One path forward that I was going to try was to have dinghy run an additional container which does contain a nanoseconds-enabled touch command, just for the purpose of replacing the ssh-to-touch with docker exec to touch. Any thoughts on that?

Could the existing Proxy container be extended to (a) mount the NFS-share of the host home directory, and (b) support docker exec from fsevents_to_vm to run the touch command?

ms-ati commented 6 years ago

@codekitchen To restate more concisely what I was saying in previous comment. First, disable your fsevent_to_vm in .dinghy/preferences.yml or just kill it. Then try this:

A. WORKING: Ubuntu's touch (from gnu coreutils) accepts nanoseconds:

# Outside on host Mac OS
$ touch ~/test_file && gls --full-time ~/test_file
-rw-r--r-- 1 username staff 0 2018-05-17 10:39:00.592204127 -0400 /Users/username/test_file
$ docker run -it -v $HOME:/fstest ubuntu:16.04 bash

# Inside Ubuntu container
root@91e1e49f6bf2:/# touch -m -c -d 2018-05-17T14:41:04.730369000Z /fstest/test_file
root@91e1e49f6bf2:/# ls --full-time /fstest/test_file
-rw-r--r-- 1 501 dialout 0 2018-05-17 14:41:04.730369000 +0000 /fstest/test_file

# Outside on host Mac OS again
gls --full-time ~/test_file
-rw-r--r-- 1 username staff 0 2018-05-17 10:41:04.730369000 -0400 /Users/username/test_file

B: BROKEN: Boot2Docker's touch (from BusyBox) rejects nanoseconds:

# Outside on host Mac OS
$ dinghy ssh

# Inside Dinghy's Boot2Docker VM
docker@dinghy:~$ touch -m -c -d 2018-05-17T14:44:12.503968000Z /Users/username/test_file
touch: invalid date '2018-05-17T14:44:12.503968000Z'
ms-ati commented 6 years ago

So, given that we've replaced the format_time in fsevents_to_vm with time.utc.8601(9), which gives you a nanosecond representation in the right format:

In the solution space that I was referencing above, I'm aware of two main shapes to get this timestamp set correctly inside the VM to trigger the inotify events in Linux:

1. Continue to SSH to BusyBox

If we are continuing to ssh into the VM and run a command, we need something other than BusyBox's touch. Possibly we could download a static binary here or otherwise add the command we need to the BusyBox in the VM?

2. Exec inside a Docker container instead

Alternatively, we could use docker exec to run a command inside a Docker container which does have the gnu coreutils version of touch. Since dinghy already runs the http proxy container, I suggested that as one alternative; however we could just as easily run a separate small container just for this purpose.

Thoughts @codekitchen and others?

ms-ati commented 6 years ago

Hi @codekitchen I know you're busy and hate to be a bother, but would really appreciate your thoughts on which direction above so I don't waste my time on a PR that you wouldn't accept. Make sense?

ms-ati commented 6 years ago

Per agreement with @derfian on the unfs3 sourceforge issue here:

I have a few portability concerns about the new code that I'd like to review before merging to master. Making a new release isn't at the top of my priority list at the moment.

No worries, I understand. In the meantime, would you be willing to commit to keeping your existing Github branch available at the same path? Could we temporarily make dinghy depend on that branch to build, to get Mac OS docker users back into good shape, and later change the dependency to be your next release?

Sure, I will do that. I don't mind having the dinghy team depend on this branch until everything else settles into place.

I will be submitting a PR to homebrew-dinghy to officially switch the unfs3 dependency to his branch implementing nanosecond support, until such a time in the future as @derfian is able to finish testing cross-platform support and make an official release.