dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0
539 stars 29 forks source link

Problems viewing notebook diffs #757

Closed alaric-dotmesh closed 4 years ago

alaric-dotmesh commented 4 years ago

FuseMachines are reporting multiple issues with the notebook diff view breaking, eg https://cloud.dotscience.com/project/229bf822-d3cb-403e-af6d-fde9a5c07218/runs/file/audio-tag%252Fnotebooks%252FCNN.ipynb/cf2d5227-61a0-4e51-85c3-cfd30f867e12?page=0

Seems to be a problem with the S3 interface used to fetch the notebook files; the FSMs are going into the backoff state and we're not sure why.

alaric-dotmesh commented 4 years ago

Gotcha:

time="2019-10-14T09:19:09Z" level=info msg="<transition> 66429de2-487e-4df3-839e-3e1c7a1f7a66 to active waiting (from active waiting, 0.83s ago)"
time="2019-10-14T09:19:09Z" level=info msg="entering active state for 66429de2-487e-4df3-839e-3e1c7a1f7a66"
time="2019-10-14T09:19:09Z" level=debug msg="[activeState] moving to next state" from=fileOutputIO
time="2019-10-14T09:19:09Z" level=info msg="<transition> 66429de2-487e-4df3-839e-3e1c7a1f7a66 to backoff pausing (from active waiting, 0.00s ago)"
2019/10/14 09:19:09 entering backoff state for 66429de2-487e-4df3-839e-3e1c7a1f7a66
alaric-dotmesh commented 4 years ago

Further logging reveals the next stage in the problem:

time="2019-10-14T15:04:40Z" level=error msg="[readFile] Error statting" error="stat /var/lib/dotmesh/mnt/dmfs/66429de2-487e-4df3-839e-3e1c7a1f7a66@3965e857-246c-4e8a-931a-6860000dd80a/__default__/audio-tag/notebooks/CNN.ipynb: no such file or directory" filename=audio-tag/notebooks/CNN.ipynb sourcePath=/var/lib/dotmesh/mnt/dmfs/66429de2-487e-4df3-839e-3e1c7a1f7a66@3965e857-246c-4e8a-931a-6860000dd80a/__default__/audio-tag/notebooks/CNN.ipynb
alaric-dotmesh commented 4 years ago

The directory is there, but no CNN.ipynb inside it.

...is the problem that we're comparing the snapshot that CREATED CNN.ipynb to the snapshot before, when it didn't exist?

It seems so! There's two snapshots mounted:

The diff URL that's showing the 500 error is:

https://cloud.dotscience.com/diff? base=%2Fv2%2Fprojects%2F229bf822-d3cb-403e-af6d-fde9a5c07218%2Ffiles%2Fsnapshot%2F3965e857-246c-4e8a-931a-6860000dd80a%2Faudio-tag%252Fnotebooks%252FCNN.ipynb%3FshortLivedToken%3DeyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2NvdW50X2lkIjoiOGE5ODA4MjAtOTM5MS00NTdlLTkyMDktZjczZTRjOGU5MjM3IiwiZW1haWwiOiJrc2hpdGl6Lm1hbmRhbEBmdXNlbWFjaGluZXMuY29tIiwiZXhwIjoxNTcxMTUxODc3LCJpYXQiOjE1NzEwNjU0NzcsInN1YiI6IjhhOTgwODIwLTkzOTEtNDU3ZS05MjA5LWY3M2U0YzhlOTIzNyIsInVzZXJuYW1lIjoia3NoaXRpem1hbmRhbCJ9._fiA2Ym4bxd5OtC0Iy5WFj36tHWZPtD3kjzJnaI_61MRTx0q1BhQIQLSNNXA95vJ36qALjMGFFwxN8c47ASJyg &remote=%2Fv2%2Fprojects%2F229bf822-d3cb-403e-af6d-fde9a5c07218%2Ffiles%2Fsnapshot%2F6ef5e876-bddf-44ee-bbb7-7ed3cbfeae78%2Faudio-tag%252Fnotebooks%252FCNN.ipynb%3FshortLivedToken%3DeyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2NvdW50X2lkIjoiOGE5ODA4MjAtOTM5MS00NTdlLTkyMDktZjczZTRjOGU5MjM3IiwiZW1haWwiOiJrc2hpdGl6Lm1hbmRhbEBmdXNlbWFjaGluZXMuY29tIiwiZXhwIjoxNTcxMTUxODc3LCJpYXQiOjE1NzEwNjU0NzcsInN1YiI6IjhhOTgwODIwLTkzOTEtNDU3ZS05MjA5LWY3M2U0YzhlOTIzNyIsInVzZXJuYW1lIjoia3NoaXRpem1hbmRhbCJ9._fiA2Ym4bxd5OtC0Iy5WFj36tHWZPtD3kjzJnaI_61MRTx0q1BhQIQLSNNXA95vJ36qALjMGFFwxN8c47ASJyg

...and those are the snapshot IDs embedded in that URL.

Suggestions:

  1. We probably don't need to go into backoff state when a file is not found in this case, but it's probably not doing any harm (now we have readFile logging). I'll remove the moving to next state logging but propose the readFile logging for master, and add the same sort of logging in saveFile.

  2. We also need to return a useful 404 error when we request a nonexistant file via S3, to help spot this kind of problem faster in future. I'll see if I can do that too.

  3. We need to think about the desired UX for looking at a diff that creates (or deletes?) a file and implement that!

alaric-dotmesh commented 4 years ago

Product feedback on point 3: https://dotscience.slack.com/archives/C7PHSRL82/p1571070744378500

alaric-dotmesh commented 4 years ago

PR for point 1: https://github.com/dotmesh-io/dotmesh/pull/758

alaric-dotmesh commented 4 years ago

PR for point 2: https://github.com/dotmesh-io/dotmesh/pull/759

alaric-dotmesh commented 4 years ago

Point 3 is to be resolved by:

  1. Adding HEAD support to the S3 API so we can cheaply check for nonexistant files
  2. Making the nbdiff proxy in the gateway use the S3 HEAD support to detect when a diff will fail in this case (see https://github.com/dotmesh-io/gateway/compare/dotmesh-757?expand=1 for the code implementing this)
  3. Changing the frontend to detect the 400 status code returned by the nbdime proxy and displaying something else (TBD)