fog / fog-aliyun

Fog provider for aliyun
MIT License

Droplets in the blobstore are deleted on stop/restart of CF application, with fog-aliyun v0.3.8 #61

Closed OlegGerber closed 4 years ago

OlegGerber commented 4 years ago

We have evaluated fog-aliyun v0.3.8 on our AlibabaCloud CF landscape and encountered some problems with the blobstore.

Setup: cf-deployment v12.39.0 capi-release 1.91.0-sap.2

The capi-release is patched to use fog-aliyun v0.3.8.

Deployment of Cloud Foundry is successful, and a first push of an application also works. But after stopping and restarting an application, the droplets in the blobstore are deleted and apps can no longer be started.

We suspect that directory names are not correctly determined in this part of the code: https://github.com/fog/fog-aliyun/blob/876e7c570eb082af06162d4fae992a5e69f7906a/lib/fog/aliyun/models/storage/files.rb#L23. It could be that the path to the "buildpack_cache" folder is incorrectly truncated, so that parent folders are deleted as well. The "buildpack_cache" content is cleaned up by the cc-worker jobs.

xiaozhu36 commented 4 years ago

Hi @OlegGerber, I cannot reproduce this issue by deploying the application spring-music; following your steps, the droplets are not deleted. Which oss-blobstore config do you use: use-alicloud-oss-blobstore.yml or use-alicloud-oss-blobstore-to-multi-bucket.yml? The method check_directory_key(directory_key) is used to resolve the duplicate-folder creation issue when using multi-bucket.

FloThinksPi commented 4 years ago

Hi @xiaozhu36, I've uploaded our cloud_controller_ng release to https://github.com/FloThinksPi-Forks/cloud_controller_ng/tree/v3.81.0-sap.2-test You will find a file that reproduces the exact issue under bin/fog_aliyun_test.rb, which you can then debug. We found that a cleanup job in CC_Worker periodically cleans up the buildpack cache, but instead of cleaning just the subfolder "buildpack_cache" in the droplets bucket, it deletes all droplets. Once they are deleted, stopped apps cannot start again because their droplets are missing. For the test file above, we debugged the CC initialisation and function calls and built this simplification, which mimics the behaviour of that cleanup job. As this does not happen with other fog libraries for different infrastructures, we suspect something goes wrong in the path calculation, but we did not dig that deep into it.

Hope this helps : )

FloThinksPi commented 4 years ago

Upon closer inspection, it seems that CC chooses a subfolder by supplying a prefix, e.g. connection.directories.get("", prefix: "myfolder/myfile").files Attached (as images) are the debug states with the respective variables supplied to fog. The prefix value is not used in the directories.get function or anywhere further down the call path: all objects in the blobstore are returned, not only those beginning with this path/prefix. In other implementations (fog-aws) the prefix is correctly used to limit the returned files/folders to those having this prefix: https://github.com/fog/fog-aws/blob/daa50bb3717a462baf4d04d0e0cbfc18baacb541/lib/fog/aws/requests/storage/get_bucket.rb#L66-L71 (Screenshots attached: 2020-04-22 at 13 27 36 and 2020-04-22 at 13 18 48.)
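The behaviour expected from a prefix-aware listing can be sketched in plain Ruby. This is a minimal illustration under assumptions, not fog-aliyun or fog-aws code: the helper `list_keys`, the constant `BUCKET_KEYS` and the droplet key names are hypothetical, and the bucket is modelled as a plain array of object keys.

```ruby
# Hypothetical in-memory model of a bucket: just a list of object keys.
BUCKET_KEYS = [
  'blobstore_cache/92/6c/926cdf95-7228-40a3-995a-cf94ce68586b/cache.tgz',
  'droplets/ab/cd/abcd-droplet',
  'droplets/ef/01/ef01-droplet'
].freeze

# Hypothetical helper: returns only the keys that start with the prefix.
# A nil or empty prefix means "no restriction", which lists everything.
def list_keys(keys, prefix: nil)
  return keys.dup if prefix.nil? || prefix.empty?

  keys.select { |key| key.start_with?(prefix) }
end

puts list_keys(BUCKET_KEYS, prefix: 'blobstore_cache/92/6c')
```

A listing that ignores the prefix behaves like the empty-prefix branch here: every key in the bucket comes back, whatever path the caller asked for.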

FloThinksPi commented 4 years ago

To clarify our issue and requirements, we decided to summarise everything once more and translate it into Mandarin to overcome potential language barriers.

Setup

Given is a Cloud Foundry landscape with a Cloud_Controller and multiple Cloud_Controller_Worker VMs. The version of CC is https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb which is essentially v3.81.0 but with fog-aliyun patched to v0.3.8 to fix the double folder issue we previously had. We use the following ops-file in cf-deployment to configure multi-bucket fog-aliyun: https://github.com/cloudfoundry/cf-deployment/blob/master/operations/use-alicloud-oss-blobstore-to-multi-bucket.yml

Actual behaviour

We push an app, stop it, start it again and see that it cannot be started because a "droplet not found" error appears.

We narrowed down what is happening:

  1. CC uploads Blobstore_Cache files, builds droplets and saves them in the droplet bucket.
  2. A cleanup job runs periodically on the CC_Worker and purges old Blobstore_Cache files with this function https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L96.
  3. Instead of cleaning just the Blobstore_Cache, the whole droplet bucket gets deleted.

We narrowed it down even further:

  1. The cleanup calls the following function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L96
  2. This function requests and returns all files under a chosen path (passed as prefix into that function): https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L116
  3. This prefix is passed to fog-aliyun here: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121
  4. The function at https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121 does not return only the files under the given path (prefix); instead it returns all files starting from the root folder of the droplet bucket.
  5. The CloudController then calls the destroy operation on every file returned from the files_for function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L99 These files turn out not to be limited to the folder that was passed via prefix. Because files_for, or more precisely the fog-aliyun call at https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121 returns all files starting from the root, all files in the droplet bucket are deleted, not just the ones under the subpath passed as prefix into those functions.

The result is that every time the cleanup job runs on a CC_Worker, the whole droplet bucket gets deleted because of the chain of events detailed above.
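The chain of events can be modelled in a few lines of plain Ruby. This is only a sketch under assumptions: `FakeBlobstore`, its `honour_prefix` switch and the sample keys are hypothetical stand-ins, and only the method names `files_for` and `delete_all_in_path` are borrowed from the description above. It is not the real FogClient or fog-aliyun code.

```ruby
# Hypothetical stand-in for the blobstore client; not the real FogClient.
class FakeBlobstore
  attr_reader :keys

  def initialize(keys, honour_prefix:)
    @keys = keys.dup
    @honour_prefix = honour_prefix
  end

  # Models the listing step: a correct backend filters by prefix,
  # the faulty one returns every key in the bucket.
  def files_for(prefix)
    return @keys.dup unless @honour_prefix

    @keys.select { |key| key.start_with?(prefix) }
  end

  # Models the cleanup step: destroy whatever the listing returned.
  def delete_all_in_path(path)
    files_for(path).each { |key| @keys.delete(key) }
  end
end

keys = ['blobstore_cache/92/6c/old-cache', 'droplets/ab/cd/droplet-1']

faulty = FakeBlobstore.new(keys, honour_prefix: false)
faulty.delete_all_in_path('blobstore_cache/92/6c')
puts "faulty listing leaves:  #{faulty.keys.inspect}"

correct = FakeBlobstore.new(keys, honour_prefix: true)
correct.delete_all_in_path('blobstore_cache/92/6c')
puts "correct listing leaves: #{correct.keys.inspect}"
```

With the faulty listing the bucket ends up empty, while the prefix-aware listing leaves the droplet key in place. The delete step is identical in both cases, which is why the problem only shows up in the listing.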

Expected Behaviour

We push an app, stop it, start it again and this works.

  1. The files_for function returns just the wanted files, i.e. those that MATCH the path in the prefix variable.
  2. More specifically, https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121 returns JUST the files that match the path supplied as prefix.
  3. Thus the delete operation deletes only the files under the wanted path.
  4. The CC_Worker job purges just the wanted path and nothing else.
  5. Thus the droplets are still in the bucket and starting an app works.
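The expected behaviour can also be written down as an executable check. The sketch below is an assumption-laden illustration, not part of cloud_controller_ng or fog-aliyun: `keys_outside_prefix` and `safe_keys_to_delete` are hypothetical helpers that refuse to proceed when a listing contains keys outside the requested prefix.

```ruby
# Hypothetical guard: returns the listed keys that do NOT match the prefix.
def keys_outside_prefix(listed_keys, prefix)
  listed_keys.reject { |key| key.start_with?(prefix) }
end

# Hypothetical cleanup wrapper: refuses to hand back keys for deletion
# when the listing contains anything outside the requested path.
def safe_keys_to_delete(listed_keys, prefix)
  offending = keys_outside_prefix(listed_keys, prefix)
  unless offending.empty?
    raise ArgumentError, "listing ignored prefix #{prefix.inspect}: #{offending.inspect}"
  end

  listed_keys
end

prefix = 'blobstore_cache/92/6c'
good_listing = ['blobstore_cache/92/6c/old-cache']
bad_listing  = ['blobstore_cache/92/6c/old-cache', 'droplets/ab/cd/droplet-1']

puts safe_keys_to_delete(good_listing, prefix).inspect

begin
  safe_keys_to_delete(bad_listing, prefix)
rescue ArgumentError => e
  puts "refused: #{e.message}"
end
```

A check of this kind could serve as a regression test for a storage backend: any listing that returns a droplet key for a cache prefix fails immediately instead of leading to a delete.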

Example code to reproduce and debug

We built https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb so you can easily debug what is happening. We commented the code to show what is intended and what actually happens.

To run this test:

  1. Clone the Repo
  2. Run bundle install to install all required ruby gems.
  3. Execute the file bin/fog_aliyun_test.rb and debug it.
  4. Set a breakpoint here https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/355d1eba6017d62c8808d655c490d6c8fefe4ba8/bin/fog_aliyun_test.rb#L96

What can be observed at the breakpoint?

  1. You have a bucket filled with the file structure described in the code comments.
  2. Now step over the function https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/355d1eba6017d62c8808d655c490d6c8fefe4ba8/bin/fog_aliyun_test.rb#L96
  3. Observe the whole blobstore being deleted, not just the files under the path blobstore_cache/92/6c/926cdf95-7228-40a3-995a-cf94ce68586b

You can now debug further down into the fog-aliyun code and see why this happens (basically see the description above).

WeiQuan0605 commented 4 years ago

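To better explain the problem and clarify our requirements, we decided to summarise all the steps again and translate them into Chinese to overcome potential language barriers.

Setup

Scenario: a Cloud Foundry landscape with one Cloud_Controller and multiple Cloud_Controller_Worker VMs. The version of Cloud Controller (hereafter: CC) is: https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb In essence the CC version is still v3.81.0, but we use the fog-aliyun patch (version v0.3.8) to fix the double folder issue we had before. We use a cf-deployment ops-file to configure multi-bucket fog-aliyun. The link to the cf-deployment ops-file is: https://github.com/cloudfoundry/cf-deployment/blob/master/operations/use-alicloud-oss-blobstore-to-multi-bucket.yml

Actual result after the operations

We performed three steps: 1. push an app, 2. stop the app, 3. start the app again. We found that step 3, starting the app, fails with the error 'droplet not found', i.e. the droplet cannot be found.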

We narrowed the problem down to:

  1. CC uploads Blobstore_Cache files and builds several droplets. These files (blobstore_cache and droplets) are stored in the droplet bucket.
  2. The CC_Worker periodically runs a cleanup job that removes old Blobstore_Cache files. It uses this function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L96.
  3. However, we found that not only the Blobstore_Cache files are deleted; the whole droplet bucket is deleted.

We narrowed the problem down further to:

  1. We call a function; the function link is: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L96
  2. The function from step 1 takes a path as input, and this path is passed as prefix to the following function. Function link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L116 This function returns all files under the chosen path.
  3. This prefix is passed to fog-aliyun. Link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121
  4. This function (link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121) does not return only the files under the given path (prefix); it returns all files under the root directory of the droplet bucket.
  5. CC deletes the files returned by the files_for function. Link to the delete function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L99 It turns out that the returned files are not limited to those under the given path (i.e. under the prefix); the files_for function (link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121) returns all files below the root directory. All files in the droplet bucket are deleted, not just the files under the path (i.e. the prefix) passed into this function.

The result: every time the CC_Worker runs the cleanup job, the whole droplet bucket is deleted because of the chain of events described above.

Expected result

We perform: 1. push an app, 2. stop the app, 3. start the app again, and all of these steps work.

  1. The files_for function returns only the files we need. The paths of these files should match the 'prefix'.
  2. More specifically, this function (link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121) returns only the files under the given path (i.e. under the prefix).
  3. The delete function (link: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L99) deletes only the files under the given path.
  4. The CC_Worker purges only the files under the path that needs cleaning, not all files.
  5. The droplets are still in the bucket, and the app can be started.

Example code to reproduce and debug

We created a test case, link: https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb You can use it to see what happens and to debug. We commented parts of the code so that you can see what we intend and what actually happens.

Steps to run this test:

  1. Clone the repo.
  2. Run bundle install to install the required Ruby gems.
  3. Execute the file bin/fog_aliyun_test.rb and debug it.
  4. Set a breakpoint here: https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/355d1eba6017d62c8808d655c490d6c8fefe4ba8/bin/fog_aliyun_test.rb#L96

What can be observed at the breakpoint?

  1. You have a bucket whose file structure is as described in the code comments.
  2. Now step over the function: https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/355d1eba6017d62c8808d655c490d6c8fefe4ba8/bin/fog_aliyun_test.rb#L96
  3. You can see that the whole blobstore is deleted, not just the files under the path blobstore_cache/92/6c/926cdf95-7228-40a3-995a-cf94ce68586b

You can now debug further into the fog-aliyun code to see why this happens (basically see the description above).

xiaozhu36 commented 4 years ago

Hi @FloThinksPi, thanks for your feedback. Unfortunately, I cannot run bundle install successfully on my laptop because several gems fail to install. But I got your points and have improved directories.get: https://github.com/xiaozhu36/fog-aliyun/blob/master/lib/fog/aliyun/models/storage/directories.rb#L55 Can you run a test based on it?

xiaozhu36 commented 4 years ago

@FloThinksPi In addition, I still cannot reproduce your case with the app spring-music. If you can provide an app for me, it will help me locate the underlying issue.

FloThinksPi commented 4 years ago

@xiaozhu36 getting

Uncaught exception: undefined method `chomp' for ["ali-dev23-cf-droplets-l9kn68xc"]:Array
    /Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:27:in `check_directory_key'
    /Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:50:in `all'
    /Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:79:in `each'
    /Users/i507599/Git/cloud_controller_ng/lib/cloud_controller/blobstore/fog/fog_client.rb:161:in `delete_files'
    /Users/i507599/Git/cloud_controller_ng/lib/cloud_controller/blobstore/fog/fog_client.rb:99:in `delete_all_in_path'
    /Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:115:in `<module:Blobstore>'
    /Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:11:in `<module:CloudController>'
    /Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:10:in `<top (required)>'
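The exception above means that a String method (`chomp`) was called on an Array holding the bucket name. A minimal reproduction in plain Ruby follows, together with one possible way to normalise the value. `normalize_directory_key` is a hypothetical helper for illustration only; it is an assumption and does not reflect how fog-aliyun's check_directory_key is implemented or was later fixed.

```ruby
# The value from the exception message: an Array holding the bucket name.
directory_key = ['ali-dev23-cf-droplets-l9kn68xc']

begin
  directory_key.chomp('/')
rescue NoMethodError => e
  # Array does not define chomp, so this branch is taken.
  puts "reproduced: #{e.class}"
end

# Hypothetical helper: accept either a String or a one-element Array
# and return the key as a String without a trailing slash.
def normalize_directory_key(value)
  key = value.is_a?(Array) ? value.first : value
  key.to_s.chomp('/')
end

puts normalize_directory_key(directory_key)
puts normalize_directory_key('ali-dev23-cf-droplets-l9kn68xc/')
```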

Please don't provide random snippets that have never been executed even once. We can have a call to get the test setup working on your machine if you'd like. How do you want to fix it without being able to reproduce it?

What is your test setup for stopping and starting your app spring-music, and how is the landscape configured? Do you use the multi-bucket-ali ops file? How many CloudController VMs and CloudController_Worker VMs do you have? This happens with any app, as the cleanup job running on the CloudController_Worker deletes all droplets, i.e. all "container images" for all apps. Thus every app is unable to start once it is stopped, scaled or moved to another diego-cell.

xiaozhu36 commented 4 years ago

Hi @FloThinksPi, thanks for your feedback.

  1. I would like to have a call to show my environment, and I need your help to fix the Ruby issue.
  2. My cf deployment uses the multi-bucket-ali ops file, and it is the smallest CF spec, with only one CloudController VM. Is that the reason I cannot reproduce it?

FloThinksPi commented 4 years ago

  1. Sure, I sent you a private message in Slack.
  2. Yes, you need at least one CloudController_Worker VM, which is different from the CloudController VM. The CC_Worker VM does not serve the API; instead it runs background tasks that get scheduled by the CloudController (API) VMs. One of these jobs is the deletion of old blobs, which causes the issue. Without a CC_Worker VM you will likely never run this background task, and thus the delete operation will not be executed.

That's the reason why we created the Ruby snippet to reproduce the issue without any CF environment, to make it much simpler. We will look into getting this to work on your machine.

WeiQuan0605 commented 4 years ago

I think @xiaozhu36 already has a CC_Worker VM, which is theoretically sufficient for the background task, so the lack of CC_Worker VMs is probably not the cause. As for the specific reason why our debug scenario cannot be reproduced, I also think it's better to have a call with @xiaozhu36. This is probably the quickest way to find the problem.

xiaozhu36 commented 4 years ago

Fixed by 0.3.17.