Closed OlegGerber closed 4 years ago
Hi @OlegGerber, I cannot reproduce this issue by deploying the spring-music application; following your steps, the droplets are not deleted.
Which OSS blobstore config do you use: use-alicloud-oss-blobstore.yml or use-alicloud-oss-blobstore-to-multi-bucket.yml?
The method check_directory_key(directory_key) is used to resolve the duplicate-folder issue when using multi-bucket.
Hi @xiaozhu36, I've uploaded our cloud_controller_ng release to https://github.com/FloThinksPi-Forks/cloud_controller_ng/tree/v3.81.0-sap.2-test. You will find a file that reproduces the exact issue under bin/fog_aliyun_test.rb, which you can then debug. We found that a cleanup job in CC_Worker periodically cleans up the buildpack cache, but instead of cleaning only the subfolder "buildpack_cache" in the droplets bucket, it deletes all droplets. Once the droplets are deleted, stopped apps cannot start again because their droplets are missing. In the test file above we debugged the CC initialisation and function calls and built this simplification, which mimics the behaviour of that cleanup job. As this does not happen with the fog libraries for other infrastructures, we suspect something goes wrong in the path calculation, but we did not go that deep.
Hope this helps : )
Upon closer inspection it seems like CC is choosing a subfolder by supplying a prefix e.g.
connection.directories.get("", prefix: "myfolder/myfile").files
The attached images show the debug states with the respective variables supplied to fog. The prefix value is not used in the directories.get function or anywhere further down the path: all objects in the blobstore are returned, not only those beginning with this path/prefix. In other implementations (fog-aws) the prefix is correctly used to limit the returned files/folders to those having this prefix: https://github.com/fog/fog-aws/blob/daa50bb3717a462baf4d04d0e0cbfc18baacb541/lib/fog/aws/requests/storage/get_bucket.rb#L66-L71
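As a minimal sketch of the behaviour expected here, the following plain-Ruby stand-in shows a listing that honours a prefix next to one that ignores it. No fog code is involved: `FakeBucket`, its two methods, and the key names are hypothetical and exist only for illustration.

```ruby
# Hypothetical in-memory stand-in for a bucket; this is not fog-aliyun code.
class FakeBucket
  def initialize(keys)
    @keys = keys
  end

  # Expected behaviour: only keys that start with the prefix are listed.
  def list(prefix: nil)
    return @keys.dup if prefix.nil? || prefix.empty?

    @keys.select { |key| key.start_with?(prefix) }
  end

  # Behaviour described in this issue: the prefix is silently ignored.
  def list_ignoring_prefix(prefix: nil)
    @keys.dup
  end
end

bucket = FakeBucket.new([
  "buildpack_cache/aa/bb/cache-1",
  "buildpack_cache/cc/dd/cache-2",
  "ab/cd/droplet-1",
])

expected = bucket.list(prefix: "buildpack_cache/")
observed = bucket.list_ignoring_prefix(prefix: "buildpack_cache/")

puts "expected: #{expected.inspect}"
puts "observed: #{observed.inspect}"
```

In this sketch the second listing also contains the droplet key, which is the situation that makes any "delete everything that was listed" step dangerous.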
To clarify our issue and requirements once more, we decided to summarise everything again and translate it into Mandarin to overcome potential language barriers.
Given is a Cloud Foundry landscape with a Cloud_Controller and multiple Cloud_Controller_Worker VMs. The version of CC is https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb which is essentially V3.81.0 but with fog-aliyun patched to v0.38.0 to fix the double-folder issue we previously had. We use the following ops file in cf-deployment to configure multi-bucket fog-aliyun: https://github.com/cloudfoundry/cf-deployment/blob/master/operations/use-alicloud-oss-blobstore-to-multi-bucket.yml
We now push an app, stop it, start it again, and see that it cannot be started because a "droplet not found" error appears.
We narrowed down what is happening, and then narrowed it down even further:
1. The path is passed as `prefix` into this function, which is meant to return all files under the chosen path: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L116
2. The `prefix` is passed on to fog-aliyun: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121
3. fog-aliyun does not return only the files matching the `prefix`; instead it returns all files beginning from the root folder of the droplet bucket.
4. The result of `files_for` is passed to the delete function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L99. That result turns out not to be limited to the folder that was passed via `prefix`. Because `files_for`, or more precisely the fog-aliyun call at https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121, returns all files beginning from the root, all files in the droplet bucket are deleted, not just those under the subpath passed as `prefix` into those functions.
The result is that every time the cleanup job runs on a CC_Worker, the whole droplet bucket gets deleted because of the chain of events detailed above.
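The chain of events above can be sketched without a CF environment. This is a simplified illustration, not the cloud_controller_ng implementation: `files_for` and `delete_all_in_path` mirror names that appear in this thread, while `BrokenLister`, `safe_delete_all_in_path`, `sample_store` and the keys are hypothetical. The client-side re-check in the second method is an assumption about one possible safeguard, not a fix proposed here.

```ruby
# Simplified model; names other than files_for / delete_all_in_path are hypothetical.
class BrokenLister
  def initialize(store)
    @store = store
  end

  # Models a backend that ignores the prefix and lists from the bucket root.
  def files_for(_prefix)
    @store.keys
  end
end

# Deletes whatever the lister returns, trusting it to honour the prefix.
def delete_all_in_path(store, lister, path)
  lister.files_for(path).each { |key| store.delete(key) }
end

# Same operation, but re-checks each key against the path before deleting.
def safe_delete_all_in_path(store, lister, path)
  lister.files_for(path).each do |key|
    store.delete(key) if key.start_with?(path)
  end
end

# Builds a fresh store holding one cache entry and one droplet.
def sample_store
  {
    "buildpack_cache/aa/cache-1" => "cache",
    "ab/cd/droplet-1" => "droplet",
  }
end

trusting = sample_store
delete_all_in_path(trusting, BrokenLister.new(trusting), "buildpack_cache")

guarded = sample_store
safe_delete_all_in_path(guarded, BrokenLister.new(guarded), "buildpack_cache")

puts "trusting store after cleanup: #{trusting.keys.inspect}"
puts "guarded store after cleanup:  #{guarded.keys.inspect}"
```

With a lister that ignores the prefix, the trusting variant empties the whole store, droplet included, while the guarded variant removes only the cache entry.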
What we expect instead:
We push an app, stop it, start it again, and this works.
The `files_for` function returns only the wanted files, those that MATCH the path in the `prefix` variable.
Only the files under `prefix` are deleted.
We built https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb so you can easily debug what is happening. We commented the code to show what is intended and what actually happens.
To run this test:
1. Run `bundle install` to install all required Ruby gems.
2. Run `bin/fog_aliyun_test.rb` and debug it.
3. The path to watch is `blobstore_cache/92/6c/926cdf95-7228-40a3-995a-cf94ce68586b`.
You can now debug further down into the fog-aliyun code and see why this happens (basically, see the description above).
To explain the problem better and clarify our requirements, we decided to summarise all steps again and provide them in translation to overcome potential language barriers.
Scenario: a Cloud Foundry landscape with one Cloud_Controller and multiple Cloud_Controller_Worker VMs. The version of Cloud Controller (CC below) is https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb which is essentially V3.81.0, but with fog-aliyun patched (version V0.38.0) to fix the double-folder issue we had before. We use the cf-deployment ops file to configure multi-bucket fog-aliyun: https://github.com/cloudfoundry/cf-deployment/blob/master/operations/use-alicloud-oss-blobstore-to-multi-bucket.yml
We performed three steps: 1. push an app, 2. stop the app, 3. start the app again. Step 3 fails: the app cannot be started, and the error is "droplet not found".
We narrowed the problem down, and then narrowed it down further:
1. The path is passed as `prefix` to the following function, which returns all files under the selected path: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L116
2. The `prefix` is passed on to fog-aliyun: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121
3. fog-aliyun does not return only the files matching the `prefix`; it returns all files under the root of the droplet bucket.
4. The files returned by the `files_for` function are passed to the delete function: https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L99
It turns out that the returned files are not only those under the given path (that is, under `prefix`). The `files_for` function (https://github.com/cloudfoundry/cloud_controller_ng/blob/45a9d110c457b56089b3dc70b9b75228e453936a/lib/cloud_controller/blobstore/fog/fog_client.rb#L121) returns all files below the root. All files in the droplet bucket are deleted, not just those under the path (the `prefix`) passed to this function.
The result: every time a CC_Worker runs the cleanup job, the whole droplet bucket is deleted through the chain of events described above.
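What we expect instead:
We push an app, stop it, and start it again, and all of these operations work.
The `files_for` function returns only the files we need. The paths of these files should match `prefix`.
Only the files under `prefix` are deleted.
We created a test case: https://github.com/FloThinksPi-Forks/cloud_controller_ng/blob/v3.81.0-sap.2-test/bin/fog_aliyun_test.rb. You can use it to see what happens and to debug. We commented parts of the code so that you can see what we intend and what actually happens.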
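Steps to run this test:
1. Run `bundle install` to install the required Ruby gems.
2. Run the `bin/fog_aliyun_test.rb` file and debug it.
3. The path to watch is `blobstore_cache/92/6c/926cdf95-7228-40a3-995a-cf94ce68586b`.
You can now debug further down into the fog-aliyun code to see why this happens (basically, see the description above).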
Hi @FloThinksPi, thanks for your feedback. Unfortunately, I cannot run bundle install successfully on my laptop because several gems fail to install.
But I got your point, and I have improved directories.get: https://github.com/xiaozhu36/fog-aliyun/blob/master/lib/fog/aliyun/models/storage/directories.rb#L55
Could you test against it?
@FloThinksPi In addition, I still cannot reproduce your case with the spring-music app. If you can provide an app for me, it will help me locate the underlying issue.
@xiaozhu36 getting
Uncaught exception: undefined method `chomp' for ["ali-dev23-cf-droplets-l9kn68xc"]:Array
/Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:27:in `check_directory_key'
/Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:50:in `all'
/Users/i507599/.bundle/ruby/2.6.0/bundler/gems/fog-aliyun-0b27da886b45/lib/fog/aliyun/models/storage/files.rb:79:in `each'
/Users/i507599/Git/cloud_controller_ng/lib/cloud_controller/blobstore/fog/fog_client.rb:161:in `delete_files'
/Users/i507599/Git/cloud_controller_ng/lib/cloud_controller/blobstore/fog/fog_client.rb:99:in `delete_all_in_path'
/Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:115:in `<module:Blobstore>'
/Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:11:in `<module:CloudController>'
/Users/i507599/Git/cloud_controller_ng/bin/fog_aliyun_test.rb:10:in `<top (required)>'
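The exception in this trace arises because `chomp`, a String method, was called on an Array. The following is a minimal sketch of that failure and of one possible normalisation. `normalize_directory_key` is a hypothetical helper, not the actual `check_directory_key` from fog-aliyun, and the `"/"` separator is an assumption; the bucket name is the one shown in the trace.

```ruby
# Hypothetical helper; not the real check_directory_key from fog-aliyun.
def normalize_directory_key(directory_key)
  # Accept either a String or a one-element Array holding the key.
  key = directory_key.is_a?(Array) ? directory_key.first : directory_key
  # Assumption: a trailing "/" is the only decoration to strip.
  key.to_s.chomp("/")
end

raw = ["ali-dev23-cf-droplets-l9kn68xc"]

begin
  raw.chomp("/") # Array does not respond to chomp
rescue NoMethodError => e
  puts "reproduced: #{e.class}"
end

puts normalize_directory_key(raw)
puts normalize_directory_key("ali-dev23-cf-droplets-l9kn68xc/")
```

Whether the key should ever arrive as an Array is a separate question; the sketch only shows why the call fails as reported.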
Please don't provide random snippets that have never been executed even once. We can have a call to get the test setup working on your machine if you'd like. How do you want to fix it without being able to reproduce it?
What is your test setup for stopping and starting your spring-music app, and how is the landscape configured? Do you use the multi-bucket-ali ops file? How many CloudController VMs and CloudController_Worker VMs do you have?
This happens with any app, because the cleanup job running on the CloudController_Worker deletes all droplets, that is, the "container images" for all apps. Thus every app is unable to start once it has been stopped, scaled, or moved to another diego-cell.
HI @FloThinksPi Thanks for your feedback.
That's the reason we created the Ruby snippet: to reproduce the issue without any CF environment and make it much simpler. We will look into getting this to work on your machine.
I think @xiaozhu36 already has a CC_Worker VM, which is theoretically sufficient for the background task, so the cause is probably not a lack of CC_Worker VMs. As for why our debug scenario cannot be reproduced, I also think it's better to have a call with @xiaozhu36. That is probably the quickest way to find the problem.
Fixed by 0.3.17.
We have evaluated fog-aliyun v0.3.8 on our AlibabaCloud CF landscape and encountered some problems with the blobstore.
Setup: cf-deployment v12.39.0 capi-release 1.91.0-sap.2
The capi-release patched to use fog-aliyun in v0.3.8.
Deployment of Cloud Foundry is successful. A first push of an application also works. But after stopping and restarting an application, the droplets in the blobstore are deleted. Apps cannot be started any more.
We suspect that directory names are not correctly determined in this part of the code: https://github.com/fog/fog-aliyun/blob/876e7c570eb082af06162d4fae992a5e69f7906a/lib/fog/aliyun/models/storage/files.rb#L23. It could be that the path to the "buildpack_cache" folder is incorrectly truncated, so that parent folders are deleted as well. The "buildpack_cache" content is cleaned up by the cc-worker jobs.