IBM / ubiquity

Ubiquity
Apache License 2.0
90 stars 26 forks source link

Idempotent provisioner issue #206

Closed hfeish closed 5 years ago

hfeish commented 6 years ago

Changes:

  1. if fail to add to DB then delete the volume from the array
  2. add retry if fail to add to DB due to network issue
  3. if vol already exist just import it to the DB, if fail to add to DB, return false, will not delete the vol from array

Signed-off-by: feihuang feihuang@feihuangs-mbp.cn.ibm.com


This change is Reviewable

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.2%) to 52.954% when pulling 752cd0144208f97302950feb350dbc616379a626 on fix/UB-612_Idempotent_provisioner_issue into 4cadd2acec7fc77182550f27408d608cc5da46c7 on dev.

ranhrl commented 6 years ago

Without reviewing the code, if you have no.3 then why do you need no.1 (why delete the volume?). Another option would be:

This will easily allow you to know where you failed, when you get a request, you know if it's not the first time.


Review status: 0 of 7 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

hfeish commented 6 years ago

If we cannot delete vol successfully in no.1, then if we retry to create vol with same name, then no.3 will work. Or sometime, we create the volume which name is same with volume exist on the array.

For you option, do you mean that we change the order of the volume creation? If so, since we had issue when insert vol in the DB, will we hit the issue when we try to mark the vol as created in the DB, because we also need to open the DB to mark it

ranhrl commented 6 years ago

Review status: 0 of 7 files reviewed at latest revision, all discussions resolved.


database/connection.go, line 88 at r1 (raw file):


    // open db connection, retry 3 time with 5s sleep interval
    for i := 0; i < 4; i ++ {

If we don't already have one, we should have a general retry function. Please look at the code if we have one. An example of a general retry function can be found here https://blog.abourget.net/en/2016/01/04/my-favorite-golang-retry-function/ It also mentions open source retry package that we can leverage: https://github.com/giantswarm/retry-go


local/scbe/scbe.go, line 277 at r1 (raw file):

      s.logger.Debug("Volume not exist on the storage, need to provision")
  } else {
      s.logger.Debug("Volume exists on the storage, no need to provision")

This should be a warning and not a debug level, as this is not the common use case


local/scbe/scbe.go, line 299 at r1 (raw file):

      // Failed to add volume to DB, start to delete volume from the array
      if errDeleteVolume := scbeRestClient.DeleteVolume(volInfo.Wwn); errDeleteVolume != nil {
          return s.logger.ErrorRet(errDeleteVolume, "scbeRestClient.DeleteVolume failed")

You will get error delete volume when trying to create a volume, it does not make sense, what will the caller do with that error? I don't think the design here is perfect.


Comments from Reviewable

hfeish commented 6 years ago

Review status: 0 of 8 files reviewed at latest revision, 3 unresolved discussions.


database/connection.go, line 88 at r1 (raw file):

Previously, ranhrl (Ran Harel) wrote…
If we don't already have one, we should have a general retry function. Please look at the code if we have one. An example of a general retry function can be found here https://blog.abourget.net/en/2016/01/04/my-favorite-golang-retry-function/ It also mentions open source retry package that we can leverage: https://github.com/giantswarm/retry-go

To use general retry function, we need to define func as " func() (err error)", so here we had to retry open instead of "c.factory.newConnection" as "newConnection() (*gorm.DB, error)"


local/scbe/scbe.go, line 299 at r1 (raw file):

Previously, ranhrl (Ran Harel) wrote…
You will get error delete volume when trying to create a volume, it does not make sense, what will the caller do with that error? I don't think the design here is perfect.

In ubiquity we only print the error message, so change it to a readable message


Comments from Reviewable

ranhrl commented 6 years ago

Sure, you will hit the same issue when you try to mark it, but next time you can "track" your actions, means that you will find the volume in the DB marked as "started to create but it is not completed yet", you will then find that the volume was created in the storage and then you will try again to mark it as "completed". It just gives you extra certainty in what is the current status, this way you don't override the volume if it's not you who created it. I know the chances are low (because of the ubiquity ID in the volume name) so it's not mandatory but a better way to handle when you have a transaction. I know Shay did not point to that solution because the current one is simpler to implement, but maybe we should think further when Shay is back if we want to improve it or not. Could be done as a separate ticket.


Review status: 0 of 8 files reviewed at latest revision, 3 unresolved discussions.


database/connection.go, line 88 at r1 (raw file):

Previously, hfeish (Fei Huang) wrote…
To use general retry function, we need to define func as " func() (err error)", so here we had to retry open instead of "c.factory.newConnection" as "newConnection() (*gorm.DB, error)"

well done!


local/scbe/scbe.go, line 299 at r1 (raw file):

Previously, hfeish (Fei Huang) wrote…
In ubiquity we only print the error message, so change it to a readable message

You should give details about the volume name, wwn etc with the message, otherwise it will be harder to debug.


utils/utils.go, line 287 at r2 (raw file):

}

func Retry(attempts int, sleep time.Duration, f func()(err error)) (err error) {

The function is not general enough, maybe it's because I am used to python, but it cannot get any parameters and cannot return any value besides the error, will that support anyone trying to use it in the future?


utils/utils.go, line 297 at r2 (raw file):

        }
        time.Sleep(sleep)
      logs.GetLogger().Debug("retrying", logs.Args{{"attempt", i}, {"error", err}})

please add the function name to the debug message if possible.


Comments from Reviewable

hfeish commented 6 years ago

Open a ticket UB-1186 to discuss the new solution


Review status: 0 of 8 files reviewed at latest revision, 5 unresolved discussions.


local/scbe/scbe.go, line 277 at r1 (raw file):

Previously, ranhrl (Ran Harel) wrote…
This should be a warning and not a debug level, as this is not the common use case

change to warning


local/scbe/scbe.go, line 299 at r1 (raw file):

Previously, ranhrl (Ran Harel) wrote…
You should give details about the volume name, wwn etc with the message, otherwise it will be harder to debug.

Add wwn and name


utils/utils.go, line 287 at r2 (raw file):

Previously, ranhrl (Ran Harel) wrote…
The function is not general enough, maybe it's because I am used to python, but it cannot get any parameters and cannot return any value besides the error, will that support anyone trying to use it in the future?

seems "fun() error" is general here, cannot support more para or return


utils/utils.go, line 297 at r2 (raw file):

Previously, ranhrl (Ran Harel) wrote…
please add the function name to the debug message if possible.

Add fun name here


Comments from Reviewable

shay-berman commented 6 years ago

Hi @hfeish

Some comments about the design describe in the PR:

  1. What happened if you create volume on the storage but then you fail to update the ubiquity DB? In that case you plan to delete the volume from the storage and raise error (according to the PR description). But what happened if in that case you will get error during the deletion of the volume? What is the plan for this?

  2. About retry if fail to add to the DB, its OK, but what is the retry mechanize? how many retries?

  3. About "if vol already exist just import it to the DB, if fail to add to DB, return false, will not delete the vol from array". This is problematic - because it does automatic import for volume even if the customer don't intend to do so. Import volume is a feature that should be cover later on. We should do the import volume explicit (and not implicit) for example when customer wants to import existing volume he will have to add a special annotation in the PVC for importing existing volume.
    So I think we should handle this DB fail in different way. For example (and I think @ranhrl already mention it) - before creating the volume we should write a line in the DB something like "vol name is under creating", and then you create the vol on the storage, and then if you fail to update the DB with the new volume WWN then you fail (you can do retires as you mentioned in [2]). Then if another creating request comes you can see that the volume may already created so you check if its exist and if so you "import" it. But only if the "vol name is under creating" state in the DB. Doing such you prevent implicit volume import and also you prevent don't need to delete the volume on the storage because you probably will get another creation request on the same volume sooner or later. (BTW please check if a PVC fail to create will automatically trigger another creation request - I am not sure about it)

@ranhrl did I miss something here?

shay-berman commented 6 years ago

Review status: 0 of 8 files reviewed at latest revision, 5 unresolved discussions.


database/connection.go, line 88 at r1 (raw file):

Previously, ranhrl (Ran Harel) wrote…
well done!

its ok but I must say that this util.Retry() is very specific function signature. so it fit only for functions with this signature so its not really generic.


database/connection.go, line 81 at r3 (raw file):

func (c *Connection) Open() (error) {
    defer c.logger.Trace(logs.DEBUG)()
    if err := utils.Retry(5, 5*time.Second, c.retryOpening); err != nil{

Please set the retries and time.Seconds as const In addition @fei, please check if there is timeout for the creation task in kubernetes side. You set 25 seconds retries but we need to make sure k8s provisioner will not kill us before that time.


local/scbe/scbe.go, line 299 at r1 (raw file):

Previously, hfeish (Fei Huang) wrote…
Add wwn and name

@Fei, please check out what SCBE does when the same issue happened. I don't think they delete the volume from the storage.


local/scbe/scbe.go, line 276 at r3 (raw file):

  if len(volumeInfo) == 0{
      s.logger.Debug("Volume not exist on the storage, need to provision")
  } else {

this is automatic import feature - which I don't think we aim for. Mentioned it as comment on the PR


local/scbe/scbe.go, line 279 at r3 (raw file):

      s.logger.Info("Warning: Volume exists on the storage, no need to provision")
      // TODO idempotent, if volume attached to host, we should handle when attach this volume to another host.
      err = s.dataModel.InsertVolume(createVolumeRequest.Name, volumeInfo[0].Wwn, fstype)

BTW there is a chance you will get volumeInfo list 2 items, in that case you may consider to raise error. The chances are very small, but here you assume that if the len(volumeInfo) is not 0 then its always 1, but you cannot know for sure. Its better to handle len>1 as well.


local/scbe/scbe_rest_client.go, line 142 at r3 (raw file):

}

func (s *scbeRestClient) GetVolumesByVolName(volName string) ([]ScbeVolumeInfo, error) {

BTW this function is ok, we will need it when we will do explicit import


local/scbe/scbe_rest_client_test.go, line 191 at r3 (raw file):

                {Name: volName + "2", ScsiIdentifier: volIdentifier + "2", ServiceName: profileName + "2"},
            }
            fakeSimpleRestClient.GetStub = OverrideGetStub(volumes)

BTW what is OverrideGetStub? where it comes from?


local/scbe/scbe_test.go, line 354 at r3 (raw file):

      })

      It("should succeed to create volume if vol already exist on array", func() {

this is ok, but as mentioned its a wrong feature to automatic import in implicit way.


utils/utils.go, line 287 at r2 (raw file):

Previously, hfeish (Fei Huang) wrote…
seems "fun() error" is general here, cannot support more para or return

sure I know, you cannot. I just said that its generic for a very specific function signature


Comments from Reviewable

shay-berman commented 5 years ago

This PR was not maintained too much time. I delete the PR but keep the branch