artefactual-sdps / temporal-activities

Temporal activities is a library of general purpose activities
Apache License 2.0
1 stars 0 forks source link

Extend removefiles activity to work with patterns #13

Closed jraddaoui closed 6 months ago

jraddaoui commented 6 months ago

Allows to configure the removefiles activity with a string of comma delimited regular expressions alongside full names. Delete files and directories based on both configuration fields and fail if there is an invalid regexp in the config.

Refs #12.

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 77.69%. Comparing base (9dd2981) to head (c6647df).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #13 +/- ## ========================================== + Coverage 77.20% 77.69% +0.49% ========================================== Files 3 3 Lines 136 139 +3 ========================================== + Hits 105 108 +3 Misses 17 17 Partials 14 14 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jraddaoui commented 6 months ago

I don't know what you mean by "following that approach for all the activities in this repository" - doing validation of the configs?

I don'tr know why I can't reply to that comment directly. I mean making the return of an error a pattern in all NewActivity functions to prevent backwards incompatible changes in the future.

jraddaoui commented 6 months ago

@jraddaoui do you have a use case for the regex matching? It would be nice to know what the intended use is, to get some idea what kind of patterns to expect.

Already using this implementation in https://github.com/artefactual-sdps/preprocessing-sfa/pull/5.

djjuhasz commented 6 months ago

I don'tr know why I can't reply to that comment directly. I mean making the return of an error a pattern in all NewActivity functions to prevent backwards incompatible changes in the future.

Hmm, I'm hesitant to return an error from all activity constructors - it will mean updating all the worker activity registrations from:

w.RegisterActivityWithOptions(
    activities.NewDownloadActivity(logger, tp.Tracer(activities.DownloadActivityName), wsvc).Execute,
    temporalsdk_activity.RegisterOptions{Name: activities.DownloadActivityName},
)

to:

downloadActivity, err := activities.NewDownloadActivity(logger, tp.Tracer(activities.DownloadActivityName), wsvc)
if err != nil {
    // Handle error
}
w.RegisterActivityWithOptions(
    downloadActivity.Execute,
    temporalsdk_activity.RegisterOptions{Name: activities.DownloadActivityName},
)

We're also then handling errors when the worker starts, rather than when the workflow runs.

I guess that does raise the question of whether we want the worker to just fail to start if there's a bad regex pattern in the config? Maybe it's better to log a warning and just ignore the bad pattern?

djjuhasz commented 6 months ago

@jraddaoui do you have a use case for the regex matching? It would be nice to know what the intended use is, to get some idea what kind of patterns to expect.

Already using this implementation in artefactual-sdps/preprocessing-sfa#5.

Hmm, we could use https://pkg.go.dev/path/filepath#Match instead of regex for that use case - it's probably a bit less expensive, but obviously not as flexible as regex matching. It's also less likely for users to shoot themselves in the foot with simple glob matching patterns rather than regex. ;)

jraddaoui commented 6 months ago

I don'tr know why I can't reply to that comment directly. I mean making the return of an error a pattern in all NewActivity functions to prevent backwards incompatible changes in the future.

Hmm, I'm hesitant to return an error from all activity constructors - it will mean updating all the worker activity registrations from:

w.RegisterActivityWithOptions(
  activities.NewDownloadActivity(logger, tp.Tracer(activities.DownloadActivityName), wsvc).Execute,
  temporalsdk_activity.RegisterOptions{Name: activities.DownloadActivityName},
)

to:

downloadActivity, err := activities.NewDownloadActivity(logger, tp.Tracer(activities.DownloadActivityName), wsvc)
if err != nil {
    // Handle error
}
w.RegisterActivityWithOptions(
    downloadActivity.Execute,
    temporalsdk_activity.RegisterOptions{Name: activities.DownloadActivityName},
)

We're also then handling errors when the worker starts, rather than when the workflow runs.

I guess that does raise the question of whether we want the worker to just fail to start if there's a bad regex pattern in the config? Maybe it's better to log a warning and just ignore the bad pattern?

Yes, that's the idea. I'm talking about the activities from this repository only, as they are the ones that are shared. In some cases it may be better to fail on registration than in workflow, and following that pattern may prevent backward incompatible changes.

djjuhasz commented 6 months ago

@jraddaoui what about adding a config Validate() method and validating the Regex patterns there? Another option would be to add a string->regex decoder for viper.Unmarshal().

I'm reluctant to add a error to the constructor return signatures because:

  1. It requires changing a bunch of code in multiple workers.
  2. If we only change the activity constructors in this library, then we have to keep track of that difference when calling the constructors in the enduro workers.
  3. It's unusual to return an error from a constructor in Go. Go constructors do return an error when necessary (e.g. https://pkg.go.dev/net/http#NewRequest) but it's uncommon.
  4. The Temporal docs don't recommend returning an error from an activity constructor like they do from a workflow execution of an activity. Maybe this is because a worker restart is always necessary when you change the code so you don't have the problem of changing signatures of a workflow in flight?
jraddaoui commented 6 months ago

Hi @djjuhasz, sorry for the delay answering this.

Hmm, we could use https://pkg.go.dev/path/filepath#Match instead of regex for that use case - it's probably a bit less expensive, but obviously not as flexible as regex matching. It's also less likely for users to shoot themselves in the foot with simple glob matching patterns rather than regex. ;)

The final regex was going to be a bit more complex to avoid the deletion of a PREMIS file that should remain in the directory. However, I realized that Go regex does not support lookarounds, so it won't be as useful. In the end I renamed the file that should be kept (not only for this reason) and we are still using that basic regex. I also think that Matchfunction is less flexible but still allows users to shoot themselves in the foot.

@jraddaoui what about adding a config Validate() method and validating the Regex patterns there? Another option would be to add a string->regex decoder for viper.Unmarshal().

I'm not planning to make this user configurable, the implementation I linked has a fixed regex, just to remove the PREMIS files.

I'm reluctant to add a error to the constructor return signatures because:

  1. It requires changing a bunch of code in multiple workers.
  2. If we only change the activity constructors in this library, then we have to keep track of that difference when calling the constructors in the enduro workers.
  3. It's unusual to return an error from a constructor in Go. Go constructors do return an error when necessary (e.g. https://pkg.go.dev/net/http#NewRequest) but it's uncommon.
  4. The Temporal docs don't recommend returning an error from an activity constructor like they do from a workflow execution of an activity. Maybe this is because a worker restart is always necessary when you change the code so you don't have the problem of changing signatures of a workflow in flight?
  1. I was only thinking on the activities that are shared in this repository.
  2. I don't think that's an issue.
  3. I see many constructors returning errors, specially in our code.
  4. The main goal is to avoid backwards incompatibility issues in projects using this activities.

We're also then handling errors when the worker starts, rather than when the workflow runs.

Yes, that's exactly why I think that approach is better. If all workflows are going to fail I think it's better to stop the worker, specially if it's running a single workflow.

djjuhasz commented 6 months ago

I'm not planning to make this user configurable, the implementation I linked has a fixed regex, just to remove the PREMIS files.

I think is actually a design problem. If we wanted to use the removefiles.Activity to remove a different set of files in the same worker now we're stuck - we've registered the activity with a set "RemovePattern" (which will replace anything set in the config file), and we can't re-register it with a different "RemovePattern" or "RemoveNames" config.

I think maybe we are better to move the file removal pattern/lists to the execute method parameters, so we can register the activity once and call it multiple times if we need to remove different kinds of files. This would also allow us to remove different types of files at different points in the workflow as necessary.

Making RemovePattern a call parameter would move responsibility for validating the regex to the caller, rather than having to do it in the activity constructor. For example, in your current implementation we don't need to validate the regex - you've hard-coded the expression and we know it's valid.

jraddaoui commented 6 months ago

I agree, that's why I didn't like the solution from https://github.com/artefactual-sdps/temporal-activities/pull/7. I remember talking about it, but I can't find where. I'll do that and change its usage in preprocessing-moma ;)