Bulk rerun work orders (from start)

amberrignell commented 1 year ago

User story

As a user that had a workflow which was running successfully, who then realised the logic isn't what I wanted, I would like to be able to bulk reprocess workflows, so that each workorder can be rerun with my updated logic.

Details

Allow users to select work orders from the history page Allow users to reprocess the work orders they have selected

Selecting work orders Using the following component, add a checkbox column to the history page.

There should be a checkbox in the first column (to the left of 'Workflow').
Every work order should have a checkbox, unselected by default
When the user selects the top-level checkbox in the column heading, all work orders on the page get selected
A user can select several individual work orders using the checkboxes within the work order rows

If a user has selected some work orders, then changes the filters, their selection is lost.

Reprocessing work orders
- When a user selects either the top-level checkbox OR work order level checkboxes, a 'rerun from start' button should appear.
- When the user clicks 'Rerun from start', the user should see a pop up modal which confirms which work orders have been selected (how many work orders will be reprocessed, and the search filters used to select these work orders)

Use the following modal component

A.Top-level select Message: You've selected the number_of_workorders_on_page work orders from page age_number of results for workorders matching the following query search_params. There are a total of number_of_workorders_matching_query work orders matching this query.

Example: You've selected the 12 work orders from page 1 of results for workorders matching the following query: received between x and y date, for workflow A, with a status of 'Failed'. There are a total of 32 work orders matching this query.

Buttons: Cancel, Rerun all work orders on this page from start, Rerun all work orders matching the query.

Screenshot 2023-04-12 at 10.28.24.png

B.Individual work orders selected Message: You've selected number_of_selected_work_orders work orders to rerun from the start (first job). This will create a new attempt for these within the same work order Example: You've selected 12 work orders to reprocess from the start (first job). This will create a new attempt for these within the same work order. Buttons: Cancel, Rerun from start

Screenshot 2023-04-12 at 10.31.21.png

When the user clicks 'Rerun from start', the users gets an alert "xx work orders will be bulk rerun" (so that they have feedback that their action was applied), and the work orders all get reprocessed in the same order in which they were received

Implementation notes

Please read through the platform-app implementation first Write a function that can take a list of work order IDs and create a new attempt for each one, in the order in which they have been passed to the function. Write a function that can take a query, and create a new attempt for each work order returned by the query, in the order in which they are returned.

The function rerun_run_from_start parameters would be: source: (list of IDs, or parameters) user: (used to add to the invocation reason)

Release notes

Bulk rerun work orders from start

User acceptance criteria

[ ] I can 'select all' work orders on the history page
[ ] Work orders get reprocessed in the order in which they were received
[ ] when I click 'Rerun from start', I see a pop-up which confirms how many work orders will be reprocessed, and the search filters used to select these work orders (ex: Reprocess 12 work orders which were received between x and y date, for workflow A, with a status of 'Failed')
[ ] If a user changes the filters, they lose their selected items.
[ ] If I click the top level select, all workorders are selected. If I unselect one, the top-level select is also unselected. If I click unselect from the top level, all of them are unselected.

taylordowns2000 commented 1 year ago

@amberrignell , tailwind component for this where employee rows are workorders : https://tailwindui.com/components/application-ui/lists/tables#component-390db22a862687fe88b313692989ffaf

note that this requires some JS

taylordowns2000 commented 1 year ago

Useful naming content:

taylordowns2000 commented 1 year ago

Possible approach:

see how "rerun" works currently in the liveview and/or service.
determine how a "batch" or a "query" of workorders could all be "rerun" using the current backend (maybe: is there a function that takes a list or workorder IDs and creates a new attempt for each workorder from the first job in the workflow with the initial state.)
write that function
then, consider how to pass workorder IDs from the UI (when i've selected 3 workorders) to that function.
for the modal, consider how to use the applied query (or "total" count as opposed to the paginated view) to give the user the option to send just those IDs select or ALL IDs regardless of current pagination.
send them for execution, display toast notification.

sigu commented 1 year ago

Jun-11-2023 21-08-32

How would you like the modal to be rendered, using the petal components or using liveview components?
We already have the list of work orders to be rerun.
Maybe get the first attempt and first run then rerun, use inserted_at to determine the oldest of each.
WordOrderService looks like a good place to restart the job from. I am currently thinking of calling retry_attempt_run(attempt_run, user) passing it an attempt run determined by 3 above.

sigu commented 1 year ago

Why do we need the following tables?

Attempt runs - currently assuming its a join table of attempts and runs
Jobs - what the big difference between this and a Run?
Dataclips - No idea at the moment, if we need to adjust anything in it I might need to have an idea.

For this assignment, should we change the invocation reason? (I am still not sure what invocation reason is , just inferred from the name)

amberrignell commented 1 year ago

Hey @sigu , thanks for the questions!

To give a quick answer,

Correct, attempt runs are a join table of attempts and runs (this is because two different attempts could have the same runs (if they get copied over for a “rerun from job 2” for example
A job is the adaptor, credential and job expression that a user has defined. A run is when that job has been run with some data (state)
A dataclip is the data that is used to run a job with

We'll see you in standup tomorrow so Stu can give you more details then, hope this helps in the meantime!

midigofrank commented 1 year ago

Hey @amberrignell @taylordowns2000 , should the rerun from start button get disabled / hidden when you don't have permission? Or should we allow the users to click around and get the notification error message

taylordowns2000 commented 1 year ago

hey @midigofrank , who doesn't have permission? (have you created a policy for this action? sorry, i have not seen anything related to permissions in the issue.)

midigofrank commented 1 year ago

Hey @taylordowns2000 , I noticed there's a check for can_rerun_job when rerunning a single job. I thought maybe we could use the same check for the bulk rerun

  def handle_event(
        "rerun",
        %{"attempt_id" => attempt_id, "run_id" => run_id},
        socket
      ) do
    if socket.assigns.can_rerun_job do
      attempt_id
      |> AttemptService.get_for_rerun(run_id)
      |> WorkOrderService.retry_attempt_run(socket.assigns.current_user)

      {:noreply, socket}
    else
      {:noreply,
       socket
       |> put_flash(:error, "You are not authorized to perform this action.")}
    end
  end

taylordowns2000 commented 1 year ago

Cool. For now please allow the error flash to appear as we don't have a design that explains why a user isn't able to run it. We can consider adding this later.

midigofrank commented 1 year ago

Alright. Perfect

midigofrank commented 1 year ago

Hey @taylordowns2000 , how would you suggest we handle failures. Let's say a single attempt fails, should we abort all the other attempts?

taylordowns2000 commented 1 year ago

hi @midigofrank , great q 😄

No, the scope of this feature ends at Oban.insert. In other words, this feature is to "queue up" a bunch of attempts (same thing that clicking "rerun" really fast on a bunch of attempts does) but once they're on the Oban queue they're in a totally different domain.

taylordowns2000 commented 1 year ago

So.... sorry to be extra clear: if there is an error before you successfully enqueue those attempts with Oban.insert (what might this be???) you could display a flash message explaining that something broke and asking the user to try again, but I don't want to do that since this would certainly be an unexpected error and I'd rather see it bubble up on Sentry then be able to debug the code. That make sense?

midigofrank commented 1 year ago

Hey @taylordowns2000 . That makes sense. Currently, I've wrapped all the actions (including the Oban.insert) for all the selected WorkOrders to be retried in a single transaction. Assuming we have Work Orders A, B, C, should an error occur while still retrying A, B and C won't get processed at all. Should that be the case?

taylordowns2000 commented 1 year ago

Sounds like we might need a quick call to nail this down—let's chat tomorrow! In the meantime:

When you say "while still retrying" do you mean "while you're attempting to add them to Oban" or "while they are running"? Assuming the first, I think what you actually want is the Oban.multi.

      reducer = fn {changesets, index}, multi ->
        Oban.insert_all(multi, "run_batch_#{index}", changesets)
      end

      runs
      |> Enum.chunk_every(5_000)
      |> Enum.with_index()
      |> Enum.reduce(Ecto.Multi.new(), reducer)
      |> Repo.transaction()

      conn
      |> put_status(:ok)
      |> text(Enum.count(runs))

What we can chat about tomorrow is that if the user asks to enqueue 43,000 attempts, it would be nice to either have them all get enqueue or have none of them get enqueued. (We don't know and don't care if the new attempts succeed or fail as they are processed over the next several days... our job is DONE and we report back to the end user "All good... 43,000 enqueued!" as soon as they hit the queue.)

taylordowns2000 commented 1 year ago

@midigofrank , next steps here please:

[x] rebase on main
[x] get all tests passing with current implementation style (then ping me on github please!)
[x] once tests are passing, optimize to use the Oban.insert_all so we're hitting the DB slightly less for a big bulk re-processing request

midigofrank commented 1 year ago

Hey @taylordowns2000 , might you have an idea why the Mimic library isn't getting loaded while testing? I keep getting the error below when I run mix test:

** (UndefinedFunctionError) function Mimic.copy/1 is undefined (module Mimic is not available)

taylordowns2000 commented 1 year ago

I haven't seen that problem pop up for anyone yet. Please push your most recent changes to your branch (assuming you've rebased, also check that box above!) and I'll fire it up.

Tests run OK for me as of this commit:

commit 04cdafe004f5aec7b19e7973166235640493f3b5 (HEAD -> 659-bulk-rerun-work-orders, origin/659-bulk-rerun-work-orders)
Author: Frank Midigo <midigofrank@gmail.com>
Date:   Wed Jun 14 12:58:21 2023 +0300

midigofrank commented 1 year ago

sure thing. Just pushed

taylordowns2000 commented 1 year ago

Eish, sorry @midigofrank , it looks ok on my machine and on circleCI. i'm not sure i'll be able to help.

What OS are you using? Have you seen this issue on other Elixir projects? Maybe try asdf install again just to make sure? @sigu , any ideas on this one?

midigofrank commented 1 year ago

Hey @taylordowns2000 , I'm on M1 MacOs

midigofrank commented 1 year ago

Let me try it on a different OS

taylordowns2000 commented 1 year ago

Hrmmm I'm on an M1 also.

midigofrank commented 1 year ago

@taylordowns2000 , I've been able to run the tests by setting MIX_ENV=test. I thought mix test does that for us 🤔

taylordowns2000 commented 1 year ago

Yeah it does. Maybe you've got MIX_ENV set to something else by default in your path?

On Fri, Jun 16, 2023 at 10:06 AM Midigo Frank @.***> wrote:

@taylordowns2000 https://github.com/taylordowns2000 , I've been able to run the tests by setting MIX_ENV=test. I thought mix test does that for us 🤔

— Reply to this email directly, view it on GitHub https://github.com/OpenFn/Lightning/issues/659#issuecomment-1594365055, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCUBLK4X2OWNB3WFG4CZNDXLQOYRANCNFSM6AAAAAAVEN55UU . You are receiving this because you were mentioned.Message ID: @.***>

midigofrank commented 1 year ago

Aha, you're right. I made a copy of the .env.example and MIX_ENV has been set to dev. Let me get rid of it.

midigofrank commented 1 year ago

Hey @taylordowns2000 , all tests are passing except for one I've added for the liveview page. This only happens when I have the Oban.insert(Pipeline Job) code as part of the Multi which does everything else.

could not checkout the connection owned by #PID<0.1124.0>

EDIT: Full stack trace

** (EXIT from #PID<0.1124.0>) an exception was raised:
         ** (RuntimeError) ChildProcess task exited without a value:
     {%DBConnection.ConnectionError{message: "could not checkout the connection owned by #PID<0.1124.0>. When using the sandbox, connections are shared, so this may imply another process is using a connection. Reason: connection not available and request was dropped from queue after 510ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information", severity: :error, reason: :queue_timeout}, [{Ecto.Adapters.SQL, :raise_sql_call_error, 1, [file: 'lib/ecto/adapters/sql.ex', line: 913, error_info: %{module: Exception}]}, {Ecto.Repo.Schema, :apply, 4, [file: 'lib/ecto/repo/schema.ex', line: 756]}, {Ecto.Repo.Schema, :"-do_update/4-fun-3-", 15, [file: 'lib/ecto/repo/schema.ex', line: 459]}, {Lightning.Invocation, :update_run, 2, [file: 'lib/lightning/invocation.ex', line: 272]}, {Lightning.Pipeline.Runner.Handler, :"-start/2-fun-1-", 3, [file: 'lib/lightning/runtime/handler.ex', line: 74]}, {Task.Supervised, :invoke_mfa, 2, [file: 'lib/task/supervised.ex', line: 89]}, {Task.Supervised, :reply, 4, [file: 'lib/task/supervised.ex', line: 34]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}
             (lightning 0.6.3) lib/lightning/runtime/handler.ex:122: Lightning.Pipeline.Runner.Handler.wait/1
             (lightning 0.6.3) lib/lightning/pipeline.ex:29: Lightning.Pipeline.process/1

midigofrank commented 1 year ago

Hey @taylordowns2000 , I found it easier to just move the implementation to "bulk mode" and leave out the Oban.insert operation from the main transaction. All tests are now passing.

taylordowns2000 commented 1 year ago

great! please convert this to a full PR (it's a "draft" right now) and request a review from me and @stuartc ! we'll take a look tomorrow AM. do you know if @sigu is joining our final call tomorrow AM?

also, in prep for that call, have a look at the failed checks related to code coverage. thanks @midigofrank and speak soon

midigofrank commented 1 year ago

Got it. Given his current timezone, I'm not so sure. I'll confirm with him and let you know

OpenFn / lightning