act-now-coalition / can-scrapers

Only allow one instance of each flow to run at a time #412

Closed smcclure17 closed 2 years ago

smcclure17 commented 2 years ago

Adds a state handler that, whenever a flow run is set to execute, checks whether the current flow already has another instance running and, if so, skips the instance that's about to start.
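The handler itself isn't shown in this thread, so here is a minimal, library-free sketch of the idea described above. `FlowRun`, `list_runs`, and the string state names are hypothetical stand-ins for whatever the orchestrator's API actually returns, not the code from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FlowRun:
    """Hypothetical record for one flow run, as the orchestrator might report it."""
    run_id: str
    flow_name: str
    state: str  # assumed state names: "Scheduled", "Running", "Skipped", ...


def other_instance_running(current: FlowRun, runs: Iterable[FlowRun]) -> bool:
    """True if a different run of the same flow is currently in the Running state."""
    return any(
        r.flow_name == current.flow_name
        and r.state == "Running"
        and r.run_id != current.run_id
        for r in runs
    )


def skip_if_running_handler(
    current: FlowRun,
    old_state: str,
    new_state: str,
    list_runs: Callable[[], Iterable[FlowRun]],
) -> str:
    """State-handler-style hook: returns the state the run should move to.

    Only transitions into "Running" are checked; every other transition
    passes through unchanged.
    """
    if new_state != "Running":
        return new_state
    if other_instance_running(current, list_runs()):
        return "Skipped"
    return new_state
```

In this sketch `list_runs` is injected so the check can be exercised without a live backend; a real handler would query the orchestrator for the flow's active runs instead.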

Last night, it looks like two NYT scrapers kicked off around the same time (see screenshots below), and I think this might be what caused them to take longer to complete. #410 eliminates the need to rely on scheduling/timing, so if the scrapers take longer in the future it won't be as big of a deal, but regardless, there's no need to have two instances of the same scraper flow running at the same time.

[Screenshots: two NYT scraper flow runs started around the same time]

smcclure17 commented 2 years ago

Yeah, that's a good flag; I think maybe we should be. The request returns the start time of the currently running flow, so we could potentially just say "if the running flow has been running for 2 hours, start anyway," but that's hacky.
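For reference, the "start anyway after 2 hours" check could look roughly like the sketch below. The function name and the `stale_after` parameter are made up for illustration, and it assumes the start time arrives as a `datetime` (or `None` when nothing is running).

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cutoff, taken from the "2 hours" figure in the comment above.
STALE_AFTER = timedelta(hours=2)


def should_skip_new_run(
    running_start_time: Optional[datetime],
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> bool:
    """Skip the new run only if another run exists and is younger than the cutoff."""
    if running_start_time is None:
        # Nothing else is running, so there is no reason to skip.
        return False
    # A run at or past the cutoff is treated as hung, so the new run starts anyway.
    return now - running_start_time < stale_after
```

This only papers over a hung run; it doesn't stop it, which is part of why it feels hacky.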

The best route is to make sure we have proper timeouts for all the flows, which I'm not sure we do at the moment.

smcclure17 commented 2 years ago

9e0e132e4235af7b6fa41a7371152d9dfebcb5a2 adds a 2-hour timeout to all the flows generated from the MainFlow, except for the UpdateParquetFiles flow, which is set to 1 hour. With these added, I feel comfortable that flows shouldn't hang and cause the following runs to be skipped.
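As a rough illustration of that timeout policy (not the code from the commit), the lookup below uses a default with a per-flow override. `DEFAULT_TIMEOUT`, `TIMEOUT_OVERRIDES`, and `timeout_for` are hypothetical names; how the value is passed to the flow runner is left out because it depends on the orchestrator.

```python
from datetime import timedelta

# Default applied to every flow generated from the MainFlow (per the comment above).
DEFAULT_TIMEOUT = timedelta(hours=2)

# Per-flow exceptions; only UpdateParquetFiles is mentioned in this thread.
TIMEOUT_OVERRIDES = {
    "UpdateParquetFiles": timedelta(hours=1),
}


def timeout_for(flow_name: str) -> timedelta:
    """Return the timeout for a flow, falling back to the default."""
    return TIMEOUT_OVERRIDES.get(flow_name, DEFAULT_TIMEOUT)
```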