eclipse-archived / smarthome

Eclipse SmartHome™ project
https://www.eclipse.org/smarthome/
Eclipse Public License 2.0
862 stars 783 forks source link

ESH Scheduler: Cron expression has no future #2105

Closed maggu2810 closed 7 years ago

maggu2810 commented 8 years ago

I am using the ESH scheduler (thanks for your work @kgoderis) and add a job using a cron expression. To execute the job every second I used this expression: * * * * * ?'

It runs for about two minutes and stops then

14:37:39.259 DEBUG o.e.s.c.p.PersistenceManager[:298] - Scheduled strategy everySecond with cron expression * * * * * ?
14:39:17.561 INFO  o.e.s.c.s.ExpressionThreadPoolManager[:178] - Expression '* * * * * ?' has no future executions anymore

@kgoderis Can you explain me, why this crone expression has no future executions anymore?

kgoderis commented 8 years ago

No, I have no clue why that is happening. I tested this kind of wild-fireing scenario during development. The issue lies within https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L209

The general approach is that a number of "next occurrence" candidates is generated (minimum 10, maximum 100), from which the first chronologically sorted occurrence is selected and returned.

I wondering if there is timing issue that arises because at one point in time dates/candidates are compared and pruned. Did you try with TRACE on org.eclipse.smarthome.core.scheduler ?

kgoderis commented 8 years ago

Ok - I think I found the issue. I think that in https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L209 the collection https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L40 is not cleared, but it rather continues from the candidates generated in a previous cycle. Either the collection is cleared, or some call to https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L134 and https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L85 has to be inserted in this method so that the collection of candidates is updated with new candidates using the new startdate. Thinking about it, https://github.com/eclipse/smarthome/blob/90c80c1ceb54546a9af83e775485cda1a038cee2/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/scheduler/AbstractExpression.java#L209 is a bit flawed. I think I was too enthusiastic at the time ;-)

And since maximum 100 candidates are generated, it consumes that last candidates after approx 100 seconds, or 1minute 40 seconds in your example

maggu2810 commented 8 years ago

This test will show you the error, too:

    @Test
    public void runForever() throws ParseException, InterruptedException {
        final ExpressionThreadPoolExecutor scheduler;
        scheduler = ExpressionThreadPoolManager.getExpressionScheduledPool("persist");
        final CronExpression expression;
        expression = new CronExpression("* * * * * ?");

        final AtomicInteger cnt = new AtomicInteger(0);

        scheduler.schedule(new Runnable() {

            @Override
            public void run() {
                System.out.println("new value: " + cnt.incrementAndGet());
            }
        }, expression);

        final long end = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);
        while (System.nanoTime() < end) {
            final int value = cnt.get();
            Thread.sleep(TimeUnit.SECONDS.toMillis(5));
            if (value == cnt.get()) {
                throw new RuntimeException(String.format("Value not changed anymore (%d)", value));
            }
        }
    }

It stops after new value: 99

kgoderis commented 8 years ago

Ah... That is in line with my hypotheses of exhausting 100 candidate dates..... In fact we use getdateafter() in an unintended way. Either the expression is parsed and we query a fixed candidate set, or either the candidate set is dynamic (our case), in which it needs to be updated in each call to getdateafter()

kgoderis commented 8 years ago

@maggu2810 Maybe there is a thing or two we need to discuss and decide upon. I reflected on the problem you encounter and the "problem" lies in the fact that in terms of expressions like cron, RFC 5545 and alike is that in some expression languages/definitions a real start date of the date expression is not defined. This is for example the case with cron expressions. On the other extreme, an DateExpression that represents a single fixed date in time, has a startdate that is firm. That means that in order to evaluate expressions, you need to couple them with a startdate

The actual code, when creating an expression, assumes that this startdate is fixed over the lifetime of the expression. Obviously, in the case of your example here above, and in the context of scheduling, it is not the case: you want to evaluate the expression as of the actual datetime. The algorithm and approach to identify dates that match the expression is built in such a way that candidate dates are seeded, grown and pruned from a set of generated dates. For some kind of expressions, like your example, the series of possible dates is infinite. This is not always the case, like in RFC5545. So, in order to keep the algorithm performing, I limit this to a maximum of 100 dates.

So, in essence there are two ways to deal with this:

  1. You set a startdate, evaluate an expression, and during the next call to getTimeAfter(Date afterTime), the startdate has to be moved forward, and the set of candidate dates has to be "expanded" again to the maximum of 100. (In this scenario, we do not not even need 100 candidates, as even 1 would suffice).
  2. You set a startdate, evaluate an expression, and consider the set of candidates to be definitive, which you query via getTimeAfter(Date afterTime), and silently hope that afterTime is not reaching beyond the 100th candidate. This is how the method works right now, and is the reason why it stalls after 100 iterations

My suggestion would be to add some additional parameter to getTimeAfter(Date afterTime) to steer the wanted behaviour, e.g. wether the startDate has to be altered if no candidate was found in a first try (e.g. you would set startdate to the value of afterTime). Alternatively, we define an abstract method on Expression like "boolean isFloatingStartDate()", which needs to be implemented by the implementing classes. It woud return true for CronExpression, false for RFC5545, .... and could be used to change the behaviour of getTimeAfter(Date afterTime).

I am not sure if I expressed thing well, but wdyt?

maggu2810 commented 8 years ago

Hi @kgoderis, thanks for your detailed analysis. I assume point 2 is a "no-go" because it will break cron expressions that hit more then 100 times.

I don't know much about the current scheduler code base, but if no candidate was found I assume you refer to "the 100 candidates of the first search are consumed completely". If the real search for candidates does not found anyone, I assume set the start date to a future date will not change anything. But if the found candidates are empty, we could start another run and only if this run does not find any new candidates the expression / timer expires. I don't read the RFC5545, but I assume also for non floating start dates there could be more then 100 results and we should not drop any of them after the first 100 are consumed.

Makes that sense at all?

kgoderis commented 8 years ago

I don't know much about the current scheduler code base, but if no candidate was found I assume you refer to "the 100 candidates of the first search are consumed completely". If the real search for candidates does not found anyone, I assume set the start date to a future date will not change anything.

Indeed, and in that case the getTimeAfter call should return null. And as you suggest, we first have to try to replenish the pool of candidates, and if none come up, we return the null value.

The question is: it is the responsibility of the calling party to first set (or re-set) the start date, and then do the call, or do we let the Expression handle this based on what it finds?

maggu2810 commented 8 years ago

Can we do something similar to this one?

    private void calcCandidates(Date startTime) throws ParseException {
        setStartDate(startTime);
        parseExpression(expression);
        Collections.sort(candidates);
    }

    @Override
    public Date getTimeAfter(Date afterTime) {
        try {
            if (candidates.isEmpty()) {
                calcCandidates(afterTime);
            }
            while (!candidates.isEmpty()) {
                final Date candidate = candidates.remove(0);
                if (candidate.after(afterTime)) {
                    // After removing the last element, we need to populate the next candidates.
                    calcCandidates(candidate);
                    return candidate;
                }
            }
        } catch (final ParseException ex) {
            // log...
        }
        return null;
    }

If the start time needs something that is explicit set, we need to recalculate candidates without setting the global start time.

If the candidates are not limited to 100 (ATM) then I assume that the getFinalFireTime is also not correct.

maggu2810 commented 8 years ago

Does this concept with that candidates work for cron expressions at all? The getTimeAfter(Date) is described this way: "Returns the next date/time after the given date/time which satisfies the expression."

But if we are using candidates and the lookup is done in the candidates (which could be unlimited for cron expressions) then we will only receive correct results if "afterTime" is never set in to some value in front of the start time or so. So it should also be stated in the documentation that afterTime must not jump in the past again.

kgoderis commented 8 years ago

Can we do something similar to this one?

mmh... it assumes that at each call of getTimeAfter, the afterTime is later than the previous afterTime from the previous call, no? It also has a bit of a performance issue because the set is recalculated all the time.

What if we change the strategy completely? and calculate a new candidate set with every call to getTimeAfter()?

we define "Set parseExpression(expression, afterTime)" that does the following in each "round"

  1. Parse the expression by expanding some seeds, taking into account startDate as "firm" date that does not change during this process, unless Expression.hasFloatingStartDate() return true. Then we do setStartDate(afterTime)
  2. Prune unnecessary candidates
  3. We set the minimum/maximum number of candidates to a/b with a and be to be defined
  4. If all candidates in the set are before afterTime (and the number of candidates is equals to b), we loop through 1. to 3. again with the last date from the just calculated set as initial seed. If we have less that b candidates in the set, then we know that we are at the end of the series

The result is a set of candidate set of dates that adhere to the expression and the startdate From that set, we return the first candidate after afterTime

kgoderis commented 8 years ago

If the candidates are not limited to 100 (ATM) then I assume that the getFinalFireTime is also not correct.

You have a point here, but I have to revisit what I have implemented (E.g. return null or alike in certain cases)

kgoderis commented 8 years ago

Does this concept with that candidates work for cron expressions at all?

I must admit that I had first developed the RFC5545 part, and then added in cron. I think that the solution lies in setting the startDate to afterTime in case of certain types of expressions, like cron, that do not define explicit start dates.

kgoderis commented 7 years ago

I am preparing fix for this