albrow / jobs

A persistent and flexible background jobs library for Go.
MIT License
499 stars 47 forks

Error when calling Destroy() from within a job handler function #26

Closed: epelc closed this issue 9 years ago

epelc commented 9 years ago

I've been seeing errors like the following lately. @albrow I was wondering if you could help me diagnose them.

ERR Error running script (call to f_7be0eed842fab7944ce1393b76bf8f46826c6656): @user_script:20: user_script:20: attempt to concatenate local 'Status' (a boolean value)

I found the line where this happens here.

I'm trying to destroy the job from within my handler function once it reaches a certain state. Is it not safe to do this?

When I look up the job id with hgetall, I just get the finished and time fields.

redis:6379[3]> hgetall jobs:bX8Vye9LGk80SDkbawj9qt3vvkpi
1) "finished"
2) "1435185638756948580"
3) "time"
4) "1435200038500419354"
redis:6379[3]> 

Is it supposed to leave the job in Redis after you destroy it? From the destroy_script.lua docs it doesn't sound like it.

albrow commented 9 years ago

@epelc, thank you for reporting this bug and I'm sorry that this is causing you problems. I wanted to let you know that I saw this and I'll be working on helping you as soon as I can. Unfortunately because of work I won't be able to spend any real time on this until Thursday or Friday.

albrow commented 9 years ago

@epelc I am unable to reproduce this. There is a test for Destroy that passes on my machine. If you are able to reproduce the bug consistently, would you mind telling me exactly how? If possible it would help tremendously to see some code.

albrow commented 9 years ago

To answer your questions:

  1. I cannot answer that question until I better understand what you're doing and what is causing the bug. I can imagine a scenario where it could cause problems if you try to destroy a job while it's still being executed by some worker.
  2. Yes, Destroy is supposed to remove all traces of the job from the database (see the redis-cli sketch just below this list). It sounds like some bug is preventing Destroy from finishing, and that's why you're still seeing some fields.
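
For concreteness, a fully destroyed job should leave nothing behind under its key. A quick redis-cli check against the example id from earlier in this issue (purely an illustration of the expected end state, not output captured from this bug) would look like:

redis:6379[3]> HGETALL jobs:bX8Vye9LGk80SDkbawj9qt3vvkpi
(empty list or set)
redis:6379[3]> EXISTS jobs:bX8Vye9LGk80SDkbawj9qt3vvkpi
(integer) 0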

As an aside, it appears that the bug is occurring in the middle of a transaction. Redis is not ACID, so it is possible for a transaction to partially execute, leaving the database in an inconsistent state. If I understand correctly, the redis-check-aof tool will help remedy this and put the database back into a consistent state.

epelc commented 9 years ago

@albrow Just got back from lunch. I'll explain what I'm doing a bit more before I try to get it reproduced.

Here's some pseudocode for the job handler function:

func updateTrackingJob(shipmentId bson.ObjectId) error {
    // Look up the shipment info in MongoDB so we can get the job id.
    // When we register the job we save its id to our shipment.
    s := findShipment(shipmentId)

    // do some work

    // If the work returns a certain case we need to destroy the job
    // and remove it from our shipment.
    j, err := jobs.FindById(s.JobId)
    if err != nil {
        return err
    }

    err = j.Destroy()
    if err != nil {
        return err
    }

    // Remove the job id from our shipment.

    return nil
}
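
For context on how a handler like this gets wired up, here is a rough sketch assuming the jobs.RegisterType / Type.Schedule API shown in the library's README; the type name, retry count, priority, Shipment struct, saveShipment helper, and the Id() accessor are all illustrative assumptions rather than the actual code behind this report.

var trackingJob *jobs.Type

func registerTrackingJob() error {
    // Register the handler under a job type name with a retry count.
    var err error
    trackingJob, err = jobs.RegisterType("updateTracking", 3, updateTrackingJob)
    return err
}

// scheduleTracking schedules the job for a shipment and saves the job id on the
// shipment so the handler can later find (and destroy) it via jobs.FindById.
func scheduleTracking(s *Shipment) error {
    j, err := trackingJob.Schedule(100, time.Now().Add(time.Hour), s.Id)
    if err != nil {
        return err
    }
    s.JobId = j.Id()       // assumed accessor for the job's id
    return saveShipment(s) // hypothetical helper that persists the shipment in MongoDB
}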
epelc commented 9 years ago

@albrow I haven't been able to reliably reproduce it 100%, but I think I found the cause.

The worker's doJob function doesn't handle jobs which destroy themselves during execution. https://github.com/albrow/jobs/blob/master/worker.go#L92 Note that it always sets the time and finished fields. It also adds the job back to the jobs:time set, which I think causes it to run again without any info in the status field, which in turn causes the error in the destroy script.

hget returns nil when the field doesn't exist, which I think evaluates as falsey in Lua.

From the Redis HGET docs: "Return value: Bulk string reply: the value associated with field, or nil when field is not present in the hash or key does not exist."
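
As a quick illustration of that behaviour (this is a throwaway one-liner against the example key from earlier, not the library's actual destroy script, and the exact error wording varies by Redis version), a missing hash field comes back to the Lua script as a boolean false, and concatenating it produces the same class of error:

redis:6379[3]> EVAL "local Status = redis.call('HGET', KEYS[1], 'status') return 'status: ' .. Status" 1 jobs:bX8Vye9LGk80SDkbawj9qt3vvkpi
(error) ERR Error running script (call to f_...): @user_script:1: user_script:1: attempt to concatenate local 'Status' (a boolean value)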

If you run this, you can see that it does set the time and finished fields, and puts the job back in jobs:time, after you destroy the job in the handler.

https://gist.github.com/epelc/3512e1e40e6cfc3cefc3 Note: use the -vjl flag to see it find the id and destroy the jobs. Also, it uses db number 4, so run SELECT 4 to get to it.
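
For anyone inspecting the leftover state after running the gist, these two redis-cli commands should make the symptom visible (key names as described above; <job-id> is whatever id the -vjl output reports). HGETALL should come back with only the stale finished and time fields, and ZSCORE should return a score rather than (nil), meaning the destroyed job has been queued to run again:

redis:6379[4]> HGETALL jobs:<job-id>
redis:6379[4]> ZSCORE jobs:time <job-id>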

Let me know what your thoughts are.

albrow commented 9 years ago

@epelc okay I see what's happening now. I didn't anticipate anyone destroying a job in a job handler while it was being executed. I have a fix in mind and will be creating a PR soon. When the PR is up, I would appreciate it if you could take a look and make sure it solves your problem.

epelc commented 9 years ago

I'll definitely test it out. Just let me know when it's ready.

albrow commented 9 years ago

@epelc please check out #27. On that branch, I can run https://gist.github.com/epelc/3512e1e40e6cfc3cefc3 and get the expected results. That is, there are no remnants of the jobs if they are destroyed from inside a handler, and there are no errors or panics.