brianlmoon / net_gearman

A PHP interface for Gearman

Regarding a Binary Argument and the Gearman MySQL Queue #27

Closed bartclarkson closed 9 years ago

bartclarkson commented 9 years ago

Hello Brian -

Awesome stuff. The following may not be a net_gearman "issue" per se, and I outline a resolution below, but I can't be the only one who has face-planted on it. Wondering what you can say about it.

When it comes to using Gearman with the MySQL queue turned on, I have found that the only way I can successfully pass binary data as one of multiple arguments is to base64_encode the data first and decode it later in the worker. The string in question is read from a PDF file.

Without forcing the string into a longer, UTF-8-safe form like that, the job just sits in MySQL with an empty "data" field, and the worker fails because the arguments never arrive.

Does this sound about right to you? I'm not overjoyed by the extra base64 processing. Now that I have a grasp of a solution, I may well shove the binary data into a different table as a proper blob and pass the id of that row as the argument instead.

But that just seems wrong. The argument field is already a long blob. Perhaps the reason tutorials show binary objects as arguments is that they use no extra persistence mechanism, or use memcache?

Anyway, thanks, one way or the other!

brianlmoon commented 9 years ago

I have the same problem with image data; mine goes in the other direction.

net_gearman does have the issue that it always returns an array. I plan to fix that soon.

As for your issue, binary data is likely not UTF-8 safe. The MySQL table should really use a blob, not a text column, for the data. I don't use the MySQL queue personally.

We use an NFS mount of a GlusterFS cluster to store large files that have to be passed around.

bartclarkson commented 9 years ago

Cool. I've opted to avoid an NFS situation as this code deploys to an autoscaling AWS group of EC2s, and the robustness/redundancy we're after seems best served with RDS acting as the MySQL backend to gearman in the context of how AWS is meant to glue together.

S3 is in the mix in terms of storing the original file and the final files, but the PDF is burst into multiple pages, each of which is then immediately farmed out for asynchronous job processing, and that makes the relative lag of S3 unappealing for the intermediary steps.

Appreciate your thoughts, and I'll keep an eye out on your refactor to the array handling.

brianlmoon commented 9 years ago

I will look into net_gearman a bit and see if there is something going on there. The table I saw an example of uses a longblob for the data column. That should not have any issues with binary data. We store images in blobs all the time.

It's possible I suppose that something in gearmand is messing it up. Not sure.

Are you passing the data in as a blob or array?

Brian.

bartclarkson commented 9 years ago

I'll share the salient bits, since a coder wants to see code, and for good reason.

First is the manager class that I use to wrap GearmanManager. It's a simple convenience for doing dependency injection relative to the deployment environment, à la Symfony2.

namespace GearmanManager\GearmanManagerBundle\Services;

use Symfony\Component\HttpKernel\Exception\HttpException as HttpException;

require_once '/usr/share/php/Net/Gearman/Client.php';

class GearmanManagerManager
{
    protected $job_server;

    public function __construct($GearmanManagerBundleConfig)
    {
        $this->job_server = $GearmanManagerBundleConfig['job_server'];
    }

    public function addBackgroundJob($function, $data)
    {
        // Net_Gearman_Client exposes registered job names as dynamic methods.
        $gmclient = new \Net_Gearman_Client($this->job_server);
        try {
            $result = $gmclient->$function($data);
        } catch (\Exception $e) {
            // Symfony's HttpException takes the status code first, then the message.
            throw new HttpException(503, $e->getMessage(), $e);
        }
    }
}

And then here's the code in question that uses it. "local_volume" is the tmp directory the worker machines write the PDF page to before calling a CLI command to hammer the thing.

// ...
            $this->gearman_manager_manager->addBackgroundJob('ProcessDocumentPage',
                array(
                    'document_id' => $this->document_id,
                    'page_number' => $page,
                    'pdf_page_string' => base64_encode($pdf_page_string),
                    'local_volume' => $this->local_volume
                )
            );
// ...

And then the first bit of the worker.

class Net_Gearman_Job_ProcessDocumentPage extends \Net_Gearman_Job_Common {
    public function run($args) {
        $document_id = $args['document_id'];
        $page_number = $args['page_number'];
        // Decode the base64 payload back into the raw PDF page bytes.
        $pdf_page_string = base64_decode($args['pdf_page_string']);
        $local_volume = $args['local_volume'];
        // ...
    }
}
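
For what it's worth, when not running under GearmanManager, a standalone Net_Gearman_Worker would pick this class up with something like the following rough sketch; the require path and server address are just placeholders for my setup.

require_once '/usr/share/php/Net/Gearman/Worker.php';

// Sketch only, assuming the stock Net_Gearman_Worker API and a local gearmand.
$worker = new Net_Gearman_Worker(array('127.0.0.1:4730'));
$worker->addAbility('ProcessDocumentPage'); // resolves to Net_Gearman_Job_ProcessDocumentPage
$worker->beginWork();
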
bartclarkson commented 9 years ago

When I inspect the content of the "data" field in mysql, which is definitely typed as "long blob", I see this kind of thing:

{"document_id":"29","page_number":"2", "pdf_page_string": "a-lot-of-characters", "local_volume": "\/tmp"}

That looks for all the world like the kind of thing json_encode($some_keyed_array) produces in PHP.

I've also only ever really used blob fields where the blob was truly just One Thing. My favorite database application, Sequel Pro, even has built-in functionality for viewing a blob as the file it actually is, which doesn't have a prayer of succeeding if the blob isn't One Thing.

So I don't know that I can really do any better in my use case. Seems like the nature of the MySQL implementation.

If I set $args equal to $pdf_page_string, it might work. I guess that's what you're getting at when you're talking about passing an array vs the variable value? That would have never occurred to me. But I'd be hosed inasmuch as I need the other data points to do the job.

brianlmoon commented 9 years ago

Well, I would guess that json_encode is messing up the data. JSON requires valid UTF-8 data. You could try using serialize instead; your worker would need to unserialize.
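
A quick illustration of the difference, using nothing but stock PHP (the sample bytes are made up); whether gearmand and the MySQL queue then pass the serialized payload through untouched is a separate question:

// json_encode() requires valid UTF-8, so raw binary makes it fail outright
// (PHP 5.5+ returns false with JSON_ERROR_UTF8); serialize() is length-prefixed
// and therefore binary-safe.
$binary = "%PDF-1.4\x00\xff\xfe";                             // stand-in for the raw page bytes

var_dump(json_encode(array('pdf_page_string' => $binary)));   // bool(false)

$packed   = serialize(array('pdf_page_string' => $binary));
$restored = unserialize($packed);
var_dump($restored['pdf_page_string'] === $binary);           // bool(true)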

Brian.

bartclarkson commented 9 years ago

Read my mind. I was just looking at that, and at line 209 of master/Net/Gearman/Client.php.

But it's the same deal for serialize. There's a four-year-old comment on function.serialize.php at php.net about base64-encoding the binary portion of the object to be serialized.

It's quite a stretch to assign responsibility to Client.php for iterating over $task->arg, detecting whether a given nested value is binary with something like ctype_print(), base64_encoding it, and then somehow appending data to trigger a later decode before passing it to the worker. I suppose it's possible, though.
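
Just to make concrete what I mean, a purely hypothetical sketch (none of these names exist in Net_Gearman, the flag convention is made up, and it leans on the mbstring extension):

// Hypothetical helper: walk the args, base64-encode anything that is not
// valid UTF-8, and record which keys were encoded so the worker side knows
// to decode them.
function encodeBinaryArgs(array $args)
{
    $encoded_keys = array();
    foreach ($args as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $args[$key]     = base64_encode($value);
            $encoded_keys[] = $key;
        }
    }
    $args['_base64_keys'] = $encoded_keys;   // made-up convention for the worker
    return $args;
}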

I can't really find a reasonable critique of the gearmand MySQL approach, either. It invites too many questions to imagine gearmand modifying the MySQL integration so that multiple argument values are each written to a unique row as blobs.

There's nothing stopping a given developer from getting everything he or she needs out of the present approach. Either nest a scalar that tells the worker the location of the binary file/string, or base64_encode that sucker into one big scalar string.
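
For anyone finding this later, the first option looks roughly like this; the $pageStore helper is purely illustrative, not part of net_gearman:

// Sketch only: persist the raw page bytes somewhere the workers can reach
// (a BLOB table, S3, shared disk) and pass a scalar reference through Gearman.
$page_id = $pageStore->save($pdf_page_string);   // e.g. INSERT into a LONGBLOB column

$this->gearman_manager_manager->addBackgroundJob('ProcessDocumentPage', array(
    'document_id'  => $this->document_id,
    'page_number'  => $page,
    'pdf_page_id'  => $page_id,                  // reference instead of raw bytes
    'local_volume' => $this->local_volume,
));

// The worker then loads the bytes itself:
// $pdf_page_string = $pageStore->load($args['pdf_page_id']);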

Thanks so much for your time, Brian. It's a great project, and has helped me immensely.