elasticquent / Elasticquent

Maps Laravel Eloquent models to Elasticsearch types
MIT License
1.35k stars 400 forks source link

Processing a big data #109

Open alongmuaz opened 8 years ago

alongmuaz commented 8 years ago

Hi,

I have a few table in database, total size is 16 gigabytes ; 23,000,000 rows .

I deploy it on AWS ECE2 ; 1gb ram .

If anyone get an idea how to index all of it with fast as possible?

thank you

scripta55 commented 8 years ago

I have the same situation... the model::addAllToIndex causes a 502 gateway, basically times out

alongmuaz commented 8 years ago

the timeout error is cause by server, the server cant query all the data simultaneously , there is must a proper way to index the big data .

Dmitri10 commented 8 years ago

I don't know better solution than using chunks for big data. I've got own question with example - you may take example from there

alongmuaz commented 8 years ago

Hi @Dmitri10 , thanks for your reply , i have 23 millions of rows data , I try to add to index , 1 hour able to index around 10k record, can you imagine how many days to take for 23 millions of data?

scripta55 commented 8 years ago

@alongmuaz 10 thousand in one hour sounds too long, i managed to sort my issue out, and this was the solution. 446 thousand records took about 3mins. This was in a command i could execute on the server by the way.

` /*ORGS*******/ $orgsApi = new OrgsAPI(); $this->info('Starting elastic indexing....');

    $orgsApi->esCreateIndex();

    $orgCount = DB::table('orgs')->count();

    $this->info("Number of orgs: " . $orgCount);
    $this->info('Adding org documents...');
    $bar = $this->output->createProgressBar(1);

        $orgsAmountPerChunk  = 1000;
        $orgsApi->esLoadData($orgsAmountPerChunk);

        $this->comment("HP: Orgs Indexed");

        $bar->advance();

    $bar->finish();
    $this->info(' HP Orgs Indexing Complete!');

    $check          = new DailyChecks();
    $check->date    = date('Y-m-d');
    $check->process = "hp index - orgs";
    $check->save();`
alongmuaz commented 8 years ago

Hi @scripta55 , do you run these code using php artisan tinker? thank you

scripta55 commented 8 years ago

@alongmuaz yes, check this out: https://laravel.com/docs/5.3/artisan#writing-commands

Dmitri10 commented 8 years ago

@alongmuaz, @scripta55 is right - your operation is too long for 10k records. For example I've got extra logic

Route::get('fullRefreshIndex', function () {
    set_time_limit(0);
    ini_set('max_execution_time', 0);

    try {
        \App\Models\User::deleteIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('no such index');
    }

    try {
        \App\Models\User::createIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('cant create index');
    }

    $cards = \App\Models\CardType::whereStatus('active')->get();
    $cardsArray = [];
    $chunkSize = config('app.chunk_user_model');
    foreach ($cards as $card) {
        $cardsArray[] = $card->id;
    }
    $users = \App\Models\User::whereStatus('active')
        ->whereIn('card_type_id', $cardsArray);

    $users->chunk($chunkSize, function ($users) {
        print_r('new chunk');
        echo '<br/>';

//        $users->each(function ($user) {
//            $user->addToIndex();
//        });
/* just saw that it would be a little faster than looping */
         $users->addToIndex();
    });
});

and for 160k records it took 4 minutes... @scripta55, could you show us your function esLoadData please? $orgsApi->esLoadData($orgsAmountPerChunk);

scripta55 commented 8 years ago

@Dmitri10 this could be slowing it down?

    $users->each(function ($user) {
        $user->addToIndex();
    });

You can use the bulk function that basically takes the chunk and inserts into elastic, also i noticed the more you chunk the more time the worker takes to load into memory before inserting, so as low as possible as you can go on the chunk the better.

as for function in esLoadData:

ReviewsElasticSearch::chunk($chunkamount, function ($flights) { $flights->addToIndex(); });

Dmitri10 commented 8 years ago

@scripta55 Thanks. Yes, I updated my code before your answer (in comments) and showed another example with using addToIndex to builder 4 days ago but little chunks don't work faster for me, so if you use nothing new.. then it seems your server has better RAM 👍

scripta55 commented 8 years ago

@Dmitri10 possibly so, i ran this test on a virtual box with 2 gigs ram

you can directly index from sql to elastic, that could work for you? using JDBC it is what i intend to do at a later stage

Dmitri10 commented 8 years ago

@scripta55 Yes, I've got only 1 gb ram) thanks a lot for JDBC, I didn't know about it anything!

scripta55 commented 8 years ago

Enjoy! it would be a great solution without the overhead of laravel

alongmuaz commented 8 years ago

@scripta55 , you mean Logstash - JDBC?

scripta55 commented 8 years ago

this one https://github.com/jprante/elasticsearch-jdbc

scripta55 commented 8 years ago

There are cool examples with it, its standalone; it runs on your server/box independently

Dmitri10 commented 8 years ago

Cool library except one thing - there are too much opened issues now and you may catch one of them and wait fix...

scripta55 commented 8 years ago

@Dmitri10 true, however most of those issues are just lack of understanding from most users. Its a really straight forward solution to execute :) When i tried it out, i had lots of questions simply because my mindset was not on track with how it works and what it simply is supposed to do. give it a go!

Dmitri10 commented 8 years ago

Thanks one more time anyway! If i have free time I'll try it.

alongmuaz commented 8 years ago

@scripta55 , where the OrgsAPI() referring to?

Dmitri10 commented 8 years ago

@alongmuaz , It's his own class where he added his functions esCreateIndex(), esLoadData(), etc.

And he told you about main function for indexing data using model and chunks:

as for function in esLoadData: ReviewsElasticSearch::chunk($chunkamount, function ($flights) { $flights->addToIndex(); });

alongmuaz commented 8 years ago

hai @scripta55 , i got this error when using your codes :

[Elasticsearch\Common\Exceptions\ServerErrorResponseException]

scripta55 commented 8 years ago

hi @alongmuaz are you able to curl to you able to curl to the elastic server? from outside your homestead or http get from elastic

scripta55 commented 8 years ago

your result should be like this: http://d.pr/i/1hExn

alongmuaz commented 8 years ago

hi @scripta55 , yes , actually according to aws es , theres data inserted into nodes . suddenly when running your codes after 10 minutes .

scripta55 commented 8 years ago

http://192.168.10.10:9200/YOUR-INDEX-NAME/_count

please share the results you get from there. Also how many records are you intending to index into elastic?

alongmuaz commented 8 years ago

http://d.pr/i/1cTMd

scripta55 commented 8 years ago

did you load 2775000 on the first try? what is your total? and server memory/space is not all used up?

alongmuaz commented 8 years ago

yes , total is 3718988 , ram 4gb .

scripta55 commented 8 years ago

mhh... please provide your trace?

alongmuaz commented 8 years ago

i did not understand, what is the trace? forgive me :D

scripta55 commented 8 years ago

stack trace :) error report

alongmuaz commented 8 years ago

@scripta55 , finally https://d.pr/P8IW , thanks!

scripta55 commented 8 years ago

@alongmuaz did you just rerun? and it carried on or you found out what the issue was?

alongmuaz commented 8 years ago

@scripta55 , I just rerun, today i try to index another table and there is error :

[Elasticsearch\Common\Exceptions\RequestTimeout408Exception] Trace error :

[2016-09-09 03:11:09] local.ERROR: exception 'Elasticsearch\Common\Exceptions\RequestTimeout408Exception' in /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:570 Stack trace:

0 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(262): Elasticsearch\Connections\Connection->process4xxError(Array, Array, Array)

1 /var/www/vendor/react/promise/src/FulfilledPromise.php(25): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array)

2 /var/www/vendor/guzzlehttp/ringphp/src/Future/CompletedFutureValue.php(55): React\Promise\FulfilledPromise->then(Object(Closure), NULL, NULL)

3 /var/www/vendor/guzzlehttp/ringphp/src/Core.php(341): GuzzleHttp\Ring\Future\CompletedFutureValue->then(Object(Closure), NULL, NULL)

4 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(283): GuzzleHttp\Ring\Core::proxy(Object(GuzzleHttp\Ring\Future\CompletedFutureArray), Object(Closure))

5 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(159): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array, Object(Elasticsearch\Connections\Connection), Object(Elasticsearch\Transport), Array)

6 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Transport.php(106): Elasticsearch\Connections\Connection->performRequest('POST', '/_all/_bulk', Array, '{"index":{"_id"...', Array, Object(Elasticsearch\Transport))

7 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Endpoints/AbstractEndpoint.php(80): Elasticsearch\Transport->performRequest('POST', '/_all/_bulk', Array, '{"index":{"_id"...', Array)

8 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Client.php(744): Elasticsearch\Endpoints\AbstractEndpoint->performRequest()

9 /var/www/vendor/elasticquent/elasticquent/src/ElasticquentCollectionTrait.php(40): Elasticsearch\Client->bulk(Array)

10 /var/www/app/Muaz/OrderApi.php(15): Elasticquent\ElasticquentCollection->addToIndex()

11 [internal function]: App\Muaz\OrderApi->App\Muaz{closure}(Object(Elasticquent\ElasticquentCollection))

12 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Builder.php(372): call_user_func(Object(Closure), Object(Elasticquent\ElasticquentCollection))

13 [internal function]: Illuminate\Database\Eloquent\Builder->chunk(1000, Object(Closure))

14 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Model.php(3526): call_user_func_array(Array, Array)

15 [internal function]: Illuminate\Database\Eloquent\Model->__call('chunk', Array)

16 [internal function]: App\Order->chunk(1000, Object(Closure))

17 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Model.php(3540): call_user_func_array(Array, Array)

18 /var/www/app/Muaz/OrderApi.php(16): Illuminate\Database\Eloquent\Model::__callStatic('chunk', Array)

19 /var/www/app/Muaz/OrderApi.php(16): App\Order::chunk(1000, Object(Closure))

20 /var/www/app/Console/Commands/OrderEs.php(54): App\Muaz\OrderApi->esLoadData(1000)

21 [internal function]: App\Console\Commands\OrderEs->handle()

22 /var/www/vendor/laravel/framework/src/Illuminate/Container/Container.php(507): call_user_func_array(Array, Array)

23 /var/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(169): Illuminate\Container\Container->call(Array)

24 /var/www/vendor/symfony/console/Command/Command.php(256): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

25 /var/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(155): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

26 /var/www/vendor/symfony/console/Application.php(794): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

27 /var/www/vendor/symfony/console/Application.php(186): Symfony\Component\Console\Application->doRunCommand(Object(App\Console\Commands\OrderEs), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

28 /var/www/vendor/symfony/console/Application.php(117): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

29 /var/www/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(107): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

30 /var/www/artisan(36): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

31 {main}

scripta55 commented 8 years ago

@alongmuaz that looks like a time out, are you running the command where the db and elastic is on the same box or over http

also, decrease the amount you are chunking, that could help

alongmuaz commented 8 years ago

@scripta55 , hi, sorry for late, i create new helpers function and use the Models , where inside models have elastic in it . Thanks for help .

alongmuaz commented 8 years ago

@scripta55 , your progress bar not working :)

alongmuaz commented 8 years ago

i got new error :

Elasticsearch\Common\Exceptions\ServerErrorResponseException

[2016-09-13 06:00:09] local.ERROR: exception 'Elasticsearch\Common\Exceptions\ServerErrorResponseException' in /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:673 Stack trace:

0 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(636): Elasticsearch\Connections\Connection->tryDeserializeError(Array, 'Elasticsearch\C...')

1 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(599): Elasticsearch\Connections\Connection->tryDeserialize500Error(Array)

2 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(265): Elasticsearch\Connections\Connection->process5xxError(Array, Array, Array)

3 /var/www/vendor/react/promise/src/FulfilledPromise.php(25): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array)

4 /var/www/vendor/guzzlehttp/ringphp/src/Future/CompletedFutureValue.php(55): React\Promise\FulfilledPromise->then(Object(Closure), NULL, NULL)

5 /var/www/vendor/guzzlehttp/ringphp/src/Core.php(341): GuzzleHttp\Ring\Future\CompletedFutureValue->then(Object(Closure), NULL, NULL)

6 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(283): GuzzleHttp\Ring\Core::proxy(Object(GuzzleHttp\Ring\Future\CompletedFutureArray), Object(Closure))

7 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(159): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array, Object(Elasticsearch\Connections\Connection), Object(Elasticsearch\Transport), Array)

8 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Transport.php(106): Elasticsearch\Connections\Connection->performRequest('POST', '/_all/_bulk', Array, '{"index":{"_id"...', Array, Object(Elasticsearch\Transport))

9 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Endpoints/AbstractEndpoint.php(80): Elasticsearch\Transport->performRequest('POST', '/_all/_bulk', Array, '{"index":{"_id"...', Array)

10 /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Client.php(744): Elasticsearch\Endpoints\AbstractEndpoint->performRequest()

11 /var/www/vendor/elasticquent/elasticquent/src/ElasticquentCollectionTrait.php(40): Elasticsearch\Client->bulk(Array)

12 /var/www/app/Muaz/OrderApi.php(15): Elasticquent\ElasticquentCollection->addToIndex()

13 [internal function]: App\Muaz\OrderApi->App\Muaz{closure}(Object(Elasticquent\ElasticquentCollection))

14 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Builder.php(372): call_user_func(Object(Closure), Object(Elasticquent\ElasticquentCollection))

15 [internal function]: Illuminate\Database\Eloquent\Builder->chunk(500, Object(Closure))

16 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Model.php(3526): call_user_func_array(Array, Array)

17 [internal function]: Illuminate\Database\Eloquent\Model->__call('chunk', Array)

18 [internal function]: App\Order->chunk(500, Object(Closure))

19 /var/www/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Model.php(3540): call_user_func_array(Array, Array)

20 /var/www/app/Muaz/OrderApi.php(16): Illuminate\Database\Eloquent\Model::__callStatic('chunk', Array)

21 /var/www/app/Muaz/OrderApi.php(16): App\Order::chunk(500, Object(Closure))

22 /var/www/app/Console/Commands/OrderEs.php(59): App\Muaz\OrderApi->esLoadData(500)

23 [internal function]: App\Console\Commands\OrderEs->handle()

24 /var/www/vendor/laravel/framework/src/Illuminate/Container/Container.php(507): call_user_func_array(Array, Array)

25 /var/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(169): Illuminate\Container\Container->call(Array)

26 /var/www/vendor/symfony/console/Command/Command.php(256): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

27 /var/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(155): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

28 /var/www/vendor/symfony/console/Application.php(794): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

29 /var/www/vendor/symfony/console/Application.php(186): Symfony\Component\Console\Application->doRunCommand(Object(App\Console\Commands\OrderEs), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

30 /var/www/vendor/symfony/console/Application.php(117): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

31 /var/www/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(107): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

32 /var/www/artisan(36): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))

33 {main}

fosron commented 7 years ago

Ok, so from my experience, it's a better idea to use official php elastic client's bulk feature + eloquent chunking. Simple foreaching and addToIndex took 3 minutes on 13k products (with alot of relations, something like 10 querys on 1 product), with chunking it got to 1 min 30 s and with chunking + bulk it got down to 60 - 70 s. Elasticquent really needs to add ability to bulk update, i'm thinking about doing a pull request to address this.

andrewmclagan commented 7 years ago

For those concerned.

You should chunk your index requests using eloquent ->chunk() functionality. The size of your chunks will depend on the resources allocated to your Elasticsearch installation.

For example: We have only a single index, that index holds about 70,000 - 100,000 documents. We run an Elasticsearch cluster of 6 nodes (3 data, 3 master) each node as 2.5GB allocated, across the physical machines, one master and one data node per machine. When indexing 70,000 documents we chunk our results to segments of 1000 per index request. This is allot but we can do this because we also have a loadbalancer that distributes the workload among our cluster.

Mostly on single node Elasticsearch setups with less then 2GB of memory indexing is an intense operation. On a single node system your 1 node has to: index documents, serve the HTTP API, store the data to memory and perform garbage collection. On a clustered setup these roles are seperated among nodes. Therefore you should be breaking your index chunks into a MUCH smaller size ~50 to ~100 per request. ES will actually queue the index requests internally so try not to push index requests onto a Laravel queue. Do it all synchronously. It will take a while.

Basically Elasticsearch is an "in memory" search platform, that means if you have allot of documents you need ALLOT of memory. The op says:

16 gigabytes ; 23,000,000 rows on 1GB AWS EC2

This is just not going to happen IMO. I would go for something like 8GB minimum preferably 16GB. Remember to configure your ES_HEAP_SIZE to no more then HALF your system memory or it will cause serious issues as your operating system will run dry of memory. Further more Elasticsearch is first and foremost a clustered technology, production setups should always be in a cluster and not a single node.

I say this through years of hard won trial and error on high volume production ES systems. If you want go ahead and learn this again yourself, but you will end up in the same place.

If anyone is interested my latest here is a gist with our indexing service:

https://gist.github.com/andrewmclagan/7525400afe2e6ecd74c432e7ca40fd41

the crux of the chunking routine is found here:

    /**
     * Runs an operation on a collection
     *
     * @param \Illuminate\Support\Collection $collection
     * @param Boolean $delete
     * @return Array
     */
    protected function collectionOperation(Collection $collection, $delete = false)
    {
        $chunks = $collection->chunk($this->chunkSize);
        $results = [];
        $counter = 1;
        $totalChunks = $chunks->count();
        dump("Chunks discovered: {$chunks->count()} with a size of: {$this->chunkSize}");
        foreach ($chunks as $chunk) {
            dump("Executing chunk: {$counter}/{$totalChunks}");
            $params = $this->getParamsByCollection($chunk, $delete);
            $result = $this->elasticsearch->bulk($params);
            $results = array_merge($results, $result); // this is faulty we need a proper merge routine.
            $counter++;
        }
        return $results;
    }
phil-la commented 4 years ago

Hi there,

long time passed by. Is there option since then to use update by query api? https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html