alongmuaz opened this issue 8 years ago
I have the same situation... calling `Model::addAllToIndex` causes a 502 Bad Gateway - basically it times out.
The timeout error is caused by the server - it can't query all the data at once. There must be a proper way to index big data.
I don't know a better solution than using chunks for big data. I've got my own question with an example - you may take the example from there.
Hi @Dmitri10, thanks for your reply. I have 23 million rows of data that I'm trying to add to the index; in 1 hour I was able to index around 10k records. Can you imagine how many days it would take for 23 million rows?
@alongmuaz 10 thousand in one hour sounds too long. I managed to sort my issue out, and this was the solution: 446 thousand records took about 3 minutes. This was in a command I could execute on the server, by the way.
```php
/* ORGS */
$orgsApi = new OrgsAPI();
$this->info('Starting elastic indexing....');
$orgsApi->esCreateIndex();

$orgCount = DB::table('orgs')->count();
$this->info("Number of orgs: " . $orgCount);
$this->info('Adding org documents...');

$bar = $this->output->createProgressBar(1);
$orgsAmountPerChunk = 1000;
$orgsApi->esLoadData($orgsAmountPerChunk);
$this->comment("HP: Orgs Indexed");
$bar->advance();
$bar->finish();
$this->info(' HP Orgs Indexing Complete!');

$check = new DailyChecks();
$check->date = date('Y-m-d');
$check->process = "hp index - orgs";
$check->save();
```
Hi @scripta55, do you run this code using `php artisan tinker`? Thank you.
@alongmuaz yes, check this out: https://laravel.com/docs/5.3/artisan#writing-commands
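For context, a minimal command skeleton would look roughly like this (the class name, signature, and description below are illustrative, not my exact code); the indexing snippet above goes inside `handle()`:

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

class IndexOrgs extends Command
{
    // Run with: php artisan elastic:index-orgs
    protected $signature = 'elastic:index-orgs';

    protected $description = 'Create the orgs index and load documents into Elasticsearch';

    public function handle()
    {
        // The chunked indexing code from the comment above would go here.
        $this->info('Starting elastic indexing....');
    }
}
```

Register it in the `$commands` array of `app/Console/Kernel.php` and you can run it on the server like any other artisan command.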
@alongmuaz, @scripta55 is right - your operation is taking too long for 10k records. For example, I've got some extra logic:
```php
Route::get('fullRefreshIndex', function () {
    set_time_limit(0);
    ini_set('max_execution_time', 0);

    try {
        \App\Models\User::deleteIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('no such index');
    }

    try {
        \App\Models\User::createIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('cant create index');
    }

    $cards = \App\Models\CardType::whereStatus('active')->get();
    $cardsArray = [];
    $chunkSize = config('app.chunk_user_model');

    foreach ($cards as $card) {
        $cardsArray[] = $card->id;
    }

    $users = \App\Models\User::whereStatus('active')
        ->whereIn('card_type_id', $cardsArray);

    $users->chunk($chunkSize, function ($users) {
        print_r('new chunk');
        echo '<br/>';

        // $users->each(function ($user) {
        //     $user->addToIndex();
        // });

        /* just saw that it would be a little faster than looping */
        $users->addToIndex();
    });
});
```
and for 160k records it took 4 minutes...
@scripta55, could you show us your esLoadData function, please?
`$orgsApi->esLoadData($orgsAmountPerChunk);`
@Dmitri10 could this be slowing it down?

```php
$users->each(function ($user) {
    $user->addToIndex();
});
```
You can use the bulk function, which basically takes the whole chunk and inserts it into Elasticsearch in one request. I also noticed that the larger your chunks, the more time the worker takes to load them into memory before inserting, so the smaller you can go on the chunk size, the better.
As for the function in esLoadData:

```php
ReviewsElasticSearch::chunk($chunkamount, function ($flights) {
    $flights->addToIndex();
});
```
@scripta55 Thanks. Yes, I updated my code before your answer (in the comments) and showed another example using addToIndex on the builder 4 days ago, but small chunks don't work any faster for me. So if you're not doing anything different, it seems your server just has better RAM 👍
@Dmitri10 possibly so - I ran this test on a virtual box with 2 GB of RAM.
You can index directly from SQL into Elasticsearch using JDBC - could that work for you? It's what I intend to do at a later stage.
@scripta55 Yes, I've got only 1 GB of RAM :) Thanks a lot for the JDBC tip, I didn't know anything about it!
Enjoy! It would be a great solution without the overhead of Laravel.
@scripta55, you mean Logstash JDBC?
There are cool examples with it. It's standalone; it runs on your server/box independently.
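A rough sketch of what such a Logstash pipeline could look like (the driver path, credentials, table, and index name below are placeholders, not taken from this thread):

```
input {
  jdbc {
    jdbc_driver_library    => "/opt/logstash/mysql-connector-java.jar"
    jdbc_driver_class      => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user              => "dbuser"
    jdbc_password          => "secret"
    statement              => "SELECT * FROM orgs"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "orgs"
    document_id => "%{id}"
  }
}
```

Logstash then reads straight from the database and pushes documents into Elasticsearch without going through Laravel at all.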
It's a cool tool except for one thing - there are too many open issues right now, and you may hit one of them and have to wait for a fix...
@Dmitri10 True, however most of those issues are just a lack of understanding on the users' part. It's a really straightforward solution to execute :) When I tried it out, I had lots of questions simply because my mindset wasn't on track with how it works and what it's actually supposed to do. Give it a go!
Thanks once more anyway! If I have free time I'll try it.
@scripta55, what is OrgsAPI() referring to?
@alongmuaz, it's his own class where he added his functions esCreateIndex(), esLoadData(), etc.
And he told you about the main function for indexing data using the model and chunks:
> as for the function in esLoadData: `ReviewsElasticSearch::chunk($chunkamount, function ($flights) { $flights->addToIndex(); });`
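In other words, OrgsAPI is just a small wrapper class. A hypothetical sketch of what it might contain, based only on the snippets quoted in this thread (the class namespace and the `Org` model name are assumptions, not scripta55's actual code):

```php
<?php

namespace App\Services;

// Hypothetical wrapper around Elasticquent-style model calls - illustrative only.
class OrgsAPI
{
    public function esCreateIndex()
    {
        // Elasticquent adds createIndex()/deleteIndex() to the model.
        \App\Models\Org::createIndex();
    }

    public function esLoadData($chunkAmount)
    {
        // Walk the table in chunks and index each chunk of models.
        \App\Models\Org::chunk($chunkAmount, function ($orgs) {
            $orgs->addToIndex();
        });
    }
}
```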
Hi @scripta55, I got this error when using your code:

```
[Elasticsearch\Common\Exceptions\ServerErrorResponseException]
```
Hi @alongmuaz, are you able to curl the Elasticsearch server from outside your Homestead box, or do an HTTP GET against Elasticsearch?
your result should be like this: http://d.pr/i/1hExn
Hi @scripta55, yes - actually, according to AWS ES, there is data being inserted into the nodes. The error appeared suddenly after running your code for about 10 minutes.
http://192.168.10.10:9200/YOUR-INDEX-NAME/_count
Please share the results you get from there. Also, how many records are you intending to index into Elasticsearch?
Did you load 2,775,000 on the first try? What is your total? And the server memory/disk space isn't all used up?
Yes, the total is 3,718,988, and the server has 4 GB of RAM.
Hmm... can you provide your trace?
I didn't understand - what is the trace? Forgive me :D
The stack trace :) - the error report.
@scripta55, finally: https://d.pr/P8IW - thanks!
@alongmuaz did you just rerun it and it carried on, or did you find out what the issue was?
@scripta55, I just reran it. Today I tried to index another table and got this error:

```
[Elasticsearch\Common\Exceptions\RequestTimeout408Exception]
```

Error trace:

```
[2016-09-09 03:11:09] local.ERROR: exception 'Elasticsearch\Common\Exceptions\RequestTimeout408Exception' in /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:570 Stack trace:
```
@alongmuaz that looks like a timeout. Are you running the command with the DB and Elasticsearch on the same box, or over HTTP?
Also, decrease the chunk size - that could help.
@scripta55, hi, sorry for the late reply. I created a new helper function and used the models, which have the Elasticsearch integration inside them. Thanks for the help.
@scripta55, your progress bar is not working :)
I got a new error:

```
Elasticsearch\Common\Exceptions\ServerErrorResponseException
```

```
[2016-09-13 06:00:09] local.ERROR: exception 'Elasticsearch\Common\Exceptions\ServerErrorResponseException' in /var/www/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:673 Stack trace:
```
Ok, so from my experience it's a better idea to use the official PHP Elasticsearch client's bulk feature + Eloquent chunking. A simple foreach with addToIndex took 3 minutes on 13k products (with a lot of relations, something like 10 queries per product), with chunking it got down to 1 min 30 s, and with chunking + bulk it got down to 60-70 s. Elasticquent really needs to add the ability to bulk update; I'm thinking about doing a pull request to address this.
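A rough sketch of that chunking + bulk approach (index name, model, relation names, and chunk size here are placeholders, not my actual code):

```php
<?php

use Elasticsearch\ClientBuilder; // official elasticsearch-php client
use App\Models\Product;          // placeholder Eloquent model

$client = ClientBuilder::create()->setHosts(['localhost:9200'])->build();

Product::with('category', 'tags')->chunk(500, function ($products) use ($client) {
    $params = ['body' => []];

    foreach ($products as $product) {
        // The bulk API body alternates an action/metadata line and the document source.
        $params['body'][] = [
            'index' => [
                '_index' => 'products',
                '_type'  => 'product', // document types still existed in ES 2.x/5.x
                '_id'    => $product->id,
            ],
        ];
        $params['body'][] = $product->toArray();
    }

    // One HTTP request per chunk instead of one request per document.
    $client->bulk($params);
});
```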
For those concerned.
You should chunk your index requests using Eloquent's ->chunk() functionality. The size of your chunks will depend on the resources allocated to your Elasticsearch installation.
For example: We have only a single index, and that index holds about 70,000 - 100,000 documents. We run an Elasticsearch cluster of 6 nodes (3 data, 3 master), each node with 2.5GB allocated, spread across the physical machines with one master and one data node per machine. When indexing 70,000 documents we chunk our results into segments of 1000 per index request. This is a lot, but we can do it because we also have a load balancer that distributes the workload across our cluster.
On single-node Elasticsearch setups with less than 2GB of memory, indexing is an intense operation. On a single-node system your one node has to: index documents, serve the HTTP API, store the data in memory, and perform garbage collection. On a clustered setup these roles are separated among nodes. Therefore you should be breaking your index chunks into a MUCH smaller size, ~50 to ~100 per request. ES will actually queue the index requests internally, so try not to push index requests onto a Laravel queue. Do it all synchronously. It will take a while.
Basically, Elasticsearch is an "in memory" search platform; that means if you have a lot of documents you need A LOT of memory. The OP says:
> 16 gigabytes; 23,000,000 rows on a 1GB AWS EC2 instance
This is just not going to happen IMO. I would go for something like 8GB minimum, preferably 16GB. Remember to configure your ES_HEAP_SIZE to no more than HALF your system memory, or it will cause serious issues as your operating system will run dry of memory. Furthermore, Elasticsearch is first and foremost a clustered technology; production setups should always be a cluster and not a single node.
I say this after years of hard-won trial and error on high-volume production ES systems. If you want, go ahead and learn this again yourself, but you will end up in the same place.
If anyone is interested, here is a gist with our latest indexing service:
https://gist.github.com/andrewmclagan/7525400afe2e6ecd74c432e7ca40fd41
The crux of the chunking routine is found here:
```php
/**
 * Runs an operation on a collection
 *
 * @param \Illuminate\Support\Collection $collection
 * @param Boolean $delete
 * @return Array
 */
protected function collectionOperation(Collection $collection, $delete = false)
{
    $chunks = $collection->chunk($this->chunkSize);
    $results = [];
    $counter = 1;
    $totalChunks = $chunks->count();

    dump("Chunks discovered: {$chunks->count()} with a size of: {$this->chunkSize}");

    foreach ($chunks as $chunk) {
        dump("Executing chunk: {$counter}/{$totalChunks}");

        $params = $this->getParamsByCollection($chunk, $delete);
        $result = $this->elasticsearch->bulk($params);
        $results = array_merge($results, $result); // this is faulty; we need a proper merge routine.

        $counter++;
    }

    return $results;
}
```
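The getParamsByCollection() helper it calls isn't shown above; a minimal sketch of what such a helper could do is below (index/type names and the toArray() mapping are assumptions - see the gist for the real implementation):

```php
/**
 * Build a bulk-API request body for a chunk of models (illustrative sketch only).
 */
protected function getParamsByCollection($collection, $delete = false)
{
    $body = [];

    foreach ($collection as $model) {
        $meta = [
            '_index' => 'my-index',   // placeholder
            '_type'  => 'my-type',    // placeholder
            '_id'    => $model->id,
        ];

        if ($delete) {
            // Delete actions need only the metadata line.
            $body[] = ['delete' => $meta];
        } else {
            // Index actions need the metadata line followed by the document source.
            $body[] = ['index' => $meta];
            $body[] = $model->toArray();
        }
    }

    return ['body' => $body];
}
```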
Hi there,
a long time has passed. Is there now an option to use the Update By Query API? https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html
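Not through Elasticquent as far as I know, but the official elasticsearch-php client exposes that API directly. A small sketch (index, field names, and the script are made up for illustration):

```php
<?php

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->setHosts(['localhost:9200'])->build();

// Example: bump a counter on every document matching a term query.
$response = $client->updateByQuery([
    'index' => 'products',              // placeholder index
    'body'  => [
        'query'  => [
            'term' => ['status' => 'active'],
        ],
        'script' => [
            'source' => 'ctx._source.views += 1', // 'inline' instead of 'source' on older ES versions
            'lang'   => 'painless',
        ],
    ],
]);
```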
Hi,
I have a few tables in the database; the total size is 16 gigabytes, 23,000,000 rows.
I deployed it on AWS EC2 with 1 GB of RAM.
Does anyone have an idea how to index all of it as fast as possible?
Thank you.