FriendsOfSymfony / FOSElasticaBundle

Elasticsearch PHP integration for your Symfony project using Elastica.
http://friendsofsymfony.github.io
MIT License

Out of memory - populate #273

Closed armetiz closed 10 years ago

armetiz commented 11 years ago

Hi there, I'm testing FOSElasticaBundle and its populate command.

I run into out-of-memory problems with a "huge" set of rows. On my computer the problems start at around 100k rows in the RDBMS.

Elasticsearch is designed to be used with more than 100k documents.

Regards,

joelataylor commented 11 years ago

Experiencing this issue also --- any solutions?

jmikola commented 11 years ago

Is this with Propel or Doctrine? If the latter, which library and version?

Most of the memory issues reported have nothing to do with Elastica, but are related to processing a large number of Doctrine entities.

armetiz commented 11 years ago

I don't remember which version, but I've tested with Doctrine.

joelataylor commented 11 years ago

Hmmm, that's very possible. Here's our libs:

doctrine/annotations                 v1.1.1              Docblock Annotations Parser
doctrine/cache                       v1.0                Caching library offering an object-oriented API for many cache backends
doctrine/collections                 v1.1                Collections Abstraction library
doctrine/common                      2.4.0-RC3           Common Library for Doctrine projects
doctrine/dbal                        2.3.4               Database Abstraction Layer
doctrine/doctrine-bundle             v1.2.0              Symfony DoctrineBundle
doctrine/doctrine-migrations-bundle  dev-master 6891b85  Symfony DoctrineMigrationsBundle
doctrine/inflector                   v1.0                Common String Manipulations with regard to casing and singular/plural rules.
doctrine/lexer                       v1.0                Base library for a lexer that can be used in Top-Down, Recursive Descent Parsers.
doctrine/migrations                  v1.0-ALPHA1         Database Schema migrations using Doctrine DBAL
doctrine/orm                         2.3.4               Object-Relational-Mapper for PHP

jmikola commented 11 years ago

/me sigh of relief that it's not ODM this time

I think this is due to memory leaks in the UnitOfWork (perhaps due to circular object references). If you remove Elastica from the equation by tweaking the bundle code, can you reproduce the memory leak simply by iterating through 100k+ entities?
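A rough way to run that experiment is to walk the rows in batches and sample memory after each batch. The sketch below is only an illustration: `profileIteration()`, `fakeRows()` and the `$afterBatch` hook are hypothetical names, and a generator stands in for the real Doctrine query so the snippet runs on its own.

```php
<?php

/**
 * Walks any iterable in batches and records memory usage after each batch.
 * $afterBatch is a hypothetical hook; with Doctrine it could call $em->clear().
 */
function profileIteration(iterable $rows, int $batchSize, callable $afterBatch): array
{
    $samples = [];
    $count = 0;

    foreach ($rows as $row) {
        $count++;
        if ($count % $batchSize === 0) {
            $afterBatch();
            gc_collect_cycles(); // explicitly collect circular references
            $samples[$count] = memory_get_usage();
        }
    }

    return $samples;
}

// Stand-in data source: a generator instead of a real Doctrine query.
function fakeRows(int $total): Generator
{
    for ($i = 1; $i <= $total; $i++) {
        yield ['id' => $i, 'title' => "Row $i"];
    }
}

$samples = profileIteration(fakeRows(100000), 10000, function () {
    // With Doctrine ORM this is where $entityManager->clear() would go.
});

foreach ($samples as $rowsSeen => $bytes) {
    printf("%6d rows: %.1f MB\n", $rowsSeen, $bytes / 1048576);
}
```

With Doctrine ORM 2.x, passing `$query->iterate()` as `$rows` and calling `$entityManager->clear()` in the hook should exercise the ORM alone. If the numbers keep climbing from batch to batch, the leak is on the Doctrine side rather than in Elastica.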

mvrhov commented 11 years ago

If it's Doctrine2 then it's most definitely the circular reference problem, and I don't think there is a solution for that. The recommendations on the Doctrine website on how to process large datasets DO NOT work when your objects have circular references.

evolchek commented 11 years ago

Same issue here. Doctrine. No circular references though.

mvrhov commented 11 years ago

I rewrote most of my import functions to use spork. The imports do get significantly slower, but they complete without a hiccup.
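The idea behind forking is that each batch runs in a child process, so whatever memory it leaks is released when the child exits. The sketch below shows that idea with plain `pcntl_fork()` rather than spork's own API, which is a swapped-in technique; `runInChild()` and the batch sizes are hypothetical, and the pcntl extension (CLI only) is assumed.

```php
<?php

/**
 * Runs $work in a forked child process and returns the child's exit code.
 * Memory allocated by the child is released when the child exits.
 * Assumes the pcntl extension is loaded (CLI only, not available on Windows).
 */
function runInChild(callable $work): int
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('Could not fork');
    }

    if ($pid === 0) {
        // Child process: do the work, then exit instead of returning.
        try {
            $work();
        } catch (Throwable $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
            exit(1);
        }
        exit(0);
    }

    // Parent process: block until this child has finished.
    pcntl_waitpid($pid, $status);

    return pcntl_wifexited($status) ? pcntl_wexitstatus($status) : -1;
}

$exitCodes = [];
if (function_exists('pcntl_fork')) {
    // Hypothetical import of 30,000 rows in batches of 10,000.
    for ($offset = 0; $offset < 30000; $offset += 10000) {
        $exitCodes[] = runInChild(function () use ($offset) {
            // Stand-in for "load these rows and index them".
            $rows = range($offset, $offset + 9999);
        });
    }
}
```

Database connections opened before the fork should not be shared with the child; in a real import each child would open its own connection, which is part of why this approach is slower.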

jmikola commented 10 years ago

@evolchek: The circular reference issue is not necessarily caused by entity/document classes referring to each other, but rather references to UnitOfWork and other internal classes.

nmpolo commented 10 years ago

I've been having the same problem whilst trying to populate an index with ~4m documents coming out of MongoDB. I've been using the following command:

php app/console fos:elastica:populate --no-debug --env=prod -q

It seems that logging is causing my problems. If I comment out $this->_logger = $logger; in Elastica\Client::setLogger, the issue no longer occurs. If anyone else has the same problem, you can easily fix it with the following code:

#app/config/config.yml
parameters:
    fos_elastica.client.class: Namespace\Elastica\Client

#src/Namespace/Elastica/Client.php
<?php

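// "Namespace" is a placeholder: replace it with your own vendor namespace
// here and in config.yml ("namespace" is a reserved word in PHP).
namespace Namespace\Elastica;

use Elastica\Client as BaseClient;
use Psr\Log\LoggerInterface;

class Client extends BaseClient
{
    // Deliberately ignore the logger so the client never stores it.
    public function setLogger(LoggerInterface $logger)
    {
        return $this;
    }
}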

I haven't had time to look into why setting the logger causes this issue, though, so there could be a better/simpler solution.

caponica commented 10 years ago

Using Doctrine here and I get problems loading really "huge" fixtures. E.g. 10 records. (Not 10 million... just 10).

PHP doesn't run out of memory, but Java does. Usual memory usage starts at about 150 MB, then shoots up to over 1 GB and never comes down again. Fixture loading stalls, then eventually (sometimes) explodes, complaining about an Elastica timeout:

[Elastica\Exception\Connection\HttpException]
Operation timed out

The Elasticsearch log shows lots of lines like this:

[2013-12-16 16:27:16,804][WARN ][monitor.jvm] [Sasquatch] [gc][ParNew][1048][169] duration [1.1s], 
  collections [1]/[4.9s], total [1.1s]/[7.8s], memory [671.7mb]->[764mb]/[990.7mb], 
  all_pools {[Code Cache] [3.7mb]->[3.7mb]/[48mb]}
  {[Par Eden Space] [91.7mb]->[72.7mb]/[266.2mb]}{[Par Survivor Space] [33.2mb]->[0b]/[33.2mb]}
  {[CMS Old Gen] [546.7mb]->[691.2mb]/[691.2mb]}{[CMS Perm Gen] [30.3mb]->[30.3mb]/[82mb]}
[2013-12-16 16:28:39,494][INFO ][monitor.jvm] [Sasquatch] [gc][ConcurrentMarkSweep][1068][30] duration [5s], 
  collections [1]/[5.2s], total [5s]/[1.4m], memory [923.6mb]->[925.2mb]/[990.7mb], 
  all_pools {[Code Cache] [3.7mb]->[3.7mb]/[48mb]}
  {[Par Eden Space] [232.4mb]->[233.9mb]/[266.2mb]}{[Par Survivor Space] [0b]->[0b]/[33.2mb]}
  {[CMS Old Gen] [691.2mb]->[691.2mb]/[691.2mb]}{[CMS Perm Gen] [30.3mb]->[30.3mb]/[82mb]}

Not very helpful! I'm hoping I've got something set up wrong, but I haven't found anything wrong with the config so far...

damienalexandre commented 10 years ago

Are you sending huge documents? If so, you should try reducing the bulk size with the batch-size option, and add some sleep between batches.
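As a rough illustration of "smaller batches plus a pause", here is a minimal sketch. `indexInBatches()` and the `$sendBatch` callable are hypothetical stand-ins for the real bulk request, not part of the bundle or of Elastica.

```php
<?php

/**
 * Sends documents in fixed-size batches, pausing between batches.
 * $sendBatch is a hypothetical callable standing in for the real bulk request.
 */
function indexInBatches(array $documents, int $batchSize, int $pauseMs, callable $sendBatch): int
{
    $batches = array_chunk($documents, $batchSize);
    $last = count($batches) - 1;

    foreach ($batches as $i => $batch) {
        $sendBatch($batch);
        if ($i < $last) {
            usleep($pauseMs * 1000); // give Elasticsearch time to catch up
        }
    }

    return count($batches);
}

// 250 fake documents, 100 per batch, 50 ms pause between batches.
$sent = [];
$batchCount = indexInBatches(range(1, 250), 100, 50, function (array $batch) use (&$sent) {
    $sent[] = count($batch);
});
```

Here `$sent` ends up as three batches of 100, 100 and 50 documents. Lowering the batch size trades throughput for a smaller request body and less pressure on the Elasticsearch heap.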

Also, you should try to index those documents outside the bundle, via cURL.
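One way to try that is to write the documents out in the newline-delimited format the Elasticsearch bulk endpoint expects and post the file with cURL. The sketch below assumes the index and type names from this thread; `buildBulkBody()` and the `bulk.json` file name are hypothetical, and the `_type` field only applies to older Elasticsearch versions like the ones discussed here.

```php
<?php

/**
 * Builds a bulk request body: one action line plus one source line per
 * document, each terminated by a newline. Array keys are used as IDs.
 */
function buildBulkBody(string $index, string $type, array $documents): string
{
    $lines = [];
    foreach ($documents as $id => $source) {
        $lines[] = json_encode([
            'index' => ['_index' => $index, '_type' => $type, '_id' => (string) $id],
        ]);
        $lines[] = json_encode($source);
    }

    return implode("\n", $lines) . "\n";
}

$body = buildBulkBody('website', 'page', [
    1 => ['pageTitle' => 'Article page #1', 'summaryText' => 'Summary text for article page 1'],
    2 => ['pageTitle' => 'Article page #2', 'summaryText' => 'Summary text for article page 2'],
]);

// Hypothetical next step: save the body and post it without PHP in the loop.
// file_put_contents('bulk.json', $body);
// curl -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.json
```

If the same documents index quickly this way, the slowdown is more likely in the bundle or the serializer than in Elasticsearch itself.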

caponica commented 10 years ago

batch_size is set to 1 (for now).

The documents are not huge, here's an example (the rest are similar):

    $pageRealArticle2 = new Page();
    $pageRealArticle2->setPageTitle('Article page #2');
    $pageRealArticle2->setPageType(Page::TYPE_ARTICLE);
    $pageRealArticle2->setSiteFamilyId(1);
    $pageRealArticle2->addSite($this->getReference('site-real'));
    $pageRealArticle2->setShortcode(self::OLD_SHORTCODE_ART_2);
    $pageRealArticle2->setUrl(self::OLD_URL_ART_2);
    $pageRealArticle2->setPublicationDate(new \DateTime('2013-01-02'));
    $pageRealArticle2->setActiveLevelBySiteOwner(Page::ACTIVE_LEVEL_SHOW_PUBLIC);
    $pageRealArticle2->setSummaryText('Summary text for article page 2');
    $manager->persist($pageRealArticle2);

All the class constants are simple strings or ints.

The fos_elastica config is:

fos_elastica:
    clients:
        default:                  { host: localhost, port: 9200 }
    serializer:                   ~   # leaving it blank like this enables the use of KnpPaginator
    indexes:
      website:
        client:                   default
        index_name:               %elastica_index_name%
        types:
          page:
            mappings:
              pageTitle:          { boost: 5 }
              summaryText:        { boost: 3 }
              introText:          { boost: 3 }
              text1:              { boost: 2 }
              text2:              { boost: 2 }
              metaDescription:    { boost: 1 }
              metaKeywords:       { boost: 1 }
            persistence:
              driver:             orm # orm, mongodb, propel are available
              model:              Acme\MyBundle\Entity\Page
              provider:
                query_builder_method: createElasticSearchQueryBuilder
                batch_size:           1
              listener:
                is_indexable_callback: 'isElasticSearchIndexable'
              finder:             ~   # enables retrieval of Doctrine entities via fos_elastica.finder.[index].[type] service

(Edit: I'm guessing the serializer is where things are falling down... doing some reading around that atm!)

caponica commented 10 years ago

OK, the problem I mentioned above is a new one (it didn't use to happen). I've tracked it down as far as this:

https://github.com/FriendsOfSymfony/FOSElasticaBundle/commit/d546b4d3f3aad047aac4329f34f57add691733e4#diff-e0950e36ea63dd2c4b5151242670298c

Before this version of this file everything works quickly with no problems. However, the changes in this commit (in the DependencyInjection/FOSElasticaExtension.php file) seem to cause the slow-down and fixture loading grinds to a halt, even with my modest test data.

Can somebody wiser than I am look into this?

merk commented 10 years ago

Try disabling the serializer

caponica commented 10 years ago

Thanks @merk - removing the serializer sorts out the fixture loading.

What is the serializer entry for in the config.yml and when should it (not) be used?

merk commented 10 years ago

The serializer allows the bundle to automatically convert objects to JSON and send them directly to Elasticsearch, meaning you don't need to define mappings for types.

You do, however, need to define JMS Serializer metadata for each entity you're indexing; otherwise the bundle will try to serialize the entire object graph, which is not what you want.
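For illustration, such metadata could look something like the sketch below, assuming JMS Serializer's annotation driver is in use. The `Page` class is a cut-down, hypothetical version of the entity from this thread; only the exposed fields would end up in the indexed document.

```php
<?php

namespace Acme\MyBundle\Entity;

use JMS\Serializer\Annotation as JMS;

/**
 * Exclude everything by default, then expose fields one by one.
 *
 * @JMS\ExclusionPolicy("all")
 */
class Page
{
    /**
     * @JMS\Expose
     */
    private $pageTitle;

    /**
     * @JMS\Expose
     */
    private $summaryText;

    /**
     * Not exposed, so the related Site objects are never serialized.
     */
    private $sites;
}
```

Excluding everything by default keeps associations such as `$sites` out of the serialized document, which is what stops the serializer from walking the whole object graph.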