floriansemm / SolrBundle

Solr-Integration into Symfony and Doctrine2
http://floriansemm.github.io/SolrBundle
MIT License

Out of Memory Exception with large dataset #157

Closed Mortagus closed 7 years ago

Mortagus commented 7 years ago

Hello everyone,

I'm just starting to use SolrBundle 1.5 with Symfony 3.1.* and I have a problem with memory usage.

When I call the solr:index:populate command on a table (MySQL) with almost 560K rows, it triggers an OutOfMemoryException.

I already increased the memory_limit parameter in the php.ini file to 512M, without success.

Can somebody help me optimize the command's code so that it uses as little memory as possible?

Have a nice day,

best regards

Benjamin

floriansemm commented 7 years ago

Can you upgrade to version 1.6.*? It includes a couple of performance improvements.

Mortagus commented 7 years ago

I'm going to check with my peers, but I think it won't be a problem. I'll keep you posted.

Mortagus commented 7 years ago

First of all, thank you for your response. I was afraid I'd have to wait much longer for an answer :-).

Secondly, I updated the bundle version and tried to populate the Solr index with the command. Here is the error message:

Synchronize 559206 entities

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 12288 bytes) in /var/www/vendor/doctrine/orm/lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php on line 296

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 32768 bytes) in /var/www/vendor/symfony/symfony/src/Symfony/Component/Debug/Exception/OutOfMemoryException.php on line 1

Can I ask: what size of dataset has this bundle been tested with?

floriansemm commented 7 years ago
  1. increase your memory limit
  2. run ./bin/console solr:index:populate --flushsize=200

The bundle is tested with ~61k rows.

Mortagus commented 7 years ago

I already increased the memory limit to 512M. I think that's already quite high.

And I forgot to mention it, but I already tried reducing the flushsize parameter. Here is the call I made:

php bin/console solr:index:populate --flushsize=50 Bundle:Entity

Anything else?

floriansemm commented 7 years ago

Can you increase the limit just for the import? Importing always needs a lot of memory.
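
One way to do that, assuming the command is run through the PHP CLI, is to override the limit for that single run (the value below is just an example):

php -d memory_limit=3G bin/console solr:index:populate --flushsize=200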

Mortagus commented 7 years ago

Do you have a specific amount to recommend?

floriansemm commented 7 years ago

You need at least 2-3GB for your 520k rows.

Mortagus commented 7 years ago

I will try that then.

But don't you think that's too much? If it's not possible to avoid that situation, can I suggest adding a parameter to the command to control that memory_limit?

floriansemm commented 7 years ago

Don't worry. The main problem is that Symfony and Doctrine allocate a lot of memory. I had the same problem a few months ago.

You have to set a new memory_limit (via ini_set) at the beginning of the script, in this case at the beginning of bin/console, so an optional command parameter will not work.
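
A minimal sketch of what that could look like at the very top of bin/console (the value is only an example):

    #!/usr/bin/env php
    <?php
    // bin/console (sketch): raise the memory limit before the kernel boots.
    // '-1' would remove the limit entirely; a fixed value such as '3G' is safer.
    ini_set('memory_limit', '3G');

    // ... the rest of the standard Symfony bootstrap stays unchanged ...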

Mortagus commented 7 years ago

I will check with my infrastructure guy, because even when I use memory_limit = -1, I still get the same error about the 512M limit being exceeded.

If I specify 2048M in the php.ini file, it's not accepted and the system falls back to the 512M value.

Anyway, I'll continue to work on that and I'll keep you posted ;-)

Mortagus commented 7 years ago

Ok,

I tried with memory_limit = 2048M and it took more than an hour to crash.

I will run another test with the limit increased to 4096M.

But I don't have the time right now. I'll keep you posted ;-)

Koalabaerchen commented 7 years ago

Stupid question maybe: what PHP version are you on? As far as I know, a single PHP process (below PHP 7) cannot allocate more than 2G, even on 64-bit builds, so going above 2G won't do much. That affects POST sizes as well as memory usage. There are patches for this for PHP versions lower than 7, and it is supposed to be fixed in PHP 7 (but I haven't tested that). Please correct me if I'm wrong.

Also: Doctrine's EntityManager does some weird stuff. If the populate command won't work, try your own command: fetch the entities in smaller batches and detach() them from the EntityManager after you have pushed them into Solr.
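
A rough sketch of that idea inside a custom command's execute() method; the entity name, the 'solr.client' service id and its addDocument() method are assumptions used for illustration, not documented API:

    $em = $this->getContainer()->get('doctrine.orm.entity_manager');
    $solr = $this->getContainer()->get('solr.client'); // assumed service id of the bundle's Solr facade

    $batchSize = 500;
    $offset = 0;

    do {
        // fetch one page of entities at a time instead of the whole table
        $entities = $em->createQuery('SELECT e FROM AppBundle:Entity e ORDER BY e.id ASC')
            ->setFirstResult($offset)
            ->setMaxResults($batchSize)
            ->getResult();

        foreach ($entities as $entity) {
            $solr->addDocument($entity); // assumed single-document indexing call
            $em->detach($entity);        // stop the EntityManager from tracking it
        }

        $em->clear();       // drop anything still held by the UnitOfWork
        $offset += $batchSize;
    } while (count($entities) === $batchSize);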

floriansemm commented 7 years ago

I think the best way to import your data is to write a little PHP script (not a Symfony command) with some PDO magic to export your data as JSON. Then you can use the Solr post tool to import it.
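
Something along these lines, for example; the table, field, path and core names are made up, and it assumes dynamic Solr fields like name_s exist in the schema:

    <?php
    // export.php (sketch): stream the table through PDO and write JSON chunks
    // that the Solr post tool can index afterwards.
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'password');
    // unbuffered mode, so the whole table is not pulled into memory at once
    $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

    $stmt = $pdo->query('SELECT id, name, email FROM person');

    $docs = array();
    $file = 0;
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        $docs[] = array(
            'id'      => 'person_'.$row['id'],
            'name_s'  => $row['name'],
            'email_s' => $row['email'],
        );

        if (count($docs) === 10000) { // flush in chunks to keep memory usage flat
            file_put_contents(sprintf('/tmp/export/data%d.json', $file++), json_encode($docs));
            $docs = array();
        }
    }
    if ($docs) {
        file_put_contents(sprintf('/tmp/export/data%d.json', $file), json_encode($docs));
    }

The resulting directory can then be handed to the post tool, e.g. ./post -c core0 /tmp/export/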

Mortagus commented 7 years ago

@Koalabaerchen: this is not a stupid question at all ;-). I should have mentioned that from the start: I'm using PHP 7.

I'm trying the command with 4GB. If that doesn't work, I'm going to develop my own logic.

Mortagus commented 7 years ago

@floriansemm Hello,

I tried the command after setting the memory_limit to 4GB, and it takes so much time that I can't use it like this.

Having thought about it, I've got a question: do you think it would be possible to implement an incremental version of the populate command? Right now it gives me the impression that each time it's called, it recreates the index from scratch. And my large dataset of 560k rows is not the only one of that size I've got (I have another one just as large).

It could be a nice addition to the bundle. What do you think?

Mortagus commented 7 years ago

That said, I just remembered the SynchronizationFilter annotation, which could be the answer to my problem :-)

Mortagus commented 7 years ago

If I want to build an incremental command, I need to be able to get the id of the last Solr document that was inserted.

How can I query that? I can't find the documentation I need.

floriansemm commented 7 years ago

A command like ./bin/console solr:index:populate AppBundle:Entity --start-offset=100 --stop-offset=300 is possible.

To get the last inserted document: http://stackoverflow.com/questions/12346240/how-to-get-last-indexed-record-in-solr
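
For reference, the trick from that answer is to sort on a timestamp field that the schema fills automatically (it assumes a field defined with default="NOW"; the core name is a placeholder):

curl 'http://localhost:8983/solr/core0/select?q=*:*&sort=timestamp+desc&rows=1'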

I have implemented a prototype command which can index a huge number of rows/documents (an entity with 3 fields, 5,000,000 rows). My local setup has 2GB RAM and it took 15min to export/import all rows.

Ping me if you want some information about how it works.

Mortagus commented 7 years ago

@floriansemm thanks for your link, but this is not what I was searching for.

I need to send the same request, but with the PHP Solr query builder. I can't find out how to do that.

Your implementation sounds very interesting, and also very close to my own situation. Yes, I would like to know how you did it, because the last time I tried to import my data, it still wasn't finished after three full hours.

floriansemm commented 7 years ago

The script selects 100,000 rows at a time from the table person and writes them to a CSV file. My vagrant box has 2GB and 2 CPUs and needs ~15min for the export.

        // Export the table in chunks of 100k rows, one CSV file per chunk.
        $threshold = 100000;
        $totalRows = 5000000;
        $pages = ceil($totalRows / $threshold);

        for ($i = 0; $i < $pages; $i++) {
            // executeQuery() already runs the statement, no extra execute() call is needed.
            $statement = $this->getContainer()->get('doctrine.dbal.default_connection')
                ->executeQuery(sprintf('SELECT id, name, email FROM person LIMIT %s, %s', $threshold * $i, $threshold));

            $data = array();
            foreach ($statement->fetchAll() as $row) {
                $data[] = sprintf("\"person_%s\", \"%s\", \"%s\"\n", $row['id'], $row['name'], $row['email']);
            }

            // Write the CSV header first, then append the rows of this chunk.
            $fileName = '/tmp/test/data'.$i.'.csv';
            file_put_contents($fileName, "id, name_s, email_s\n");
            file_put_contents($fileName, implode('', $data), FILE_APPEND);

            $output->writeln('write page '.$i);

            $statement->closeCursor();
        }

When it's done I call the Solr post tool. All files under /tmp/test will be indexed:

./post -c core0 /tmp/test/

Mortagus commented 7 years ago

I think there is a misunderstanding: I know how to get data with Doctrine.

I thought you were talking about Solr. I misread your message.

floriansemm commented 7 years ago

Yes, we are talking about Solr. The last code snippet shows how I add the documents to my Solr core; I use the Solr post tool for this.

Mortagus commented 7 years ago

OK, I think we're on a good track here.

Do you think it's possible to use the Solr PHP client to inject data into the index from something other than entity classes?

floriansemm commented 7 years ago

This bundle is designed for Doctrine entities. Other data sources are not supported, but you can implement your own stuff and use the underlying Solarium lib to index documents.

Mortagus commented 7 years ago

Hello,

just to keep you posted: I first tried to implement my own solution using the Solr client from the Solarium library, but I'm struggling to understand how to use it to get what I want.

So, just to see whether it was a better solution, I implemented your idea of generating a bunch of CSV files and importing them with the post tool, as you suggested.

My conclusion is: even your suggested solution is slow... so slow. I was optimistic, but my table is so large that even the process of creating the CSV files slows down as more files are created.

That is unexpected, by the way: the further along the process gets, the more time each file takes to produce. I don't understand why.

Just so you know: my table has 535,882 rows and each row has 179 columns, including some text fields that can be pretty long... my use case is just a nightmare...

floriansemm commented 7 years ago

I think every export script will crash, but you can try something like this: do the export directly in MySQL-land. It is really fast: Query OK, 5000000 rows affected (14.40 sec)!
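
The snippet itself is not shown here, but an export done "directly in MySQL-land" usually means something like SELECT ... INTO OUTFILE; a sketch, assuming the MySQL user has the FILE privilege and secure_file_priv allows the target directory, with illustrative column names:

    SELECT CONCAT('person_', id), name, email
    INTO OUTFILE '/tmp/export/person.csv'
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM person;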

If you really want this export/import thing then you have to clean up your table.

Mortagus commented 7 years ago

In order to implement my own command to populate the Solr index, I have a question about the existing populate command.

I'm trying to get the maximum value of the 'id' field of a specific entity, and something strange was returned: 999999. This is strange because it's not the highest value in the database, and I couldn't find an explanation anywhere.

Does somebody have an explanation, please?

floriansemm commented 7 years ago

https://github.com/floriansemm/SolrBundle/blob/master/Repository/Repository.php#L71

That is the limit you run into if you use the repository method findAll.

Mortagus commented 7 years ago

Yes, good point. Thanks for your answer, but that is not the problem.

Here is the query I wrote:

    $this->hydrationMode = HydrationModes::HYDRATE_INDEX;
    $query = $this->solr->createQuery($this->entityName);
    $query->setRows(1);
    $query->addSort('id', Query::SORT_DESC);
    $result = $query->getResult();

EDIT: I wrote my own Repository and this is a specific method I wrote in it ;-)

Mortagus commented 7 years ago

As you can see, I've set

`$query->setRows(1);`

If the limit is set at 1,000,000, how can I retrieve the max value of a specific field for a specific type of document?

floriansemm commented 7 years ago

Your query does not work because the id is a string like documentname_1234, so the sort is lexicographic rather than numeric (which is why 999999 comes out on top). The easiest way to get the highest id is to use Doctrine and the underlying database.
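
For example, with a plain DQL aggregate (a sketch; the entity class name is a placeholder):

    $maxId = $entityManager->createQuery('SELECT MAX(e.id) FROM AppBundle\Entity\Person e')
        ->getSingleScalarResult();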

Mortagus commented 7 years ago

The idea here is to get the last inserted id from Solr, not from Doctrine. That said, I don't understand what you're saying: why does the fact that my id is a string influence the query's result? The field values are always numbers, so even if they are stored as strings, the result shouldn't be capped at some number. Maybe by the length of the string?

By the way, why is the id, which is declared as an integer in the entity class, represented as a string in Lucene?

Mortagus commented 7 years ago

Hello @floriansemm,

good news: I finally made a script that handles memory correctly (it seems so, at least; my tests are not finished yet).

It's a bit complicated to explain all of the code, but basically I copied a lot of your code, just to keep the same logic. It's possible that I made some changes, but only in the way the code is structured, not in the logic.

Then I added some calls to EntityManager::detach() where I could. I also added calls to EntityManager::clear() at the end of each batch loop. Finally, I even added calls to gc_collect_cycles(), just to be sure that memory is freed at the end of each entity indexing step.

But that wasn't the whole problem. I also had to deal with the fact that I'm going through a very large table (170 columns, more than 550K rows). To handle that, I wrote a specific DQL query to limit the number of rows managed by the EntityManager. I found a page that explains the problem clearly and proposes a solution: http://www.xarg.org/2011/10/optimized-pagination-using-mysql/
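
The core of that pagination trick looks roughly like this (a sketch of the idea, not the actual command; the entity name and batch size are placeholders):

    $lastId = 0;
    do {
        // keyset pagination: filter on the last seen id instead of using an OFFSET,
        // so each page stays cheap no matter how deep into the table we are
        $entities = $em->createQuery('SELECT e FROM AppBundle:Entity e WHERE e.id > :lastId ORDER BY e.id ASC')
            ->setParameter('lastId', $lastId)
            ->setMaxResults(500)
            ->getResult();

        foreach ($entities as $entity) {
            // ... index the entity in Solr ...
            $lastId = $entity->getId();
            $em->detach($entity);
        }

        $em->clear();
        gc_collect_cycles();
    } while (count($entities) > 0);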

I hope this is going to help some people. If my explanations are not clear (which is completely possible), feel free to ask anything.

floriansemm commented 7 years ago

Your problem is very special but I learned a lot from it. Thanks for the link, it is a very interesting article!

I have started a little tutorial on how to index large sets of entities.

Mortagus commented 7 years ago

I also tried your suggestion about the MySQL export, but my server didn't have the rights to write the export files, so it didn't work for me.

That said, the other problem was still there: my query was taking more time for each "page". Don't forget to sort your query on a column (id, for example) and use the last handled value as a filter to skip the rows you have already processed ;-)