bcit-ci / CodeIgniter

Open Source PHP Framework (originally from EllisLab)
https://codeigniter.com/
MIT License

$query->result() memory exhausted #1351

Closed juanitomint closed 12 years ago

juanitomint commented 12 years ago

While getting large results (4k rows) I get this from Xdebug:

( ! ) Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 38 bytes) in /home/ci2/public_html/system/database/drivers/mysql/mysql_result.php on line 167

Call Stack:

#   Time      Memory      Function                                                              Location
1   0.0005    347340      {main}( )                                                             ../index.php:0
2   0.0016    410180      require_once( '/home/ci2/public_html/system/core/CodeIgniter.php' )   ../index.php:202
3   0.0349    2320000     call_user_func_array ( )                                              ../CodeIgniter.php:359
4   0.0349    2320072     Testdb->run_test( )                                                   ../CodeIgniter.php:359
5   0.3494    2418348     CI_DB_result->result( )                                               ../testdb.php:18
6   0.3494    2418424     CI_DB_result->result_object( )                                        ../DB_result.php:51
7   10.8952   134194960   CI_DB_mysql_result->_fetch_object( )                                  ../DB_result.php:119
8   10.8952   134195004   mysql_fetch_object ( )                                                ../mysql_result.php:167

The problem is that DB_result.php is trying to add all rows to the object it returns, exhausting memory
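
The call stack above points at result_object() in DB_result.php; roughly, that method loops over the driver's _fetch_object() and appends every row to an internal array before returning anything, so the entire result set is held in PHP memory at once. A simplified sketch of that pattern (not the exact CI source):

// Simplified illustration of the buffering done by CI_DB_result::result_object()
public function result_object()
{
    if (count($this->result_object) > 0)
    {
        return $this->result_object;   // already buffered: return the cached copy
    }

    while ($row = $this->_fetch_object())
    {
        $this->result_object[] = $row; // every row is appended before anything is returned
    }

    return $this->result_object;
}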

I provide a controller to test this out:

<?php
//---- file name: /controllers/testdb.php
if (!defined('BASEPATH')) exit('No direct script access allowed');

class Testdb extends CI_Controller {

public function index() {
    echo "<h1>DB Memmore Exhaustion TEST (128Mb)</h1>";
    echo "<ol>";
    echo "<li><a href='/testdb/create_db'>Create a table with 4k records</a></li>";
    echo "<li><a href='/testdb/run_test'>Run test</a></li>";
    echo "</ol>";
}
public function run_test() {
    $query=$this->db->get('ci_mem_test');

    foreach($query->result() as $row){//<---- This line will exhaust 128Mb

    }
    echo "yes!";
}
public function create_db() {
    //---load dbforge
    set_time_limit(3600);
    $this->load->dbforge();

    $this->dbforge->drop_table('ci_mem_test');
    $this->dbforge->add_field('id');
    $this->dbforge->create_table('ci_mem_test');
    $object=new stdClass();
    $object->id=0;
    for($i=1;$i<=400000;$i++){
        $this->db->insert('ci_mem_test',$object);
    }
    //$this->output->enable_profiler(TRUE);
}

}

/* End of file testdb.php */
/* Location: ./application/controllers/testdb.php */

narfbg commented 12 years ago

OK, you've managed to do almost everything possible to make this script as memory-intensive as it can be. :)

About a week ago a similar issue (#1318) was reported as a "memory leak", only it was just about the inserts. You can read my comments on it here: https://github.com/EllisLab/CodeIgniter/issues/1318#issuecomment-5547161

Your case, however, deals with fetching data as well, and there's more that comes into play here - keeping an actual result set in PHP's allocated memory. Obviously, simple_query() won't help with a SELECT statement, but again - it should be used for the inserts (if you don't want to use insert_batch(), that is), and here's another trick ... you can disable the query history. This is undocumented, but there are two ways to do that:
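
A sketch of both, assuming the save_queries flag (the runtime assignment also appears in the last comment of this thread; the config-file key name is an assumption and may vary between CI versions):

// 1. At runtime, on the loaded DB object, before running the heavy queries:
$this->db->save_queries = FALSE;

// 2. Per connection, in application/config/database.php (assumed key name):
$db['default']['save_queries'] = FALSE;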

... and of course, when dealing with large result sets - you should try to optimize all of your code. Instead of the foreach() loop, you can use this:

$query = $query->result();
for ($i = 0, $c = count($query); $i < $c; $i++)
{
    // access $query[$i] instead of $row
}

Note that it's important to reuse the original variable that you assigned the query result to, if the array of result objects is the only thing you need from it - assigning the result to the same variable effectively deletes the old result object, which holds other data as well. And the for() loop will access the rows directly, while foreach() creates a copy of each row on every iteration.

And now the obvious stuff - does somebody seriously need to fetch 400000 rows in a web application? I suspect that this probably comes from a forum thread, which is why it's been reported twice in a 1-2 week period while it had never been mentioned before. If that's the case - please point to this issue's URL. :)

ckdarby commented 12 years ago

@narfbg This wasn't from a forum thread; My web crawler does 400K selects & a max of 400K row inserts... it is used to find collisions & changes to contact information.

At the previous company I worked at, when we were crunching through articles and had to generate averages, stats, and commonalities, we were selecting 15+ million rows.

ghost commented 12 years ago

- Only select the bare minimum of columns.
- Use LIMIT clauses to break the query into small chunks (see the sketch below).
- Bypass the CI db class and write something yourself that is more memory efficient.
- Use something provided by the db to help - maybe a view.
- Do what your previous company did to handle 15+ million rows.
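
For the LIMIT-based chunking, a minimal sketch using $this->db->limit() against the test table from the issue description (chunk size and column are arbitrary choices):

$chunk  = 1000;
$offset = 0;

do
{
    $this->db->select('id');
    $this->db->limit($chunk, $offset);
    $query = $this->db->get('ci_mem_test');

    foreach ($query->result() as $row)
    {
        // process $row
    }

    $rows = $query->num_rows();
    $query->free_result();   // release the buffered chunk before fetching the next one
    $offset += $chunk;
}
while ($rows === $chunk);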

ckdarby commented 12 years ago

@gdragffy I am selecting the minimum & limiting where I can. This isn't a matter of bypassing the CI db class & writing something myself; this is a matter of CI's db class needing to be able to scale properly.

juanitomint commented 12 years ago

@ckdarby totally agree with you, it is not fair to blame the process or the needs; the library is written in a way that should be usable the same way for small or big result sets.

@narfbg: I used to run this same process with PHP and an 8MB memory limit using ADOdb and:

<?php
$query = $conn->Execute($SQL) or die($SQL);
while (!$query->EOF) {
?>

In the code you posted above, $query->result(); still exhausts the memory. Is there a similar construction in CI like the ones in ADOdb ($query->FetchRow(), $query->EOF, ...)?

juanitomint commented 12 years ago

@gdragffy If I have to code around CI then I'd rather choose another framework, but I love the way CI works and I want to be able to use it in as many scenarios as possible. Besides, I'm leading a team of developers, we're trying to switch to CI, and we want consistency in the way the code is written (and read).

ckdarby commented 12 years ago

@juanitomint I pointed out the same thing about the inserts on ticket #1318.

juanitomint commented 12 years ago

@ckdarby yeah, I hit the same bug/defect, but I tried to address this one first because it seemed more severe to me.

ghost commented 12 years ago

@ckdarby did you try exactly the same query using PHP's built-in MySQL functions - did it exhaust memory as well?

narfbg commented 12 years ago

I agree that CI could scale better and I have some ideas about new features that would help in that direction. But this would again require different usage, and it's exactly a more effective usage that we're actually talking about here, not a bug. All of the functionality that the framework provides comes at a certain price - you can't expect it to be as fast as raw PHP code. The code that you've shown in the issue description will use an amount of memory 4 times larger than the actual data. Simply by using simple_query() for the inserts or turning save_queries off, you'll reduce the memory consumption by at least 25%. Replace the foreach() loop with a for() similar to what I've shown and you now have 50% of that. As I already said - foreach() will create a copy of each of the result array elements before accessing them, and PHP's garbage collector won't delete those copies instantly. foreach() might be easier to use, but again - this comes at a price. And with the price being copying 400k (yes, it's 400k - not 4k) rows - it's not a good trade-off.
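
To make those savings concrete, a sketch of the create_db() insert loop from the test controller rewritten along those lines (insert_batch() being the other option mentioned earlier); the table name, column, and row count are taken from the original controller:

// Same 400k inserts, but with no query-builder overhead and no query history kept.
$this->db->save_queries = FALSE;

for ($i = 1; $i <= 400000; $i++)
{
    $this->db->simple_query('INSERT INTO `ci_mem_test` (`id`) VALUES (0)');
}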

timw4mail commented 12 years ago

@narfbg you can do something like this

foreach( $var as &$ref)

which also saves memory and is still valid in newer versions of PHP

narfbg commented 12 years ago

@timw4mail It saves memory compared to the version without references, not compared to for(). But that's not the point - I'm trying to explain that you can't write code that copies the same huge data set over and over again and expect it to magically get a performance boost. :)

toopay commented 12 years ago

Agree with @narfbg , this issue shows how functionally murderous the foreach() loop can be.

juanitomint commented 12 years ago

@gdragffy: I have this process, but with more than 1M rows, working perfectly with an 8MB PHP memory limit and ADOdb. If you don't load every record into memory, then you don't have a problem with the result set size. But my concern is about consistency in the way the code is written. I don't have a problem with the job, CI does. How can I help? May I rewrite some of the DB functionality?

juanitomint commented 12 years ago

Well, I was reading the ADOdb lib and they rely a lot on the EOF var, so I came up with this mod to the DB_result.php class. You need to add:

var $EOF = false;

at the beginning of the class, and the function next_row() becomes:

public function next_row($type = 'object')
{
    if ($this->EOF)
    {
        return FALSE;
    }

    if ($type === 'object')
    {
        $rtnres = $this->_fetch_object();
    }
    else
    {
        $rtnres = $this->_fetch_assoc();
    }

    if (($this->current_row + 1) < $this->num_rows())
    {
        ++$this->current_row;
    }
    else
    {
        $this->EOF = TRUE;
    }

    return $rtnres;
}

Then I can walk over 1M records and there is no memory overhead (with 8Mb for PHP script)

while (!$query->EOF)
{
    $row = $query->next_row('array');
    //--- run your process here
}

ckdarby commented 12 years ago

@gdragffy @juanitomint I attempted the same with PDO & it used about 1.2 MB over all 400K records.

I understand CI should have some overhead compared to PDO but not as much as what has been outlined in this discussion.
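
For comparison, this is roughly the shape of PDO loop that stays flat on memory - a sketch assuming an unbuffered MySQL query, with placeholder connection details:

$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Ask the mysql driver not to buffer the whole result set on the client side.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT id FROM ci_mem_test');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    // only the current row is held in PHP memory
}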

narfbg commented 12 years ago

Is anybody reading me at all? The maximum length of the id field given in the issue description is 6 characters. Considering the generated query should be the following one:

INSERT INTO `ci_mem_test` (`id`) VALUES (?) --- query length is 42 without the id value length

Multiply that by 400000, add the ids (around 800k) and you'll get around 19mb of raw data (not counting what PHP actually adds in order to store it in memory). You'll save that just by setting save_queries to FALSE or using simple_query() for the inserts. And if you're using UTF-8 - you're doubling that number.
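
As a rough check of those numbers (assuming one byte per character):

400,000 queries x 42 bytes ≈ 16.8 MB of query text
+ id values                ≈  0.8 MB
                           ≈ 17.6 MB of raw data kept in the query history,
                             before PHP's own per-string storage overhead.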

Plus, line 167 in system/database/drivers/mysql/mysql_result.php is an empty line in the current develop branch, just after the CI_DB_mysql_result class declaration ends. I can assure you that the latest code available here on the GitHub repository behaves way better than whatever CI version you're using. Have you tried it?

juanitomint commented 12 years ago

@narfbg LOL! hahaha... I'm reading you, but I'm not focused on the insert problem right now. Did you see what I did with the next_row() function? I've made it "results" independent, and it works like a charm. I will read the libs here instead of the ones in the distrib package to see if my mod is still valid. Is there a test suite that I can run against this function?

narfbg commented 12 years ago

@juanitomint Considering that your example first creates a table and does 400k inserts - it is part of what you should focus on. :)

On next_row() (and you should be dealing with row() instead) - it's supposed to iterate over result_array and/or result_object's elements, so that can't go into CI. The CI_DB_result class is supposed to be able to return multiple types of results (an array of stdClass objects, an array of associative arrays, an array of custom objects, single rows, etc.) with only executing the query once, so pretty much everything in it depends on storing the result set in a variable first. That's why I mentioned new features in a previous comment and not modifying the existing code. In other words - I agree that there should be a way to fetch single rows without CI buffering them, but that just doesn't belong in the existing methods.
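
For context, this is the kind of flexibility being described - all of the following work off a single query execution precisely because the result set is buffered internally (standard CI_DB_result usage):

$query = $this->db->get('ci_mem_test');

$objects = $query->result();        // array of stdClass objects
$arrays  = $query->result_array();  // the same rows as associative arrays
$first   = $query->row();           // a single row, as an object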

juanitomint commented 12 years ago

@narfbg

"so pretty much everything in it depends on storing the result set in a variable first" - I've never seen this in an abstraction layer; the result set is a pointer and each driver provides functions to navigate through it without loading it into memory. Take a look at: http://php.net/manual/en/book.mysql.php You probably already know this, but if you need to load the whole result set into memory to replicate what the drivers do without that overhead, then IMHO you're doing it wrong. But you don't have to worry, I've already forked my own CI and I'm working on it. Thanks for the help and for sharing your point of view!

narfbg commented 12 years ago

That's just how it's done in CI. Not because it's right or wrong, but because it's needed for flexibility. If you do something like e.g. unbuffered_row() in your fork and you're willing to contribute it - we'll be happy to have that. :)

juanitomint commented 12 years ago

Mmm, I see... So what do you think would be the best strategy to address this issue? Can you explain better, if you have time, what kind of flexibility is needed (especially the kind provided by loading the whole result set into memory)? Should I rewrite DB_result to fit my needs and maintain it as a separate lib, or as a mod?

narfbg commented 12 years ago

Well, for example - we would never be able to use the result(), result_object(), result_array() and custom_result_object() methods if the whole result set is not already fetched, simply because those methods return arrays containing the result sets.

Also, while next_row() and last_row() might be able to work without prefetching, first_row() needs to be able to go back to the very first one, and row() allows you to specify a specific row number to be returned. An argument could be made that there are functions like mysql_data_seek() which would help in that case, but most of the supported database platforms don't provide such functionality, and an alternative solution must be implemented (and that solution will always involve prefetching). num_rows() also requires e.g. a count($this->result_array) for drivers that don't have such functionality.

Another advantage gained in some cases is (believe it or not) performance. If for some reason you need to get the data set twice (in a different format for example) - returning the already fetched value is always faster than fetching it again.

The best strategy for maximum performance is to always deal directly with the rows one by one (when possible). It's why almost no DB vendor has provided a way to fetch the whole result set with just one function/method call (I personally only recall something like this available with SQLite3). So, basically what you need (and what would fit into CodeIgniter) is a method that fetches the next row in the desired form (array or object) and directly returns it, without doing anything else with it. And as I mentioned - unbuffered_row() is a suitable name. ;) And of course - it would need to be properly documented, so that there's no confusion around its purpose.
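
For illustration, a sketch of how such a method might be consumed from a controller; the name unbuffered_row() and the 'array'/'object' argument follow the discussion above, but the exact signature was still to be settled at this point:

$query = $this->db->get('ci_mem_test');

// Fetch and process one row at a time; nothing is appended to an internal buffer.
while ($row = $query->unbuffered_row('array'))
{
    // process $row
}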

juanitomint commented 12 years ago

OK, great! Got it now, I will go for unbuffered_row() then... expect a pull request soon!

narfbg commented 12 years ago

Okay, I'll be waiting for it. :) Closing this one.

Mr-Faizan commented 5 years ago

Thank you so much @narfbg, your solution works perfectly for me. I was facing a memory exhaustion issue and had been searching for a week. Finally this worked for me:

$this->db->save_queries = FALSE;

Although you gave this solution in 2012, it is still helpful :)