flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License

Allowed memory size exhausted #1044

Closed ITspirit closed 6 months ago

ITspirit commented 7 months ago

Hello,

I'm new to flow-php/etl and giving it a try, because I have very, very big CSV files to import into a database.

Until now I've imported them via "LOAD DATA INFILE", but that has a problem: if one row is "buggy", the complete chunk gets thrown away.

With flow-php I already ran into an "Allowed memory size exhausted" error with one of the smaller files. :-(

What am I doing wrong?

            data_frame()
                ->read(from_csv('my.csv',
                    with_header: false
                ))
                ->rename("e00", "country")
                ->rename("e01", "postal_code")
                ->rename("e02", "name")
                ->rename("e03", "admin1_name")
                ->rename("e04", "admin1_code")
                ->rename("e05", "admin2_name")
                ->rename("e06", "admin2_code")
                ->rename("e07", "admin3_name")
                ->rename("e08", "admin3_code")
                ->rename("e09", "latitude")
                ->rename("e10", "longitude")
                ->rename("e11", "accuracy")
                ->batchSize(25000)
                ->load(to_dbal_table_insert($this->db, 'my_table_name'))
                ->run();

The batch size is chosen low, but it still breaks on the first batch.

Is there a possibility to "stream" or chunk the read in from_csv?

I had hoped that flow-php would take care of that.

Any help is really appreciated, because I can't find anything suitable in the documentation. :/

norberttech commented 7 months ago

Hey @ITspirit, it seems that you have two issues, let's try to approach them one by one:

Memory Issue

Before we move to a solution, let's make sure we are on the same page about some core concepts. Do you have a specific memory limit in mind? As you probably know, in PHP the memory limit can be set, for example, through the php.ini file.

If you run the following command in the CLI:

$ php -i | grep "memory_limit"

The output should tell you how much memory PHP can use. If you get -1, PHP is not limited in any way and will try to use all available memory.
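
You can also check the limit from inside the script itself; a quick sketch in plain PHP, nothing Flow-specific:

<?php

// Prints the configured limit, e.g. "512M", or "-1" for unlimited.
var_dump(ini_get('memory_limit'));

// Optionally raise it for this one script (assuming your setup allows overriding it at runtime).
ini_set('memory_limit', '512M');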

Ok, now that we know how to check/adjust the memory limit, let's look at your problem.

For the sake of this conversation, let's assume you want to allocate up to 50Mb of memory to process that CSV file. If the CSV file is bigger than 50Mb (let's say 1Gb), loading it into memory all at once will fail.

So our goal is to make sure that the size of the rows Flow keeps in memory stays below your memory limit (minus roughly 10% that could be used by parts of your PHP script unrelated to Flow, like bootstrapping).

Now, batchSize defines the number of rows that Flow keeps in memory, but also how many rows the Loader will try to save to the destination at once.

->batchSize(25000)

This pretty much means that Flow is going to read 25k rows at a time from that file (the size of each row might differ, depending on your data), but also that it will try to make a single DB insert with 25k rows at once.

I'm not sure how many rows you are trying to process in total, but 25k sounds like a lot for one batch (unless you have unlimited memory).

So to summarise: lower the batchSize until the rows Flow keeps in memory (batch size × average row size) fit comfortably under your memory limit.
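
For example, keeping your pipeline exactly as it is and only lowering the batch size (1000 below is just an illustration, you will need to tune it against your real memory limit):

data_frame()
    ->read(from_csv('my.csv',
        with_header: false
    ))
    // ... your rename() calls stay exactly the same ...
    ->batchSize(1000) // far fewer rows kept in memory and sent per insert than 25000
    ->load(to_dbal_table_insert($this->db, 'my_table_name'))
    ->run();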

Broken Rows

but that has a problem: if one row is "buggy", the complete chunk gets thrown away.

This one I'm not sure I fully understand, could you please give me a few examples of what good and bad rows look like? I don't need to see real data; it's more about separators and enclosures.

ITspirit commented 7 months ago

Hey @norberttech,

Thanks for your fast reply.

Regarding Memory Issue

Well, regarding the memory issue, I know how PHP works. I've already set the memory limit to 512 MB (more makes little sense, unless I use a dedicated import-only PHP instance/pool, which I currently don't plan to do). So I have to work with 512 MB.

With a smaller batchSize the import would take a very, very long time. But OK, I understand what you mean.

But even with ->batchSize(5000) I run into:

In ExpandArrayParameters.php line 76:
Error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 1921024 bytes)  

Example

We can try with a real-world example, no secret data. I will try to import this file:

https://download.geonames.org/export/zip/allCountries.zip

The file is sometimes very big, which is why I split such big files (after unzipping) with:

split -d -l 500000 allCountries.txt allCountriesChunk

For the import I've created this table:

CREATE TABLE IF NOT EXISTS my_table_name (
              `country` char(2) DEFAULT NULL,
              `postal_code` varchar(20) DEFAULT NULL,
              `name` varchar(180) DEFAULT NULL,
              `admin1_name` varchar(100) DEFAULT NULL,
              `admin1_code` varchar(20) DEFAULT NULL,
              `admin2_name` varchar(100) DEFAULT NULL,
              `admin2_code` varchar(20) DEFAULT NULL,
              `admin3_name` varchar(100) DEFAULT NULL,
              `admin3_code` varchar(20) DEFAULT NULL,
              `latitude` decimal(10,7) DEFAULT NULL,
              `longitude` decimal(10,7) DEFAULT NULL,
              `accuracy` tinyint(2) DEFAULT NULL,
              KEY `admin1_code` (`admin1_code`),
              KEY `country` (`country`),
              KEY `name` (`name`),
              KEY `postal_code` (`postal_code`),
              KEY `latitude` (`latitude`),
              KEY `longitude` (`longitude`),
              KEY `admin1_name` (`admin1_name`),
              KEY `admin1_code_2` (`admin1_code`)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci;

So, that's the setup I have.

The import works fast with:

LOAD DATA LOCAL INFILE ? INTO TABLE my_table_name  CHARACTER SET utf8 (country, postal_code, name, admin1_name, admin1_code, admin2_name, admin2_code, admin3_name, admin3_code, @latitude, @longitude, @accuracy) SET accuracy = nullif(@accuracy, ''), latitude = nullif(@latitude, ''), longitude = nullif(@longitude, '')

But I need to make this more general and more extensible, because that's just a simple example. That's why I want to switch to flow-php/etl.

Broken Rows

That happens most of the time with this file:

https://download.geonames.org/export/dump/allCountries.zip (looks similar to the other link, but it's a completely different and much, much bigger file)

LOAD DATA LOCAL INFILE fails on each split chunk with an encoding error. It tells me there is a mis-encoding ("utf8mb3"), but the target table is entirely utf8mb4 and the file looks like normal UTF-8, so I have no clue why some lines trigger this error. The result is that the whole file gets skipped. I had the idea that an import via flow-php/etl could maybe solve this issue.
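
My hope is basically to handle this per row instead of failing the whole chunk. A rough sketch of the kind of pre-cleaning I have in mind, in plain PHP (the ISO-8859-1 fallback is just a guess at the real source encoding, and the file names are only examples):

<?php

// Hypothetical pre-cleaning pass over one split chunk before the actual import.
$in = fopen('allCountriesChunk00', 'rb');
$out = fopen('allCountriesChunk00.clean', 'wb');

while (($line = fgets($in)) !== false) {
    if (!mb_check_encoding($line, 'UTF-8')) {
        // Either skip the line or try to re-encode it from a guessed source encoding.
        $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
    }
    fwrite($out, $line);
}

fclose($in);
fclose($out);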

doctrine-dbal-bulk

Maybe this adapter could solve my problems a little, but to be honest I don't have a clue how it works together with flow-php/etl, or whether it's completely standalone. The documentation is a little unclear on this point.

norberttech commented 7 months ago

Thanks for the very detailed description; thanks to it I was able to fully reproduce your code locally, which is super helpful 💪

I'm working with memory_limit=-1 and a MySQL instance in Docker. I was able to run multiple different examples on a subset of this dataset.

Looking at those values, I think we might have a memory leak somewhere, since for the same batch size and different limits, memory consumption should be constant. (Of course, without Blackfire both processing time and probably memory consumption would be much lower, but that's irrelevant for now.)

I'm going to need to look into this and try to determine where the memory is leaking (it could be anywhere between reading and writing, including in dbal). I can't promise anything as I'm out for the weekend, but it's now going to be one of my top priorities.

Update: please look at the comment below.

Processing that 115Mb file without profiling and batch size 5k looks like this on my machine:

Processing: countries.tsv (115Mb) - batchSize: 5000 - limit: no-limit...
DB Entries count: 1 557 190
Total time: 224.21s

Here is my code (as you can see I'm also using Blackfire to get very detailed profiles):

<?php

use Blackfire\Client;
use Blackfire\Profile\Configuration;
use Doctrine\DBAL\DriverManager;
use function Flow\ETL\Adapter\CSV\from_csv;
use function Flow\ETL\Adapter\Doctrine\to_dbal_table_insert;
use function Flow\ETL\DSL\data_frame;

require __DIR__ . '/../../vendor/autoload.php';
include __DIR__ . '/../../tools/blackfire/vendor/autoload.php';

$connectionParams = [
    'dbname' => 'mysql',
    'user' => 'mysql',
    'password' => 'mysql',
    'host' => '127.0.0.1',
    'port' => 3307,
    'driver' => 'pdo_mysql',
];
$connection = DriverManager::getConnection($connectionParams);

$connection->executeQuery(
    <<<SQL
CREATE TABLE IF NOT EXISTS countries (
              `country` char(2) DEFAULT NULL,
              `postal_code` varchar(20) DEFAULT NULL,
              `name` varchar(180) DEFAULT NULL,
              `admin1_name` varchar(100) DEFAULT NULL,
              `admin1_code` varchar(20) DEFAULT NULL,
              `admin2_name` varchar(100) DEFAULT NULL,
              `admin2_code` varchar(20) DEFAULT NULL,
              `admin3_name` varchar(100) DEFAULT NULL,
              `admin3_code` varchar(20) DEFAULT NULL,
              `latitude` decimal(10,7) DEFAULT NULL,
              `longitude` decimal(10,7) DEFAULT NULL,
              `accuracy` tinyint(2) DEFAULT NULL,
              KEY `admin1_code` (`admin1_code`),
              KEY `country` (`country`),
              KEY `name` (`name`),
              KEY `postal_code` (`postal_code`),
              KEY `latitude` (`latitude`),
              KEY `longitude` (`longitude`),
              KEY `admin1_name` (`admin1_name`),
              KEY `admin1_code_2` (`admin1_code`)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci;
SQL);
$connection->executeQuery('TRUNCATE TABLE countries');

$blackfire = new Client();

$batchSize = 5000;
$limit = 25000;

$size = round(\filesize(__DIR__ . '/countries.tsv') / 1024 / 1024) . "Mb";

echo "Processing: countries.tsv ($size) - batchSize: $batchSize - limit: $limit...\n";

$probe = $blackfire->createProbe(
    (new Configuration())
        ->setTitle("countries.tsv ($size) - batchSize: $batchSize - limit: $limit"),
    false
);

$probe->enable();

data_frame()
    ->read(from_csv(__DIR__ . '/countries.tsv',
        with_header: false
    ))
    ->rename("e00", "country")
    ->rename("e01", "postal_code")
    ->rename("e02", "name")
    ->rename("e03", "admin1_name")
    ->rename("e04", "admin1_code")
    ->rename("e05", "admin2_name")
    ->rename("e06", "admin2_code")
    ->rename("e07", "admin3_name")
    ->rename("e08", "admin3_code")
    ->rename("e09", "latitude")
    ->rename("e10", "longitude")
    ->rename("e11", "accuracy")
    ->limit($limit)
    ->batchSize($batchSize)
    ->load(to_dbal_table_insert($connection, 'countries'))
    ->run();

$profile = $blackfire->endProbe($probe);

echo "Profile url: " . $profile->getUrl() . "\n";
echo "DB Entries count: " . (int) $connection->fetchOne('SELECT COUNT(*) FROM countries') . "\n";

About dbal bulk: yes, it's perfectly usable as a standalone library. Example:

Bulk::create()->insert(
    $this->connection,
    'table_name',
    new BulkData([
        [...],
        [...],
        [...]
    ]),
);
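
If you feed it yourself, you would normally chunk the rows so that a single insert stays small; a sketch, assuming $rows is an array of associative arrays matching the table columns:

// Insert in chunks of 1000 rows so a single statement stays reasonably sized.
foreach (\array_chunk($rows, 1000) as $chunk) {
    Bulk::create()->insert(
        $this->connection,
        'table_name',
        new BulkData($chunk),
    );
}
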
norberttech commented 7 months ago

Ok, I spent more time on this, and I think I jumped too quickly to the conclusion that there is a memory leak somewhere in the code. Blackfire is indeed showing increased memory consumption based on the limit, but I believe that's actually more related to how Blackfire injects itself into the process. I did another test by adding a callback to the run() method:

    ->run(static function (Rows $rows) : void {
        echo \json_encode([
                'usage' => round(memory_get_usage(true) / 1024 / 1024) . "Mb",
                'peak' => round(memory_get_peak_usage(true) / 1024 / 1024) . "Mb",
            ]) . "\n";
    });

This test gave me very stable memory consumption regardless of the limit. You might want to try the same thing.

Another thing I did, since I'm using macOS, is check the Activity Monitor, which shows me results pretty similar to the code above, which also confirms that memory is not leaking.

Could you put a similar callback in your code and share the output with me?

norberttech commented 7 months ago

Also, another question just came to my mind: what operating system are you using?

ITspirit commented 7 months ago

Hello @norberttech ,

I'm working on Ubuntu 22, but everything (PHP and MySQL) runs in Docker containers.

I ran it again with your callback. Full file, with batch size 5000.

As you can see, usage and peak are constantly growing.

I set the memory_limit to "-1", so it finished, but it took a very, very long time and needed a huge amount of memory. With much bigger files I will certainly run into the limits of the host machine. :/

... extracting archive: /var/tmp/geodata/allCountries.zip
...... finished extracting
... executing command: split -d -l 500000 allCountries.txt allCountriesChunk
... importing file: /var/tmp/geodata/allCountriesChunk00 >>>> bundle_geodata_postalcodes
... TRUNCATE bundle_geodata_postalcodes
{"usage":"125Mb","peak":"127Mb"}
{"usage":"143Mb","peak":"145Mb"}
{"usage":"145Mb","peak":"152Mb"}
{"usage":"149Mb","peak":"161Mb"}
{"usage":"155Mb","peak":"165Mb"}
{"usage":"163Mb","peak":"173Mb"}
{"usage":"167Mb","peak":"179Mb"}
{"usage":"171Mb","peak":"183Mb"}
{"usage":"175Mb","peak":"187Mb"}
{"usage":"179Mb","peak":"191Mb"}
{"usage":"183Mb","peak":"195Mb"}
{"usage":"187Mb","peak":"199Mb"}
{"usage":"191Mb","peak":"203Mb"}
{"usage":"195Mb","peak":"207Mb"}
{"usage":"199Mb","peak":"211Mb"}
{"usage":"203Mb","peak":"215Mb"}
{"usage":"207Mb","peak":"219Mb"}
{"usage":"211Mb","peak":"223Mb"}
{"usage":"215Mb","peak":"227Mb"}
{"usage":"219Mb","peak":"231Mb"}
{"usage":"223Mb","peak":"235Mb"}
{"usage":"227Mb","peak":"239Mb"}
{"usage":"231Mb","peak":"243Mb"}
.
.
.
{"usage":"1343Mb","peak":"1355Mb"}
{"usage":"1347Mb","peak":"1359Mb"}
{"usage":"1351Mb","peak":"1363Mb"}
{"usage":"1355Mb","peak":"1367Mb"}
{"usage":"1359Mb","peak":"1371Mb"}
{"usage":"1363Mb","peak":"1375Mb"}
{"usage":"1367Mb","peak":"1379Mb"}
{"usage":"1371Mb","peak":"1383Mb"}
{"usage":"1375Mb","peak":"1387Mb"}
{"usage":"1379Mb","peak":"1391Mb"}
{"usage":"1383Mb","peak":"1395Mb"}
{"usage":"1383Mb","peak":"1395Mb"}
norberttech commented 7 months ago

That's super weird. I spent some time yesterday playing with the code I shared above and https://github.com/arnaud-lb/php-memory-profiler, but I didn't find anything significant.

Here is the output from my script:

Processing (60059): countries.tsv (115Mb) - batchSize: 5000 - limit: no-limit...
{"usage":"66Mb","peak":"77Mb"}
{"usage":"84Mb","peak":"93Mb"}
{"usage":"84Mb","peak":"99Mb"}
{"usage":"84Mb","peak":"101Mb"}
{"usage":"86Mb","peak":"101Mb"}
{"usage":"86Mb","peak":"103Mb"}
{"usage":"86Mb","peak":"103Mb"}
{"usage":"86Mb","peak":"103Mb"}
{"usage":"86Mb","peak":"103Mb"}
{"usage":"88Mb","peak":"103Mb"}
{"usage":"88Mb","peak":"103Mb"}
{"usage":"88Mb","peak":"105Mb"}
{"usage":"88Mb","peak":"105Mb"}
{"usage":"88Mb","peak":"105Mb"}
{"usage":"88Mb","peak":"105Mb"}
....
{"usage":"92Mb","peak":"109Mb"}
{"usage":"92Mb","peak":"109Mb"}
{"usage":"92Mb","peak":"109Mb"}
{"usage":"92Mb","peak":"109Mb"}
Finished 
DB Entries count: 1557190
Time: 186.19s

It is slightly increasing memory over time, but not nearly as badly as for you. Do you have any other code before/after the ETL pipeline? If yes, could you try to extract the ETL pipeline into a standalone script that does nothing more than process the CSV file and load it into the DB, and check whether memory consumption grows equally fast? Could you also try to run the same script I shared above on the same dataset, to check if we are getting similar results?

Also, what extensions do you have enabled? Type php -m in the console.

I'm working on macOS (M1), but I might need to set up a virtual machine to try to reproduce this issue.

norberttech commented 7 months ago

Oh, and one more thing that might be relevant: could you try replacing

    ->write(to_dbal_table_insert($connection, 'my_table'))

with

    ->saveMode(overwrite())
    ->load(to_csv(__DIR__ . '/output_countries.csv'))

I suspect the dbal configuration is causing the memory leak, for example some in-memory logger.
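
If that's the case, one thing worth checking is whether an SQL logger is attached to the connection; a sketch, assuming DBAL 2/3 where the (now deprecated) SQLLogger API is still available:

// A profiler/debug setup may register an in-memory SQL logger that keeps
// every executed query for the lifetime of the process.
$connection->getConfiguration()->setSQLLogger(null);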

ITspirit commented 7 months ago

Hello again,

OK, with to_csv I get constant memory usage:

{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}
{"usage":"125Mb","peak":"125Mb"}

So yes, it looks like the dbal configuration/setup is responsible for the memory consumption. :/

My command runs in the context of a Symfony application (Pimcore) with Doctrine.

Docker images in use: mysql:8.0.31, pimcore/pimcore:php8.2-debug-v3

This PHP version has a lot of extensions enabled:

[PHP Modules]
apcu
bcmath
Core
ctype
curl
date
dom
exif
fileinfo
filter
gd
hash
iconv
imagick
intl
json
libxml
mbstring
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_sqlite
Phar
posix
random
readline
redis
Reflection
session
SimpleXML
sockets
sodium
SPL
sqlite3
standard
tokenizer
xdebug
xml
xmlreader
xmlwriter
Zend OPcache
zip
zlib

[Zend Modules]
Xdebug
Zend OPcache

I'll try to figure out more tomorrow at work. :-)

norberttech commented 7 months ago

Hey @ITspirit, did you manage to figure out what's wrong here?

ITspirit commented 7 months ago

Not yet. A business "fire" prevented me from moving forward on this topic this week. I will be out of the office till Sunday, so I'll look into it again once I'm back.

norberttech commented 6 months ago

Sounds good. Since this issue is most likely related to your development environment, not Flow, I'm closing it for now. However, if you manage to find any evidence that the memory leaks are somehow caused by Flow, please let me know and I will reopen the investigation.