Josiah / JJsGeonamesBundle

GeoNames.org geographical data and associated functionality

Slow Import #2

Open lunetics opened 10 years ago

lunetics commented 10 years ago

I experimented with this a bit. Shouldn't it be possible to import and save in smaller chunks, so that the entity manager can be cleared every N parsed entries? That should speed up the import.

Any ideas?

Josiah commented 10 years ago

@lunetics when the import runs in smaller chunks it actually takes longer. I originally flushed every 100 entries, but the overall import took much longer than with the 'all at once' approach.

By using one huge block we trade a large memory footprint for speed: the db indexes are rebuilt once and the whole import runs as a single transaction.

lunetics commented 10 years ago

I managed to cut the import time substantially (by roughly a third in the runs below) using small batched inserts and detaching / clearing old, unused entities.

Without detach / clear

geonames:load:localities --no-debug -env=prod -v AF
AF (Afghanistan) data saved
Imported in 59.005122 seconds.

With detach / clear

geonames:load:localities --no-debug -env=prod -v AF
AF (Afghanistan) data saved
Imported in 39.573471 seconds.

Also, there is no (unique) index on the geonames_id column in MySQL. Adding one helps a lot, because the import runs a select on that id for every entry and slows down as the table grows.
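To illustrate the index suggestion, here is a minimal sketch of a Doctrine annotation mapping that declares a unique constraint on geonames_id. The entity name `Locality`, the table name and the constraint name are hypothetical, and the bundle's real mapping (which may use a different mapping driver) is not shown in this thread.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

/**
 * Hypothetical entity for illustration only; not the bundle's actual class.
 *
 * @ORM\Entity
 * @ORM\Table(
 *     name="locality",
 *     uniqueConstraints={
 *         @ORM\UniqueConstraint(name="uniq_locality_geonames_id", columns={"geonames_id"})
 *     }
 * )
 */
class Locality
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * GeoNames identifier; the per-entry lookup during import hits this column.
     *
     * @ORM\Column(name="geonames_id", type="integer", nullable=true)
     */
    private $geonamesId;

    public function getGeonamesId(): ?int
    {
        return $this->geonamesId;
    }

    public function setGeonamesId(?int $geonamesId): void
    {
        $this->geonamesId = $geonamesId;
    }
}
```

With a mapping along these lines, the schema change would still have to be applied to the database, for example through a migration or `doctrine:schema:update`.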

I just added this piece of code at the end of each iteration of the while loop here:

https://github.com/Josiah/JJsGeonamesBundle/blob/master/Import/LocalityImporter.php#L619

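                'repository'   => get_class($localityRepository),
            ]);

            // Every 200 lines, flush pending writes and release the entities
            // handled so far, so the unit of work stays small.
            if ($lineNumber % 200 === 0) {
                foreach ($managers as $manager) {
                    $manager->flush();
                    foreach ($entities as $entity) {
                        // detach() is enough here; clear() expects an entity
                        // class name rather than an instance.
                        $manager->detach($entity);
                    }
                }

                // Reset the batch instead of unsetting the variable, so the
                // next iterations can keep using it.
                $entities = [];
            }
        }

Josiah commented 10 years ago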

Interesting, I guess that I was wrong!

Can you submit a PR? I'll merge it straight away.

lunetics commented 10 years ago

Still working on it; I'm looking to improve this already great bundle a little bit more. Your code is quite advanced, though, and I still need to understand how your repository structure works :)

Another option would be to load the raw file directly into MySQL via LOAD DATA INFILE and process / link the entities afterwards (LOAD DATA INFILE is extremely fast).
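A rough sketch of what the raw load might look like from PHP, assuming a hypothetical staging table named `geonames_staging` whose columns match the order of the tab-separated GeoNames dump. The helper name `buildLoadInfileSql` and the file path are made up for illustration; nothing like this exists in the bundle as far as this thread shows.

```php
<?php

/**
 * Builds a LOAD DATA statement for a tab-separated GeoNames dump.
 *
 * $quotedPath must already be quoted for SQL (for example with PDO::quote()).
 * $table is a hypothetical staging table name.
 */
function buildLoadInfileSql(string $quotedPath, string $table): string
{
    // Table names cannot be bound as parameters, so whitelist the identifier.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
        throw new InvalidArgumentException('Unexpected table name: ' . $table);
    }

    return "LOAD DATA LOCAL INFILE {$quotedPath} "
        . "INTO TABLE `{$table}` "
        . "CHARACTER SET utf8mb4 "
        . "FIELDS TERMINATED BY '\\t' "
        . "LINES TERMINATED BY '\\n'";
}

// Usage sketch (needs a MySQL connection with local infile enabled on both
// the client and the server, so it is left commented out here):
//
// $pdo = new PDO($dsn, $user, $password, [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
// $pdo->exec(buildLoadInfileSql($pdo->quote('/tmp/AF.txt'), 'geonames_staging'));
```

The entities would then be created or linked from the staging table in a second pass, which is the "process / link afterwards" step mentioned above.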

I'm also looking to load the alternate names into the database and to have some unified way of interacting with geonames_ids.