JeroenDeDauw / Replicator

CLI tool for importing entities from Wikidata / Wikibase
https://wikibase.consulting

Add continuation support for gz import #7

Closed addshore closed 8 years ago

addshore commented 8 years ago

Per #6 this one is also needed.

I am currently trying to import all wikibase entities and I am on my third attempt of the import.

JeroenDeDauw commented 8 years ago

Any errors? Why did you have to restart?

addshore commented 8 years ago

The first time I had to stop the import and restart it (no errors there). The second time the process stopped, although I don't know why, as it was running in a screen that auto-closed. :facepalm:

JeroenDeDauw commented 8 years ago

Right. It's quite easy to add an offset parameter that will have the persistence work skipped, though deserialization of entities will still happen. I don't know how the cost of these compares, so am not sure it would make much of a difference.

https://github.com/JeroenDeDauw/Replicator/blob/125b2a5ed3b749215b3dccd76deb37ac098ced22/src/Cli/Command/GzJsonImportCommand.php#L74 (only need changes in this one file)

addshore commented 8 years ago

Well, I imagine this would still speed up getting back to the point where you left off (no DB transactions!). Are entities not deserialized one at a time? Meaning you could count before deserializing?

JeroenDeDauw commented 8 years ago

They are deserialized one at a time, though if you use this Entity object level iterator, that is hidden from you. You could instead use the string level one and do the deserialization yourself.
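To illustrate the string-level approach, here is a minimal sketch in Python rather than in Replicator's PHP. It assumes the standard Wikidata JSON dump layout (a JSON array with one entity per line). The names `iter_entity_lines`, `import_entities`, `offset` and `persist` are hypothetical and are not part of Replicator or JsonDumpReader. Skipped entities are counted as raw lines and never deserialized.

```python
import gzip
import json
from typing import Callable, Iterator


def iter_entity_lines(path: str) -> Iterator[str]:
    """Yield one raw JSON string per entity, without deserializing it."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            # Each entity sits on its own line, followed by a comma;
            # the first and last lines are the array brackets.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield line


def import_entities(path: str, offset: int, persist: Callable[[dict], None]) -> int:
    """Skip the first `offset` entities, then deserialize and persist the rest."""
    imported = 0
    for index, line in enumerate(iter_entity_lines(path)):
        if index < offset:
            continue  # counted only: no deserialization, no persistence
        persist(json.loads(line))  # hypothetical persistence callback
        imported += 1
    return imported
```

With this shape, the cost of getting back to the offset is limited to decompression and line splitting.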

This actually suggests it might be good to not hide the fact that the higher level iterators are based on lower level ones in https://github.com/JeroenDeDauw/JsonDumpReader/blob/master/src/JsonDumpFactory.php#L105. Could add two new factory functions there. See https://github.com/JeroenDeDauw/JsonDumpReader/issues/3

JeroenDeDauw commented 8 years ago

The gz stuff actually supports file seek, unlike bz2. Now supported by JsonDumpReader: https://github.com/JeroenDeDauw/JsonDumpReader/commit/9e57c3c0ac5896d4db7365c074c9ae7048236b41
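As a rough illustration of seek-based continuation, the following Python sketch (not the JsonDumpReader code from the commit above) resumes an import from a saved position. `import_from`, `persist` and `limit` are hypothetical names, and the one-entity-per-line dump layout is an assumption. The position is an offset into the uncompressed stream. Python's `gzip` module implements the seek by decompressing forward to that offset, so the skipped part still costs decompression but no deserialization or persistence.

```python
import gzip
import json
from typing import Callable, Optional


def import_from(path: str, position: int, persist: Callable[[dict], None],
                limit: Optional[int] = None) -> int:
    """Import entities starting at uncompressed byte `position`.

    Returns the position to pass in on the next run.
    """
    with gzip.open(path, "rb") as dump:
        dump.seek(position)
        done = 0
        while limit is None or done < limit:
            raw = dump.readline()
            if not raw:
                break  # end of dump
            position = dump.tell()  # offset just past the line we read
            line = raw.decode("utf-8").strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # array brackets and blank lines carry no entity
            persist(json.loads(line))  # hypothetical persistence callback
            done += 1
    return position
```

Storing the returned position after each batch would let an interrupted import pick up where it stopped instead of starting over.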

JeroenDeDauw commented 8 years ago

920f03be59c2d15e7342d731409f5c5d8ca2da87