JeroenDeDauw / Replicator

CLI tool for importing entities from Wikidata / Wikibase
https://wikibase.consulting

Accented characters causing QuantityValue error #10

Closed. beerinfinity closed this issue 7 years ago.

beerinfinity commented 7 years ago

When attempting to import entities that include accented characters, the insert into the Query Store fails, e.g. Beyren-lès-Sierck:

php replicator import:api -vvv Q21883

Importing entity 1: Q21883...
    * Deserializing... done (16.26 ms).
    * Inserting into Dump store... done (45.43 ms).
    * Inserting into Term store... done (140.52 ms).
    * Inserting into Query store... FAILED!
     Error details: Value is not a QuantityValue.

Yet when importing entities without accented characters, the insert into the Query Store works fine, e.g. Malpighiales:

php replicator import:api -vvv Q21887

Importing entity 1: Q21887...
    * Deserializing... done (16.08 ms).
    * Inserting into Dump store... done (57.53 ms).
    * Inserting into Term store... done (231.62 ms).
    * Inserting into Query store... done (90.93 ms).
     Entity imported.

This happens for multiple examples, with the QuantityValue error appearing to be thrown as an InvalidArgumentException within the following file:

'vendor/jeroen/query-engine/src/SQLStore/DVHandler/QuantityHandler.php'

The same happens when processing a JSON dump file.

Any help or solutions would be much appreciated.

JeroenDeDauw commented 7 years ago

What are you trying to achieve by using Replicator? The "Query store" never really got finished and is not currently maintained, so unless you really want to use it, I would suggest disabling it.

beerinfinity commented 7 years ago

Jeroen,

Originally I was trying to populate a local instance of Wikidata, built on MediaWiki, to run SPARQL queries against in order to extract data. When running the queries against live Wikidata they frequently time out, either when pulling back large volumes of data (more than a few hundred rows) or when they are complex.

I found a tool that extracts data and will load a local instance but encountered two problems:

  1. To do this for a large amount of data would take days or even weeks and would put a huge load on the Wikidata site.
  2. When populating the local instance it generated its own Q values rather than reusing the ones from Wikidata.

I contacted the author about the first point to see if it could run from JSON dumps, but was told no.

As there seem to be no tools available to do the above properly, I have since been looking at alternatives: populating tables within MySQL which I would then run SQL queries against, either standalone or manipulated via PHP. Replicator seemed a good choice, but the failure rate on loading due to these characters was around 50% on initial runs, hence why I raised this issue about the Query Store.

You mention disabling Query Store - presumably that is a flag somewhere?

Do you have any suggestions for other tools that might help me out? I have seen CouchDB and an associated loader that I may take a look at.

Many thanks in advance.

Best wishes,

Phil

JeroenDeDauw commented 7 years ago

The tool has no flags on the CLI level for the "import targets" at present. So you will need to make a minute change to the PHP.

At the beginning of this line, put //, so it looks like this:

            [
                // new EntityStoreEntityHandler( $this->newEntityStore() )
            ],

That should do it... I have not tried it and it has been a while since I poked at this, so let me know if not.

Also see "Import targets" in https://www.entropywins.wtf/blog/2016/01/25/replicator-a-cli-tool-for-wikidata/. If you're interested in having the tool write to something else, such as CouchDB or some sparql store, that should be easy to do, at least on the side of Replicator.

JeroenDeDauw commented 7 years ago

Also note that the API client library the tool currently uses is a bit out of date. That might cause some errors for newer formats of data in Wikidata... I'll have a look at that soon and upgrade the stuff.

Which PHP version are you using?

beerinfinity commented 7 years ago

I am using PHP 7.0. Many thanks for the feedback. I will have a dig around using your suggestions and see what I can get working.

JeroenDeDauw commented 7 years ago

I'm updating the thing now to work with recent Wikibase data. Travis is having issues, so it will be some hours before this is merged: https://github.com/JeroenDeDauw/Replicator/pull/11

JeroenDeDauw commented 7 years ago

Oh, and that PR disables the query engine by default, so you won't need to comment out the code like I suggested.

JeroenDeDauw commented 7 years ago

ubuntu@ubuntu-xenial:/vagrant$ php replicator import:api -vvv Q21883

Importing entity 1: Q21883...
        * Deserializing... done (98.71 ms).
        * Inserting into Dump store... done (66.86 ms).
        * Inserting into Term store... done (15.62 ms).
         Entity imported.

Import stats:
Entities: 1 (1 succeeded, 0 (0%) failed)
Duration: 0.566293 seconds (1 entities/second)

PS: Vagrant is now supported out of the box

beerinfinity commented 7 years ago

Many thanks, I will test this tomorrow.

beerinfinity commented 7 years ago

I downloaded the latest JSON dump for Wikidata and ran the new version, but I am still getting errors (I added a bit extra to the 'throw new InvalidArgumentException' statements to show the actual values):

Importing entity 1: Q22...
    * Deserializing... done (27.34 ms).
    * Inserting into Dump store... done (195.45 ms).
    * Inserting into Term store... done (82.24 ms).
    * Inserting into Query store... FAILED!
     Error details: Value is not a QuantityValue. value = +78782http://www.wikidata.org/entity/Q712226

Importing entity 2: Q31...
    * Deserializing... done (93.03 ms).
    * Inserting into Dump store... done (791.07 ms).
    * Inserting into Term store... done (354.32 ms).
    * Inserting into Query store... FAILED!
     Error details: Units other than "1" are not yet supported. value->getUnit() = http://www.wikidata.org/entity/Q4917

Importing entity 3: Q1...
    * Deserializing... done (14.69 ms).
    * Inserting into Dump store... done (181.27 ms).
    * Inserting into Term store... done (103.32 ms).
    * Inserting into Query store... FAILED!
     Error details: Value is not a QuantityValue. value = +880000000000000000000000http://www.wikidata.org/entity/Q828224

Importing entity 4: Q13...
    * Deserializing... done (4.53 ms).
    * Inserting into Dump store... done (151.18 ms).
    * Inserting into Term store... done (50.5 ms).
    * Inserting into Query store... done (2.95 ms).
     Entity imported.

Importing entity 5: Q23...
    * Deserializing... done (39.78 ms).
    * Inserting into Dump store... done (253.5 ms).
    * Inserting into Term store... done (247.91 ms).
    * Inserting into Query store... FAILED!
     Error details: Value is not a QuantityValue. value = +0

Import stats:
Entities: 5 (1 succeeded, 4 (80%) failed)
Duration: 2.86479 seconds (1 entities/second)

It seems there are two issues within 'vendor/query-engine/src/SQLStore/DVHandler/QuantityHandler.php'. Going by the entities, these are not limited to accented characters, so this may be a different issue to the one above, or perhaps the accented characters were a red herring.
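
My best guess at the kind of guards producing these two messages, as a sketch only (this is not the actual QueryEngine code, just my reading of the errors; the DataValues class names exist in the data-values/number library, everything else is illustrative):

    <?php
    // Minimal reconstruction of the checks that would produce both messages.
    use DataValues\DataValue;
    use DataValues\QuantityValue;

    function assertSupportedQuantity( DataValue $value ) {
        // Quantities without upper/lower bounds appear to deserialize as
        // UnboundedQuantityValue, the parent class of QuantityValue, so a strict
        // instanceof check like this rejects them ("Value is not a QuantityValue").
        if ( !( $value instanceof QuantityValue ) ) {
            throw new InvalidArgumentException( 'Value is not a QuantityValue.' );
        }

        // Statements carrying a real unit, e.g. http://www.wikidata.org/entity/Q4917,
        // would fail a unit check like this one.
        if ( $value->getUnit() !== '1' ) {
            throw new InvalidArgumentException( 'Units other than "1" are not yet supported.' );
        }
    }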

JeroenDeDauw commented 7 years ago

The new version does not use the query store, which is where you are getting errors from. So either you are running the old version, or you enabled the query store somehow. In the case of the latter, the error is not surprising, since I did not test or update that code. Do you want to make use of it, and if so, why?

The tool outputs its version. If you got the new stuff, it should say 0.2-dev at the start of each run.

beerinfinity commented 7 years ago

I ran 'composer update' prior to using Replicator; is there something else I need to do to complete the update?

beerinfinity commented 7 years ago

Ignore that - I need to pull down the new source... late Friday, sorry.

JeroenDeDauw commented 7 years ago

git pull
composer update

beerinfinity commented 7 years ago

It now runs without error - many thanks. Non-ASCII characters are having problems being stored in the 'ts_aliases' and 'ts_labels' tables, but I will look into this. Many thanks once again.
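
In case it helps anyone else: a common cause of mangled non-ASCII text in MySQL is a mismatched connection or table charset rather than the importer itself. Here is a generic sanity check I plan to try, assuming a plain PDO connection (DSN, credentials and database name are placeholders; Replicator's actual connection setup may differ):

    <?php
    // Generic charset check, not Replicator-specific. Using utf8mb4 on both the
    // connection and the tables is usually enough for accented characters and
    // other non-ASCII text.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=replicator;charset=utf8mb4', // charset set on the connection
        'user',
        'password'
    );

    // Inspect how the term store tables were created; latin1 or 3-byte utf8
    // tables may need converting to utf8mb4.
    foreach ( [ 'ts_labels', 'ts_aliases' ] as $table ) {
        $row = $pdo->query( "SHOW TABLE STATUS LIKE '$table'" )->fetch( PDO::FETCH_ASSOC );
        echo $table . ': ' . $row['Collation'] . PHP_EOL;
    }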

JeroenDeDauw commented 7 years ago

Those tables and the code that writes to them are defined in https://github.com/JeroenDeDauw/TermStore