Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0

Duplicate entries in item document #564

Closed jjkoehorst closed 3 years ago

jjkoehorst commented 3 years ago

While testing on a local Wikibase instance, I added property information to an item using wbde.updateStatements and, later on, wbde.updateTermsStatements. Either method adds the new statement to the instance even if a statement with exactly the same value is already stored on the item page.

Is there a way to automatically merge duplicate entries, so that when I add a property/value pair that already exists on the item with the same value, the statement is not added a second time?

Also, the documentation of the functions I tested states:

Updates the terms and statements of the item document identified by the given item id. The updates are computed with respect to the current data found online, making sure that no redundant deletions or duplicate insertions happen. The references of duplicate statements will be merged. The labels and aliases in a given language are kept distinct.

This indicates that the duplication issue should not occur. Perhaps I am missing an additional argument?

        // Test 1
        ItemDocument itemDocument = wbde.updateStatements(itemIdValue, statements, Collections.emptyList(), "Genome information update", null);
        // Test 2
        wbde.updateTermsStatements(itemIdValue, Collections.emptyList(), Collections.emptyList(), Collections.emptyList(), Collections.emptyList(), statements, Collections.emptyList(), "Genome sync update", Collections.emptyList());

wetneb commented 3 years ago

Hi Jasper, normally the merge functionality described in the javadoc should still work; that's something people use routinely in OpenRefine. I suspect the problem you have could be due to various things:

  1. The value you might be pushing might be visibly identical to the one stored in Wikibase, but there might be subtle differences that are not shown in the UI (for instance, the calendar URI for a date, the upper/lower bounds for a quantity, some invisible character for a string). It would be good to know for which datatype you are witnessing this problem.
  2. There might be some issues that are specific to the use of a third-party Wikibase instance (i.e. not Wikidata) - perhaps the comparison between the data values fails because their "siteIRI" is mismatching. How do you create the data values you are pushing to Wikidata? With which siteIRI? Is it the same as the one you pass to the editing module?
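
The first point can be illustrated with plain Java value types, independently of Wikidata Toolkit. The sketch below is only an analogy: the class and method names are made up for illustration, and it does not claim to reproduce how the library compares quantities or strings. It shows two values that print almost identically but are not equal under strict equality.

```java
import java.math.BigDecimal;

public class HiddenDifferences {

    /** Strict equality: BigDecimal.equals also compares the scale. */
    public static boolean sameQuantity(BigDecimal a, BigDecimal b) {
        return a.equals(b);
    }

    /** Numeric comparison that ignores the scale. */
    public static boolean sameNumber(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the issue.
        BigDecimal stored = new BigDecimal("4600000");
        BigDecimal pushed = new BigDecimal("4600000.0");

        System.out.println(sameQuantity(stored, pushed)); // false: scales differ
        System.out.println(sameNumber(stored, pushed));   // true: same number

        // A string carrying an invisible zero-width space (U+200B).
        String plain = "genome";
        String withZeroWidthSpace = "genome\u200B";
        System.out.println(plain.equals(withZeroWidthSpace)); // false
    }
}
```

If a comparison of this kind is the cause, printing both the stored and the pushed value in full (including scale, bounds, and raw characters) is usually the quickest way to spot the difference.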

Note that more generally I would really like to work on this merging functionality and make it much more configurable: #403 (but that's only relevant to you if the issue is due to point 1 above).

jjkoehorst commented 3 years ago

I am indeed using a Wikibase instance that runs locally, so not Wikidata. Is there a way to point all references to the local instance instead of Wikidata?

wetneb commented 3 years ago

How do you create the data values you are pushing to Wikidata? With which siteIRI? Is it the same as the one you pass to the editing module?

jjkoehorst commented 3 years ago

I am not pushing anything to Wikidata; the authentication is performed against a local instance.

jjkoehorst commented 3 years ago

This is what I use to authenticate:


        basicApiConnection = new BasicApiConnection(commandOptions.wiki);
        basicApiConnection.login(commandOptions.username,commandOptions.password);
        wbde = new WikibaseDataEditor(App.basicApiConnection, App.commandOptions.wiki);
        wbdf = new WikibaseDataFetcher(App.basicApiConnection, App.commandOptions.wiki);
        wbea = new WbEditingAction(App.basicApiConnection, App.commandOptions.wiki);
        wbde.setEditAsBot(true);
        wbde.setAverageTimePerEdit(1);

and the wiki option is passed as "-wiki", "http://localhost:8181/w/api.php".

wetneb commented 3 years ago

Ah sorry, I mean how do you create the data values you are pushing to Wikibase? As in, how do you generate the edits?

jjkoehorst commented 3 years ago

Ah, clear. I have an item identifier as a string:

        ItemIdValue itemIdValue = Datamodel.makeWikidataItemIdValue(item);
// Empty list of statements
        ArrayList<Statement> statements = new ArrayList<>();
// Statement with creation function
        statements.add(Statements.create(itemIdValue, "genome size", BigDecimal.valueOf(genome.getGenomeSize()), wdPageLookup.get("European Nucleotide Archive")));

This is the statement creation function, which returns the statement with a BigDecimal value and a "stated in" reference:

    public static Statement create(ItemIdValue itemIdValue, String propertyID, BigDecimal value, String referencePage) {
        // Get property value of genome size
        PropertyIdValue genomeSizeProperty = Datamodel.makeWikidataPropertyIdValue(propertyLookup.get(propertyID).replaceAll(".*/", ""));
        // Set value of genome size to a quantity
        QuantityValue genomeSizeValue = Datamodel.makeQuantityValue(value);
        // Create the reference
        String refPropertyID = propertyLookup.get("stated in").replaceAll(".*/", "");
        PropertyIdValue statedIn = Datamodel.makePropertyIdValue(refPropertyID, App.commandOptions.wiki);
        Reference reference = ReferenceBuilder.newInstance().withPropertyValue(statedIn, Datamodel.makeItemIdValue(referencePage, App.commandOptions.wiki)).build();
        // Create the genome size statement with reference
        Statement statement = StatementBuilder.forSubjectAndProperty(itemIdValue, genomeSizeProperty).withValue(genomeSizeValue).withReference(reference).build(); // .withReference(reference)

        // Return the statement
        return statement;
    }

Hope this helps a bit?

I have just synced it to my repository; the statements.add call can be found at https://gitlab.com/wurssb/wikigroup/wikistrains/-/blob/master/botjava/src/main/java/nl/munlock/wiki/sync/genetics/GenomeSync.java (lines 72 and 73).

wetneb commented 3 years ago

See, the problem is with your first line:

ItemIdValue itemIdValue = Datamodel.makeWikidataItemIdValue(item);

This creates a Wikidata item id value, not one for your Wikibase instance. You should use makeItemIdValue and supply a siteIRI that identifies your instance (and reuse the same one everywhere else). We should probably have some checks to detect that and raise an exception properly when you try to push that to your instance.

(and you'll have to check that you are not calling any Wikidata-specific methods elsewhere, for instance Datamodel.makeWikidataPropertyIdValue)
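
To see why a mismatching site IRI leads to duplicates, here is a small self-contained model. It is a sketch under assumptions, not Wikidata Toolkit code: EntityId is a toy stand-in for an entity id value, and LOCAL_IRI is a hypothetical IRI for a local instance. The point it illustrates is that when an identifier's equality includes the site IRI, the same "Q42" created for Wikidata and for a local Wikibase are different values, so a statement built on one will not match a statement fetched with the other.

```java
import java.util.Objects;

public class SiteIriDemo {

    /** Toy stand-in for an entity id: a local id plus the IRI of its site. */
    static final class EntityId {
        final String id;
        final String siteIri;

        EntityId(String id, String siteIri) {
            this.id = id;
            this.siteIri = siteIri;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof EntityId)) {
                return false;
            }
            EntityId that = (EntityId) other;
            // Both the local id and the site IRI take part in equality.
            return id.equals(that.id) && siteIri.equals(that.siteIri);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, siteIri);
        }
    }

    static final String WIKIDATA_IRI = "http://www.wikidata.org/entity/";
    // Hypothetical IRI for a local instance; use whatever your setup defines.
    static final String LOCAL_IRI = "http://localhost:8181/entity/";

    /** True only if the same id is paired with the same site IRI on both sides. */
    public static boolean sameEntity(String id, String siteA, String siteB) {
        return new EntityId(id, siteA).equals(new EntityId(id, siteB));
    }

    public static void main(String[] args) {
        System.out.println(sameEntity("Q42", WIKIDATA_IRI, LOCAL_IRI)); // false
        System.out.println(sameEntity("Q42", LOCAL_IRI, LOCAL_IRI));    // true
    }
}
```

In the code from this thread, the corresponding change would be to replace Datamodel.makeWikidataItemIdValue(item) and Datamodel.makeWikidataPropertyIdValue(...) with Datamodel.makeItemIdValue(...) and Datamodel.makePropertyIdValue(...), passing the same site IRI that is already used for the reference values.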

jjkoehorst commented 3 years ago

Thanks a lot, this works great! I had multiple occurrences of Wikidata in the code, so that is now cleaned up.