Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0
372 stars 100 forks source link

[discussion] new architecture for Wikibase editing #403

Closed wetneb closed 3 years ago

wetneb commented 5 years ago

Wikidata-Toolkit's editing features are nice but fairly limited at the moment. Arbitrary changes can be made to items and properties, but few API actions are supported (which means that the edit summaries are not as meaningful as they should be). And now that lexemes are around, we should also allow editing for these.

My PR #323 was one step in this direction but I don't think it will scale very well. I think it is time to step back and redesign the architecture of this editing module.

As a user, at the moment, the go-to entry point for editing is WikibaseDataEditor, and more specifically methods like this: https://github.com/Wikidata/Wikidata-Toolkit/blob/8cd4fbb5e3a9014d06e84bd8ea2e2817e8146aea/wdtk-wikibaseapi/src/main/java/org/wikidata/wdtk/wikibaseapi/WikibaseDataEditor.java#L437-L444

Beyond the lengthy arguments list, this isn't exactly nice to work with. This abstracts away in one single method a lot of different things: all the merging logic (how do you match the added statements to the existing ones on the item?) and the uploading logic (which API actions do you use to actually upload the changes?). Both of these things should be configurable:

There are various objects involved in this process:

This is quite a lot of work, but once we have this, I think it should give a really nice editing experience. These changes of architecture can be made in a completely non-breaking way (it is easy to maintain the current interface of WikibaseDataEditor and propose a more granular access in parallel).

I am obviously writing all this with OpenRefine in mind - and I can very well implement this architecture in OpenRefine directly, but I think this is really generic and ought to be available in WDTK.

Tpt commented 5 years ago

Thank you for this proposal. It looks very promising!

A maybe relevant thing to look at are the Wikibase extension ChangeOp classes that are responsible of validating and applying changes to entities in the API implementation: https://github.com/wikimedia/mediawiki-extensions-Wikibase/tree/master/repo/includes/ChangeOp

wetneb commented 5 years ago

@Tpt makes sense, as the library gets more precise it should eventually converge towards the reference implementation indeed.

The structure of the merging strategies is still not entirely clear to me so I might try to design it in the scope of OR, by keeping the classes independent from the rest of the code base so that they can be easily reused (once we have a better understanding of how these things should work).

Tpt commented 5 years ago

The structure of the merging strategies is still not entirely clear to me so I might try to design it in the scope of OR

That looks like a good idea to me. And when you are going to be happy with the design we could move them to WDTK.

Tpt commented 3 years ago

Wikimedia Germany is planning to implement a REST API for Wikibase: https://www.wikidata.org/wiki/Wikidata:REST_API It might make the WDTK editing part much easier

robertvazan commented 3 years ago

@Tpt @wetneb I have successfully integrated WDTK in my code and added a layer of extension code to cover missing functionality. Now I am looking for ways to upstream all these extensions. I am considering an API similar to the one proposed above, so I am posting this here.

Every edit would start with a builder:

public abstract class EntityUpdateBuilder {
    public static EntityUpdateBuilder fromId(EntityIdValue id) { /* ... */ }
    public static EntityUpdateBuilder fromDocument(EntityDocument document) { /* ... */ }
    public abstract EntityUpdate build() { /* ... */ }
}
public abstract class StatementUpdateBuilder extends EntityUpdateBuilder {
    public static StatementUpdateBuilder fromId(EntityIdValue id) { /* ... */ }
    public static StatementUpdateBuilder fromDocument(StatementDocument document) { /* ... */ }
    public void addStatement(Statement statement) { /* ... */ }
    public void replaceStatement(Statement statement) { /* ... */ }
    public void removeStatement(String id) { /* ... */ }
    @Override public abstract StatementUpdate build() { /* ... */ }
}
public class LexemeUpdateBuilder extends StatementUpdateBuilder {
    public static LexemeUpdateBuilder fromId(LexemeIdValue id) { /* ... */ }
    public static LexemeUpdateBuilder fromDocument(LexemeDocument document) { /* ... */ }
    public void setLanguage(ItemIdValue language) { /* ... */ }
    public void setLexicalCategory(ItemIdValue category) { /* ... */ }
    public void setLemma(MonolingualTextValue lemma) { /* ... */ }
    public void removeLemma(String languageCode) { /* ... */ }
    public void addForm(FormDocument form) { /* ... */ }
    public void updateForm(FormUpdate update) { /* ... */ }
    public void removeForm(FormIdValue id) { /* ... */ }
    @Override public LexemeUpdate build() { /* ... */ }
}

I am omitting a lot of the API for brevity (items, senses, properties). This basic API can be extended with convenience methods and fluent chaining, but I want to focus on more important design issues here:

Method build() then creates update object:

public interface EntityUpdate {
    EntityIdValue getEntityId();
    EntityDocument getCurrentDocument();
}
public interface StatementUpdate extends EntityUpdate { // not to be confused with current StatementUpdate
    @Override StatementDocument getCurrentDocument();
    List<Statement> getAddedStatements();
    Map<String, Statement> getReplacedStatements();
    Set<String> getRemovedStatements();
}
public interface LexemeUpdate extends StatementUpdate {
    @Override LexemeIdValue getEntityId();
    @Override LexemeDocument getCurrentDocument();
    Optional<ItemIdValue> getLanguage();
    Optional<ItemIdValue> getLexicalCategory();
    Map<String, MonolingualTextValue> getAssignedLemmas();
    Set<String> getRemovedLemmas();
    List<FormDocument> getAddedForms();
    Map<FormIdValue, FormUpdate> getUpdatedForms();
    Set<FormIdValue> getRemovedForms();
}

These objects are only useful for inspection of the planned changes. They have corresponding LexemeUpdateImpl and similar classes that implement JSON serialization as expected by wbeditentity. Optional is returned by some methods per #327.

Finally, WikibaseDataEditor gets new methods:

public class WikibaseDataEditor {
    public LexemeIdValue createNewLexemeDocument(
            LexemeDocument document, boolean clear, String summary, List<String> tags)
            throws IOException, MediaWikiApiErrorException {
        /* ... */
    }
    public void editLexemeDocument(
            LexemeUpdate update, boolean clear, String summary, List<String> tags)
            throws IOException, MediaWikiApiErrorException {
        /* ... */
    }
}

Similar methods will exist for other entity types. Notice that createNewLexemeDocument returns only ID while editLexemeDocument does not return anything. That will avoid deserialization issues for lexemes (#437) and properties (#376).

All the new XUpdateBuilder/XUpdate classes would go in wdtk-datamodel module in the same packages as conceptually related XDocument/XDocumentBuilder classes, so that XUpdate objects can be constructed via DataObjectFactory and Datamodel.

Some existing classes and methods would be deprecated: StatementUpdate (the old one), TermStatementUpdate, createItemDocument, createPropertyDocument. editItemDocument, editPropertyDocument, editMediaInfoDocument, updateStatements, updateTermsStatements. All of them would have alternatives in the new API. GuidGenerator and RandomGuidGenerator would be deprecated without replacement.

447 would be rejected. #437, #403, and #376 would be resolved. New issue would track editing of sense statements.

Please comment. This is a big change, so there will be a stream of smaller PRs that can be tweaked, but I would like to hear your opinion about the overall design before submitting the first PR.

wetneb commented 3 years ago

Hey @robertvazan, this would be absolutely fantastic! I think it totally makes sense to leave the merging logic out of this at first, what you are proposing is a very sensible refactoring that feels like a natural first step.

There is no GUID generator. I don't know why is that supported on current StatementUpdate.

The GUID generator is required when making edits with the wbsetclaim action, which requires the caller to provide a fresh statement GUID if creating a new statement. It is important that this action is used when adding a single statement to an item, because it generates a more informative edit summary than using wbeditentity.

I have no strong feelings about how the GUID generator should be passed to the editing logic. It could be something covered by your Builder API, or some attribute of the WikibaseDataEditor instance. But it is important that it stays accessible for testing purposes.

It's unfortunate that the WIkibase API currently requires users to generate these GUIDs client-side with this API action, but until this is fixed we cannot get rid of GUID generation on our side, as far as I can tell.

robertvazan commented 3 years ago

@wetneb Thanks for the explanation. I think the GUID generator logically belongs in WikibaseDataEditor, because it is a requirement of the Wikibase API and it is only used in special cases, not in all updates, so update classes shouldn't know about it..

Tpt commented 3 years ago

This proposal looks great to me! Thank you!

robertvazan commented 3 years ago

How about breaking changes? Should I add createNewXDocument methods and deprecate createXDocument methods or should I just change return type of existing createXDocument methods? My original proposal does not break anything (I think), but then WDTK is in 0.x stage and I see that #327 is boldly proposing lots of breaking changes.

wetneb commented 3 years ago

It does not seem too hard to keep the existing API as deprecated methods alongside the new one, so I think we should do that.

Tpt commented 3 years ago

Notice that createNewLexemeDocument returns only ID while editLexemeDocument does not return anything. That will avoid deserialization issues for lexemes (#437) and properties (#376).

These issues are going to be fixed. So, it might make sense to return the new entity objects there too.

robertvazan commented 3 years ago

@Tpt Thanks for the heads-up. I will make all methods return full document.

robertvazan commented 3 years ago

Just a quick update. I was a bit busy during the last few weeks, but I am working on this. Current version is here:

https://github.com/robertvazan/Wikidata-Toolkit

The fork mostly works. I am already using the new update classes to edit lexemes on Wikidata. There are a few issues left that need to be resolved before I can send PR.

robertvazan commented 3 years ago

Editing of lexemes as well as other entity types is fully implemented. I have a test app that I used to manually test every possible edit operation. Statement editing on senses works too. The only remaining regression from current statement editing code (aside from merging, which was deliberately left out), is reducing wbeditentity calls to one of the specialized Wikibase APIs (e.g. wbsetclaim) when the update is simple, which is the typical use case. I am running into some issues there.

In order to return updated entity even when calling specialized Wikibase APIs, current code in StatementUpdate and TermStatementUpdate modifies the base revision document. But that is not possible with the new edit API, which allows blind edits where base revision of the entity is not provided. There are a few ways to resolve this problem:

  1. Do not return updated entity from edit operations.
  2. Sacrifice nice edit messages and use wbeditentity for everything.
  3. Prohibit blind edits and require base entity document to create XUpdateBuilder objects.
  4. Fetch the new document after performing the edit via wbgetentities.

What do you think WDTK should do? 2 is easiest to implement, followed by 1 and 4. 1 has perfect performance, 2 adds extra bandwidth, 3 and 4 add extra network roundtrip. All except 2 result in nice edit messages. 2 and 4 provide the most complete API, followed by 1 and 3.

option complexity performance summaries API
1 high perfect nice less complete
2 trivial extra bandwidth ugly complete
3 highest possibly extra roundtrip nice less complete
4 high extra roundtrip nice complete

There's no perfect solution. 1 and 4 seem to be the least problematic. 1 favors performance while 4 favors API completeness. What are your preferences?

wetneb commented 3 years ago

I'd say 1. : leave it to users whether they want to do an extra HTTP call to fetch the new entity. They might not need it so better not force them to pay that price. Also wbgetentities supports fetching multiple entities at once so it would be a waste to use it for a single entity after each edit.

robertvazan commented 3 years ago

I would add a note about practical implementation in apps. Updated entity returned by wbeditentity cannot be trusted in case of concurrent edits and also because wbeditentity may succeed in performing the edit but fail to return updated entity (say due to temporary network failure). Robust caching of entities will therefore ignore result of wbeditentity and perform separate wbgetentities call to ensure the cache reflects current state of the database. My app works this way. In light of this observation, option 1 seems superior to all alternatives.

Tpt commented 3 years ago

Thank you so much for moving forward with this! I also agree that option 1 is the best. Thank you again!

robertvazan commented 3 years ago

@wetneb @Tpt Can this be closed as mostly done, perhaps with new issues for specific missing features?