Closed wetneb closed 3 years ago
Thank you for this proposal. It looks very promising!
A maybe relevant thing to look at are the Wikibase extension ChangeOp classes that are responsible of validating and applying changes to entities in the API implementation: https://github.com/wikimedia/mediawiki-extensions-Wikibase/tree/master/repo/includes/ChangeOp
@Tpt makes sense, as the library gets more precise it should eventually converge towards the reference implementation indeed.
The structure of the merging strategies is still not entirely clear to me so I might try to design it in the scope of OR, by keeping the classes independent from the rest of the code base so that they can be easily reused (once we have a better understanding of how these things should work).
The structure of the merging strategies is still not entirely clear to me so I might try to design it in the scope of OR
That looks like a good idea to me. And when you are going to be happy with the design we could move them to WDTK.
Wikimedia Germany is planning to implement a REST API for Wikibase: https://www.wikidata.org/wiki/Wikidata:REST_API It might make the WDTK editing part much easier
@Tpt @wetneb I have successfully integrated WDTK in my code and added a layer of extension code to cover missing functionality. Now I am looking for ways to upstream all these extensions. I am considering an API similar to the one proposed above, so I am posting this here.
Every edit would start with a builder:
public abstract class EntityUpdateBuilder {
public static EntityUpdateBuilder fromId(EntityIdValue id) { /* ... */ }
public static EntityUpdateBuilder fromDocument(EntityDocument document) { /* ... */ }
public abstract EntityUpdate build() { /* ... */ }
}
public abstract class StatementUpdateBuilder extends EntityUpdateBuilder {
public static StatementUpdateBuilder fromId(EntityIdValue id) { /* ... */ }
public static StatementUpdateBuilder fromDocument(StatementDocument document) { /* ... */ }
public void addStatement(Statement statement) { /* ... */ }
public void replaceStatement(Statement statement) { /* ... */ }
public void removeStatement(String id) { /* ... */ }
@Override public abstract StatementUpdate build() { /* ... */ }
}
public class LexemeUpdateBuilder extends StatementUpdateBuilder {
public static LexemeUpdateBuilder fromId(LexemeIdValue id) { /* ... */ }
public static LexemeUpdateBuilder fromDocument(LexemeDocument document) { /* ... */ }
public void setLanguage(ItemIdValue language) { /* ... */ }
public void setLexicalCategory(ItemIdValue category) { /* ... */ }
public void setLemma(MonolingualTextValue lemma) { /* ... */ }
public void removeLemma(String languageCode) { /* ... */ }
public void addForm(FormDocument form) { /* ... */ }
public void updateForm(FormUpdate update) { /* ... */ }
public void removeForm(FormIdValue id) { /* ... */ }
@Override public LexemeUpdate build() { /* ... */ }
}
I am omitting a lot of the API for brevity (items, senses, properties). This basic API can be extended with convenience methods and fluent chaining, but I want to focus on more important design issues here:
setX
, addX
, removeX
, etc. methods can be smarter about checks and other things than plain withX
/withoutX
.mergeX
methods, because merging logic is inherently application-specific.LexemeEditor
does not try to do everything.removeStatement
will check the statement exists while addStatement
will strip statement ID.StatementUpdate
.Method build()
then creates update object:
public interface EntityUpdate {
EntityIdValue getEntityId();
EntityDocument getCurrentDocument();
}
public interface StatementUpdate extends EntityUpdate { // not to be confused with current StatementUpdate
@Override StatementDocument getCurrentDocument();
List<Statement> getAddedStatements();
Map<String, Statement> getReplacedStatements();
Set<String> getRemovedStatements();
}
public interface LexemeUpdate extends StatementUpdate {
@Override LexemeIdValue getEntityId();
@Override LexemeDocument getCurrentDocument();
Optional<ItemIdValue> getLanguage();
Optional<ItemIdValue> getLexicalCategory();
Map<String, MonolingualTextValue> getAssignedLemmas();
Set<String> getRemovedLemmas();
List<FormDocument> getAddedForms();
Map<FormIdValue, FormUpdate> getUpdatedForms();
Set<FormIdValue> getRemovedForms();
}
These objects are only useful for inspection of the planned changes. They have corresponding LexemeUpdateImpl
and similar classes that implement JSON serialization as expected by wbeditentity
. Optional
is returned by some methods per #327.
Finally, WikibaseDataEditor
gets new methods:
public class WikibaseDataEditor {
public LexemeIdValue createNewLexemeDocument(
LexemeDocument document, boolean clear, String summary, List<String> tags)
throws IOException, MediaWikiApiErrorException {
/* ... */
}
public void editLexemeDocument(
LexemeUpdate update, boolean clear, String summary, List<String> tags)
throws IOException, MediaWikiApiErrorException {
/* ... */
}
}
Similar methods will exist for other entity types. Notice that createNewLexemeDocument
returns only ID while editLexemeDocument
does not return anything. That will avoid deserialization issues for lexemes (#437) and properties (#376).
All the new XUpdateBuilder/XUpdate classes would go in wdtk-datamodel module in the same packages as conceptually related XDocument/XDocumentBuilder classes, so that XUpdate objects can be constructed via DataObjectFactory
and Datamodel
.
Some existing classes and methods would be deprecated: StatementUpdate
(the old one), TermStatementUpdate
, createItemDocument
, createPropertyDocument
. editItemDocument
, editPropertyDocument
, editMediaInfoDocument
, updateStatements
, updateTermsStatements
. All of them would have alternatives in the new API. GuidGenerator
and RandomGuidGenerator
would be deprecated without replacement.
Please comment. This is a big change, so there will be a stream of smaller PRs that can be tweaked, but I would like to hear your opinion about the overall design before submitting the first PR.
Hey @robertvazan, this would be absolutely fantastic! I think it totally makes sense to leave the merging logic out of this at first, what you are proposing is a very sensible refactoring that feels like a natural first step.
There is no GUID generator. I don't know why is that supported on current StatementUpdate.
The GUID generator is required when making edits with the wbsetclaim action, which requires the caller to provide a fresh statement GUID if creating a new statement. It is important that this action is used when adding a single statement to an item, because it generates a more informative edit summary than using wbeditentity
.
I have no strong feelings about how the GUID generator should be passed to the editing logic. It could be something covered by your Builder API, or some attribute of the WikibaseDataEditor
instance. But it is important that it stays accessible for testing purposes.
It's unfortunate that the WIkibase API currently requires users to generate these GUIDs client-side with this API action, but until this is fixed we cannot get rid of GUID generation on our side, as far as I can tell.
@wetneb Thanks for the explanation. I think the GUID generator logically belongs in WikibaseDataEditor
, because it is a requirement of the Wikibase API and it is only used in special cases, not in all updates, so update classes shouldn't know about it..
This proposal looks great to me! Thank you!
How about breaking changes? Should I add createNewXDocument
methods and deprecate createXDocument
methods or should I just change return type of existing createXDocument
methods? My original proposal does not break anything (I think), but then WDTK is in 0.x stage and I see that #327 is boldly proposing lots of breaking changes.
It does not seem too hard to keep the existing API as deprecated methods alongside the new one, so I think we should do that.
Notice that createNewLexemeDocument returns only ID while editLexemeDocument does not return anything. That will avoid deserialization issues for lexemes (#437) and properties (#376).
These issues are going to be fixed. So, it might make sense to return the new entity objects there too.
@Tpt Thanks for the heads-up. I will make all methods return full document.
Just a quick update. I was a bit busy during the last few weeks, but I am working on this. Current version is here:
https://github.com/robertvazan/Wikidata-Toolkit
The fork mostly works. I am already using the new update classes to edit lexemes on Wikidata. There are a few issues left that need to be resolved before I can send PR.
Editing of lexemes as well as other entity types is fully implemented. I have a test app that I used to manually test every possible edit operation. Statement editing on senses works too. The only remaining regression from current statement editing code (aside from merging, which was deliberately left out), is reducing wbeditentity
calls to one of the specialized Wikibase APIs (e.g. wbsetclaim
) when the update is simple, which is the typical use case. I am running into some issues there.
In order to return updated entity even when calling specialized Wikibase APIs, current code in StatementUpdate
and TermStatementUpdate
modifies the base revision document. But that is not possible with the new edit API, which allows blind edits where base revision of the entity is not provided. There are a few ways to resolve this problem:
wbeditentity
for everything.XUpdateBuilder
objects.wbgetentities
.What do you think WDTK should do? 2 is easiest to implement, followed by 1 and 4. 1 has perfect performance, 2 adds extra bandwidth, 3 and 4 add extra network roundtrip. All except 2 result in nice edit messages. 2 and 4 provide the most complete API, followed by 1 and 3.
option | complexity | performance | summaries | API |
---|---|---|---|---|
1 | high | perfect | nice | less complete |
2 | trivial | extra bandwidth | ugly | complete |
3 | highest | possibly extra roundtrip | nice | less complete |
4 | high | extra roundtrip | nice | complete |
There's no perfect solution. 1 and 4 seem to be the least problematic. 1 favors performance while 4 favors API completeness. What are your preferences?
I'd say 1. : leave it to users whether they want to do an extra HTTP call to fetch the new entity. They might not need it so better not force them to pay that price. Also wbgetentities
supports fetching multiple entities at once so it would be a waste to use it for a single entity after each edit.
I would add a note about practical implementation in apps. Updated entity returned by wbeditentity
cannot be trusted in case of concurrent edits and also because wbeditentity
may succeed in performing the edit but fail to return updated entity (say due to temporary network failure). Robust caching of entities will therefore ignore result of wbeditentity
and perform separate wbgetentities
call to ensure the cache reflects current state of the database. My app works this way. In light of this observation, option 1 seems superior to all alternatives.
Thank you so much for moving forward with this! I also agree that option 1 is the best. Thank you again!
@wetneb @Tpt Can this be closed as mostly done, perhaps with new issues for specific missing features?
Wikidata-Toolkit's editing features are nice but fairly limited at the moment. Arbitrary changes can be made to items and properties, but few API actions are supported (which means that the edit summaries are not as meaningful as they should be). And now that lexemes are around, we should also allow editing for these.
My PR #323 was one step in this direction but I don't think it will scale very well. I think it is time to step back and redesign the architecture of this editing module.
As a user, at the moment, the go-to entry point for editing is
WikibaseDataEditor
, and more specifically methods like this: https://github.com/Wikidata/Wikidata-Toolkit/blob/8cd4fbb5e3a9014d06e84bd8ea2e2817e8146aea/wdtk-wikibaseapi/src/main/java/org/wikidata/wdtk/wikibaseapi/WikibaseDataEditor.java#L437-L444Beyond the lengthy arguments list, this isn't exactly nice to work with. This abstracts away in one single method a lot of different things: all the merging logic (how do you match the added statements to the existing ones on the item?) and the uploading logic (which API actions do you use to actually upload the changes?). Both of these things should be configurable:
There are various objects involved in this process:
EntityDocument
updateTermsStatement
method, that is: a high-level description of the changes to be made, in terms of added / deleted high-level things (statements, terms, forms, …). Importantly, such classes make no reference to the current document for that entity - the data they contain has not been compared with the current document state yet.EntityDocument
, some planned changes on them, and produce the new state of the entity (anEntityDocument
again).EntityPage
which would give easy access to variousEntityDocument
s corresponding to various revisions of the same entity, and compute diffs between them. We could tie that to the Wikibase API to fetch histories of entities.This is quite a lot of work, but once we have this, I think it should give a really nice editing experience. These changes of architecture can be made in a completely non-breaking way (it is easy to maintain the current interface of
WikibaseDataEditor
and propose a more granular access in parallel).I am obviously writing all this with OpenRefine in mind - and I can very well implement this architecture in OpenRefine directly, but I think this is really generic and ought to be available in WDTK.