Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0

Add Lexeme editing #437

Closed wetneb closed 3 years ago

wetneb commented 5 years ago

We now have support for Lexeme entities in the datamodel. We could also support editing these in wdtk-wikibaseapi.

Tpt commented 4 years ago

https://phabricator.wikimedia.org/T202725 and https://phabricator.wikimedia.org/T199896 will make the implementation a bit cumbersome. Fixing these limitations properly in WikibaseLexeme is probably going to be as hard as writing a workaround in Wikidata Toolkit.

wetneb commented 4 years ago

Interesting!

For the first one (T202725) this is related to #376: we should not assume that wbeditentity returns the full entity as this is only the case for Items apparently (not even properties).

For the second one I guess this means we need to implement other API actions, which are not going to be atomic indeed. I think we need to rethink the architecture of this module - I wanted to do that for a long time (#403) but haven't got round to it yet. This is related to my second bullet point in that ticket.

Tpt commented 4 years ago

Indeed, having support for other API actions would be great. But I believe these limitations should also be removed on the WikibaseLexeme side.

thadguidry commented 4 years ago

Also related: T249206 - Serialized statements of Forms and Senses are missing data type fields

The datatype field for each Snak was previously missing from statements within Senses and Forms of a Lexeme. Starting on 26 August 2020, the datatype fields will be present on them. [Reference: Wikidata Project Chat]

62mkv commented 4 years ago

hi team! sorry for a noob question, but.. how does one create a lexeme (with forms) using WDTK? I see that "datamodel" artifact has quite extensive support for Lexemes, but could not find anything in the WikibaseDataEditor. Is it because of this issue? if so, what would be the suggested workaround? Thanks in advance.

wetneb commented 4 years ago

Hi @62mkv, that's correct: editing and creating lexemes is not supported yet in WDTK.

62mkv commented 4 years ago

thanks, @wetneb! I am looking into it now, seeing what is possible as a quick hack to be able to just a) create lexemes with forms or b) add forms to an existing lexeme. Currently the stumbling point for me is an apparent inability to create a FormDocument as a new Form (with "null" id) for a given Lexeme.

@Tpt I see you've added those types, what would be your suggestion on how to resolve it?

PS: there seem to be tools that are capable of what I need, in particular LexData (although they seem to be using different action, "wbladdform") and https://github.com/lucaswerkmeister/tool-lexeme-forms/, but I'd really not want to abandon Java for this... TIA!

Tpt commented 4 years ago

Hi @62mkv!

To create a new Form with the WDTK datamodel, you can use the LexemeDocument.createForm method. This will properly generate a new form identifier and add the form to the lexeme object.

Then, we need to implement saving for lexemes, forms and senses. The wbeditentity API action we use for forms and senses is a bit limited (cf. the discussion above). If you are familiar with PHP, the easiest way to go is probably to just fix the MediaWiki WikibaseLexeme extension. If not, maybe some hacks with the existing API actions (wbladdform...) might do the job.

62mkv commented 4 years ago

thanks @Tpt ! I guess that would cover my use-case №1 (create and add forms) but how do I get LexemeDocument, if I have an L-id already?

I see that I can use WbGetEntitiesAction to get an EntityDocument but how to obtain a proper LexemeDocument out of that?

62mkv commented 4 years ago

> This will properly generate a new form identifier and add the form to the lexeme object.

by the way, javadoc on that method says

    /**
     * Creates a new {@link FormDocument} for this lexeme.
     * The form is not added to the {@link LexemeDocument} object,
     * it should be done with {@link LexemeDocument#withForm}.
     */

Tpt commented 4 years ago

> I see that I can use WbGetEntitiesAction to get an EntityDocument but how to obtain a proper LexemeDocument out of that?

You could just cast using the usual (LexemeDocument).

> by the way, javadoc on that method says

Indeed, my bad.

62mkv commented 4 years ago

Cool! and by the way, if I try to call LexemeDocument.createForm on a not-yet added lexeme, it throws an exception

java.lang.IllegalArgumentException: The string L0-F1 is not a valid form id

    at org.wikidata.wdtk.datamodel.implementation.FormIdValueImpl.<init>(FormIdValueImpl.java:65)

so, it seems like there's no easy way to create a lexeme AND add forms in a single wbeditentity hop... I'll try with WbGet now
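
The exception above is consistent with form IDs having to look like L<n>-F<m> with positive numbers: a lexeme that has not been saved yet has no real L-id, so the placeholder L0 produces L0-F1. A minimal sketch of such a check follows. The regular expression is an assumption (the actual pattern inside FormIdValueImpl is not shown in this thread), and FormIdCheck is a hypothetical name, not part of WDTK.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WDTK: approximates a form ID check.
final class FormIdCheck {
    // Assumed pattern: "L" + positive integer, then "-F" + positive integer.
    private static final Pattern FORM_ID = Pattern.compile("L[1-9]\\d*-F[1-9]\\d*");

    static boolean isValidFormId(String id) {
        // matches() requires the whole string to fit the pattern
        return id != null && FORM_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFormId("L1358-F1")); // prints true
        System.out.println(isValidFormId("L0-F1"));    // prints false
    }
}
```

Under this assumption, a form can only receive a valid ID once its lexeme has a real L-id, which matches the observation that creating the lexeme and its forms needs more than one step here.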

62mkv commented 4 years ago

so, with this code:

        LexemeDocument existingLexeme = (LexemeDocument) wikibaseDataFetcher.getEntityDocument("L1358");
        FormDocument formDocument = existingLexeme.createForm(
                Collections.singletonList(Datamodel.makeMonolingualTextValue("aprils", LANGUAGE_CODE)),
                Collections.singletonList(getItemIdForTestWikidata("Q42"))
        );

        LexemeDocument withForm = existingLexeme.withForm(formDocument);

        LexemeDocument result = wikibaseDataEditor
                .createLexemeDocument(withForm, "Adding form to existing lexeme", null);

i am getting this request string:

summary=Adding form to existing lexeme&new=lexeme&maxlag=5&data={"type":"lexeme","id":"L1358","lexicalCategory":"Q212131","language":"Q208912","lemmas":{"en":{"language":"en","value":"april"}},"claims":{},"forms":[{"id":"L1358-F1","representations":{"en":{"language":"en","value":"aprils"}},"grammaticalFeatures":["Q42"],"claims":{},"lastrevid":533196,"type":"form"}],"senses":[],"lastrevid":533196}&bot=&assert=user&format=json&action=wbeditentity&token

and this MediaWikiApiErrorException:

org.wikidata.wdtk.wikibaseapi.apierrors.MediaWikiApiErrorException: [param-invalid] Invalid field used in call: "id", must match id parameter

is it problem with my code, the WDTK unreadiness, or Wikidata API problem? I can't tell :( to me, request content looks legit. it correctly shows lemma, lexeme id, form with features..

UPD: aha, so, looking at the documentation for wbeditentity (https://www.wikidata.org/w/api.php?action=help&modules=wbeditentity), it seems as though the id parameter is missing. will look into why that might happen
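
The error fits how wbeditentity selects its target: an existing entity is addressed with the id parameter, while new=lexeme asks for a fresh entity, so a payload that already carries "id":"L1358" conflicts with new=lexeme. A minimal sketch of choosing between the two, using only the JDK. EditEntityParams and its build method are hypothetical names, not WDTK API, and the token, bot and maxlag parameters are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of WDTK.
final class EditEntityParams {
    /**
     * Builds the wbeditentity parameters that select the target entity.
     * A null entityId means "create a new entity of the given type".
     */
    static Map<String, String> build(String entityId, String entityType,
                                     String jsonData, String summary) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "wbeditentity");
        if (entityId == null) {
            // Creation: the data must not carry an "id" of its own.
            params.put("new", entityType);
        } else {
            // Edit: an "id" inside the data has to match this parameter.
            params.put("id", entityId);
        }
        params.put("data", jsonData);
        params.put("summary", summary);
        params.put("format", "json");
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> edit =
                build("L1358", "lexeme", "{\"id\":\"L1358\"}", "Adding form to existing lexeme");
        System.out.println(edit.keySet());
    }
}
```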

62mkv commented 4 years ago

dang, and if I mess with WikibaseDataEditor to edit and not create lexemes, when a new form is given as above, this is what I get from the MediaWiki API:

org.wikidata.wdtk.wikibaseapi.apierrors.MediaWikiApiErrorException: [modification-failed] Lexeme does not have Form with given ID

so apparently you can't add forms with wbeditentity, dammit...

Tpt commented 4 years ago

> so apparently you can't add forms with wbeditentity, dammit...

Yes, sadly. The Wikibase API for form and sense editing is currently in an unfinished state.

62mkv commented 4 years ago

yep. I've just tried to hack on FormDocument yet again, so that payload for wbeditentity looked like this:

{"type":"lexeme","id":"L1358","lexicalCategory":"Q212131","language":"Q208912","lemmas":{"en":{"language":"en","value":"april"}},"claims":{},"forms":[{"representations":{"en":{"language":"en","value":"aprils"}},"grammaticalFeatures":["Q42"],"claims":{},"lastrevid":533196,"type":"form"}],"senses":[],"lastrevid":533196}

and MediaWiki even gives "OK"-ish response:

 {"entity":{"claims":{},"id":"L1358","type":"lexeme","lastrevid":533196,"nochange":""},"success":1}

but still, nothing seems to be added to WD Lexeme at all. In fact, I can't even find any traces of this request execution on "test.wikidata.org" at all.. is it yet another bug of Wikibase API? .. meh

PS: does "nochange": "" in the response indicate that the wiki engine considered my request a no-op, and might that explain why I am not seeing any logs of it?
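
The Wikibase API appears to use nochange to flag an edit that produced no new revision, which would fit the unchanged lastrevid (533196) in the response above. A crude sketch of detecting this on the client side, assuming that reading of the flag. It scans the raw response text with a regular expression instead of a JSON parser to stay within the JDK, and NoChangeCheck is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WDTK.
final class NoChangeCheck {
    // Looks for a "nochange" key anywhere in the raw JSON text.
    // A real client would parse the JSON and inspect entity.nochange instead.
    private static final Pattern NOCHANGE_KEY = Pattern.compile("\"nochange\"\\s*:");

    static boolean looksLikeNoOp(String responseJson) {
        return NOCHANGE_KEY.matcher(responseJson).find();
    }

    public static void main(String[] args) {
        String response = "{\"entity\":{\"claims\":{},\"id\":\"L1358\",\"type\":\"lexeme\","
                + "\"lastrevid\":533196,\"nochange\":\"\"},\"success\":1}";
        System.out.println(looksLikeNoOp(response)); // prints true
    }
}
```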

62mkv commented 4 years ago

Hooooooy! I've managed to both create lexeme with forms and to add forms to existing lexeme. The key was this nugget: https://github.com/nyurik/lexicator/blob/master/lexicator/lexemer/LexemeParserState.py#L182 (thanks to @nyurik for help)!

the code is super-ugly but at least I should be able to progress with this.
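
The linked line is not quoted in this thread. One approach that is understood to work with wbeditentity on lexemes is to mark a new form with an "add": "" entry instead of an id. The sketch below builds such a payload with plain string concatenation. Treat the exact field set (including "claims":[]) as an assumption to verify against test.wikidata.org; NewFormJson is a hypothetical name, not WDTK API.

```java
// Hypothetical helper, not part of WDTK.
final class NewFormJson {
    // Minimal JSON string quoting for this sketch (backslashes and quotes only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // JSON for one new form: no "id", but an "add" marker (assumed convention).
    static String newForm(String language, String representation, String grammaticalFeature) {
        return "{\"add\":\"\","
                + "\"representations\":{" + quote(language)
                + ":{\"language\":" + quote(language)
                + ",\"value\":" + quote(representation) + "}},"
                + "\"grammaticalFeatures\":[" + quote(grammaticalFeature) + "],"
                + "\"claims\":[]}";
    }

    // Wraps the form in the "forms" array of the data parameter.
    static String lexemeEdit(String formJson) {
        return "{\"forms\":[" + formJson + "]}";
    }

    public static void main(String[] args) {
        System.out.println(lexemeEdit(newForm("en", "aprils", "Q42")));
    }
}
```

The resulting string would go into the data parameter of a wbeditentity call that addresses the lexeme through its id parameter.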

robertvazan commented 3 years ago

Hello. What's the status of lexeme editing? I have a private lexeme editing library that is in some ways more capable than WDTK and in others less capable. I am at a crossroads, choosing between a major upgrade to my private library and switching to WDTK and upgrading it with a series of smaller pull requests.

WDTK will mostly work for me. I have only encountered the following issues:

I can send pull requests for the first two issues, but the third one is a deal-breaker. Why is lexeme editing in a branch for so long? Is it seriously broken? When is it going to be merged? Why wasn't it merged already?

The other thing I am thinking about is the editing API. #403 is overkill for my use case. Ideally, I would prefer to just have mutable entities and have an API that computes diff from original entity and modified one and then writes the diff. But at the moment the whole model is immutable. Bare diff API is nevertheless good enough, although the updateStatements method is begging for a builder class. That can be done with a PR too.
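
The diff idea can be illustrated on the simplest part of an entity, a language-to-text map such as lemmas. This is only a sketch under that simplification: TermDiff is a hypothetical class, and a real implementation would also have to cover statements, forms and senses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of WDTK.
final class TermDiff {
    /**
     * Compares two language -> text maps.
     * Result: language -> new text for added or changed terms,
     *         language -> null for removed terms.
     */
    static Map<String, String> diff(Map<String, String> original, Map<String, String> modified) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : modified.entrySet()) {
            if (!e.getValue().equals(original.get(e.getKey()))) {
                changes.put(e.getKey(), e.getValue()); // added or changed
            }
        }
        for (String language : original.keySet()) {
            if (!modified.containsKey(language)) {
                changes.put(language, null); // removed
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, String> before = new LinkedHashMap<>();
        before.put("en", "april");
        before.put("de", "April");
        Map<String, String> after = new LinkedHashMap<>();
        after.put("en", "April");
        after.put("fr", "avril");
        System.out.println(diff(before, after)); // prints {de=null, en=April, fr=avril}
    }
}
```

Only the entries in the returned map would need to be written back, which is the "write the diff" step described above.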

wetneb commented 3 years ago

Hi @robertvazan, thanks for offering to contribute on this!

Personally, I was not aware of the lexeme-editing branch at all. If this branch has been useful to you and you don't see any big issue about it, then you could open a pull request for it, potentially adding any further changes you have made on your side. I think it would be very welcome and I would be keen to review it.

Let's also ping the author @Tpt.

robertvazan commented 3 years ago

@wetneb I haven't started using WDTK yet. Can I just ignore the branch then and submit PRs to master?

wetneb commented 3 years ago

If you did not use this branch yourself, then yes it's fine to submit PRs based on master. But it could be worth waiting a bit for @Tpt to understand why this branch was left unmerged.

Tpt commented 3 years ago

I have not merged this branch because it is still buggy. Indeed a few features are still missing in WikibaseLexeme to be able to use the wbeditentity API just like we do on items and properties:

  1. It is not possible to edit sense as part of a lexeme edit: https://phabricator.wikimedia.org/T199896
  2. The JSON result with the new version of the lexeme is incomplete and sometimes wrong: https://phabricator.wikimedia.org/T202725 https://phabricator.wikimedia.org/T200255 https://phabricator.wikimedia.org/T271105

Feel free to ignore my branch or take the relevant bits from it and integrate your own code.

robertvazan commented 3 years ago

@Tpt WDTK can just implement the Wikibase API to the extent it is implemented in Wikibase itself. Known unsupported request features can be detected and terminated with an exception before they hit the network. Incomplete responses can either be mapped to incomplete WDTK objects, or additional read requests can be made. This can all be documented. This way WDTK can expose the available APIs to the maximum extent possible.
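
A minimal sketch of the "fail before the network" idea. The set of unsupported parts is an assumption for illustration (sense statements, as reported for wbeditentity elsewhere in this thread), and LexemeEditGuard and its Part enum are hypothetical names, not WDTK API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of WDTK.
final class LexemeEditGuard {
    enum Part { LEMMAS, LEXEME_STATEMENTS, FORMS, FORM_STATEMENTS, SENSES, GLOSSES, SENSE_STATEMENTS }

    // Assumption for this sketch: only sense statements are unsupported server-side.
    private static final Set<Part> UNSUPPORTED = EnumSet.of(Part.SENSE_STATEMENTS);

    // Throws before any request is sent if the edit touches an unsupported part.
    static void check(Set<Part> touchedParts) {
        for (Part part : touchedParts) {
            if (UNSUPPORTED.contains(part)) {
                throw new UnsupportedOperationException(
                        "Editing " + part + " is not supported by the API yet");
            }
        }
    }

    public static void main(String[] args) {
        check(EnumSet.of(Part.LEMMAS, Part.FORMS)); // passes silently
        try {
            check(EnumSet.of(Part.SENSE_STATEMENTS));
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Keeping the unsupported set in one place would make it easy to shrink as the server-side limitations are fixed.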

Tpt commented 3 years ago

@robertvazan That would be great! If you could implement it, it would be amazing!

robertvazan commented 3 years ago

Just FYI: I have tested wbeditentity on test Wikidata and most of the lexeme can be edited. The only exception is sense statements. Senses themselves (addition/removal) and their glosses are editable though. Editing of forms and senses works both directly via form/sense ID and via lexeme except for the mentioned sense statements. The returned JSON is indeed incomplete. It is only useful to obtain lexeme ID.

There are some inconsistencies in editing various parts of the lexeme. The following procedures were tested to work.

Lemma

Language and lexical category

Lexeme statements

Qualifiers and references

These cannot be edited on their own. They are part of the statement. Modifying the statement without repeating qualifiers and references will delete them.
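
One way to avoid losing qualifiers and references is to derive every modified statement from the existing one, so that they are carried over and serialized again. A minimal sketch with a simplified stand-in type; StatementSketch is hypothetical and much simpler than the WDTK Statement class, with qualifiers and references reduced to strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a statement; not the WDTK Statement class.
final class StatementSketch {
    final String value;
    final List<String> qualifiers;
    final List<String> references;

    StatementSketch(String value, List<String> qualifiers, List<String> references) {
        this.value = value;
        this.qualifiers = Collections.unmodifiableList(qualifiers);
        this.references = Collections.unmodifiableList(references);
    }

    // Changes only the main value; qualifiers and references are copied over
    // so that the serialized statement repeats them.
    StatementSketch withValue(String newValue) {
        return new StatementSketch(newValue, qualifiers, references);
    }

    public static void main(String[] args) {
        StatementSketch before = new StatementSketch(
                "Q1", Arrays.asList("P580=2020"), Arrays.asList("P854=https://example.org"));
        StatementSketch after = before.withValue("Q2");
        System.out.println(after.value + " " + after.qualifiers + " " + after.references);
    }
}
```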

Forms

Form representations

Like lemmas.

Grammatical features

Form statements

Like lexeme statements, just nested under form in JSON.

Senses

Like forms.

Glosses

Like lemmas.

Sense statements

Not supported. All edits are ignored.

Auregann commented 3 years ago

Hello there, I just wanted to let you know that we fixed the issue that was preventing Senses and statements from being edited through wbeditentity (T199896), which we hope will help tool maintainers to support Lexemes. We would of course love to see Wikidata Toolkit supporting Lexemes, as it would help to increase and diversify the base of tools for editing Lexemes :)

If you have questions, issues or requests, feel free to contact me (not on this account as it's my personal one, rather at lea.lacroix@wikimedia.de) Thanks!