MaRDI4NFDI / portal-compose

docker-composer repo for mardi
https://portal.mardi4nfdi.de
GNU General Public License v3.0

WDQS-Updater does not get correct entity data #445

Closed physikerwelt closed 5 months ago

physikerwelt commented 6 months ago

Describe the bug: When editing a Wikibase item, the change is not written to Blazegraph.

19:29:39.125 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 1 changes, from Q1798943@11130557@20231218192931|9909574 to Q1798943@11130557@20231218192931|9909574

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943, https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706, https://portal.mardi4nfdi.de/entityQ1798943, https://portal.mardi4nfdi.de/entitystatement/Q1798943-30F56AFA-9818-4C33-8A83-22B8F56C6E08, https://portal.mardi4nfdi.de/entitystatement/Q1798943-A03BACD8-332F-4072-8123-D1B89ABBBF66] while processing https://portal.mardi4nfdi.de/entity/Q1798943.  Expected only sitelinks and subjects starting with https://portal.mardi4nfdi.de/wiki/Special:EntityData/ and [https://portal.mardi4nfdi.de/entity/]

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://www.w3.org/1999/02/22-rdf-syntax-ns#type o:http://schema.org/Dataset

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://schema.org/about o:https://portal.mardi4nfdi.de/entityQ1798943

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://creativecommons.org/ns#license o:http://creativecommons.org/publicdomain/zero/1.0/

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://schema.org/softwareVersion o:"1.0.0"^^<http://www.w3.org/2001/XMLSchema#string>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://schema.org/version o:"11130557"^^<http://www.w3.org/2001/XMLSchema#integer>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://schema.org/dateModified o:"2023-12-18T19:29:31Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://wikiba.se/ontology#statements o:"3"^^<http://www.w3.org/2001/XMLSchema#integer>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://wikiba.se/ontology#sitelinks o:"1"^^<http://www.w3.org/2001/XMLSchema#integer>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943 p:http://wikiba.se/ontology#identifiers o:"2"^^<http://www.w3.org/2001/XMLSchema#integer>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706 p:http://www.w3.org/1999/02/22-rdf-syntax-ns#type o:http://wikiba.se/ontology#Statement

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706 p:http://www.w3.org/1999/02/22-rdf-syntax-ns#type o:http://wikiba.se/ontology#BestRank

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706 p:http://wikiba.se/ontology#rank o:http://wikiba.se/ontology#NormalRank

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706 p:https://portal.mardi4nfdi.de/entityprop/statement/P12 o:"Q42883470"^^<http://www.w3.org/2001/XMLSchema#string>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:http://www.w3.org/1999/02/22-rdf-syntax-ns#type o:http://wikiba.se/ontology#Item

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/direct/P31 o:https://portal.mardi4nfdi.de/entityQ57162

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/direct/P676 o:"schubotz.moritz"^^<http://www.w3.org/2001/XMLSchema#string>

19:29:39.200 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/direct/P12 o:"Q42883470"^^<http://www.w3.org/2001/XMLSchema#string>

19:29:39.201 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/P31 o:https://portal.mardi4nfdi.de/entitystatement/Q1798943-A03BACD8-332F-4072-8123-D1B89ABBBF66

19:29:39.201 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/P676 o:https://portal.mardi4nfdi.de/entitystatement/Q1798943-30F56AFA-9818-4C33-8A83-22B8F56C6E08

19:29:39.201 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:https://portal.mardi4nfdi.de/entityQ1798943 p:https://portal.mardi4nfdi.de/entityprop/P12 o:https://portal.mardi4nfdi.de/entitystatement/Q1798943-6b949c35-40ed-f13c-f98e-e069b2c34706

19:29:39.201 [update 1] INFO  o.wikidata.query.rdf.tool.rdf.Munger - More than 20 unrecognized statements, further statements not logged.

19:29:39.201 [update 1] WARN  org.wikidata.query.rdf.tool.Updater - Contained error syncing.  Giving up on Q1798943

org.wikidata.query.rdf.tool.exception.ContainedException: Didn't get a revision id for [(https://portal.mardi4nfdi.de/wiki/Person:1798943, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://schema.org/Article), (https://portal.mardi4nfdi.de/wiki/Person:1798943, http://schema.org/about, https://portal.mardi4nfdi.de/entityQ1798943), (https://portal.mardi4nfdi.de/wiki/Person:1798943, http://schema.org/inLanguage, "en"^^<http://www.w3.org/2001/XMLSchema#string>), (https://portal.mardi4nfdi.de/wiki/Person:1798943, http://schema.org/isPartOf, https://portal.mardi4nfdi.de/), (https://portal.mardi4nfdi.de/wiki/Person:1798943, http://schema.org/name, "Person:1798943"@en), (https://portal.mardi4nfdi.de/, http://wikiba.se/ontology#wikiGroup, "mathematics"^^<http://www.w3.org/2001/XMLSchema#string>)]

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:818)

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:413)

    at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:144)

    at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:415)

    at org.wikidata.query.rdf.tool.Updater.lambda$fetchDataFromWikibaseAndMunge$7(Updater.java:283)

    at java.util.concurrent.FutureTask.run(FutureTask.java:266)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

19:29:39.286 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2023-12-18T19:29:31Z at (0.0, 0.0, 0.0) updates per second and (1.5, 74.8, 144.1) milliseconds per second

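The cause is that the prefixes in the RDF data are incorrect: as the log above shows, entity URIs are emitted without the slash after "entity" (https://portal.mardi4nfdi.de/entityQ1798943 instead of https://portal.mardi4nfdi.de/entity/Q1798943), and the EntityData subjects carry an extra "mardi:" prefix. Check out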

https://portal.mardi4nfdi.de/wiki/Special:EntityData/Q1798943.rdf
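The Munger messages in the log amount to a prefix test on every subject URI. Below is a minimal Python sketch of that idea. It is not the WDQS code (the real Munger also accepts sitelinks and handles more cases), and the function name `unrecognized_subjects` is made up for illustration. The two prefixes are the ones quoted in the "Expected only ..." message.

```python
# Illustrative only: a simplified stand-in for the updater's subject check,
# not the actual WDQS Munger implementation.
ENTITY_PREFIX = "https://portal.mardi4nfdi.de/entity/"
ENTITY_DATA_PREFIX = "https://portal.mardi4nfdi.de/wiki/Special:EntityData/"


def unrecognized_subjects(subjects):
    """Return the subjects that start with neither expected prefix."""
    expected = (ENTITY_PREFIX, ENTITY_DATA_PREFIX)
    return [s for s in subjects if not s.startswith(expected)]


subjects = [
    # Taken from the log: extra "mardi:" before Special:EntityData
    "https://portal.mardi4nfdi.de/wiki/mardi:Special:EntityData/Q1798943",
    # Taken from the log: missing slash between "entity" and the ID
    "https://portal.mardi4nfdi.de/entityQ1798943",
    # What the updater expects
    "https://portal.mardi4nfdi.de/entity/Q1798943",
]
bad = unrecognized_subjects(subjects)
print(bad)
```

Only the third URI passes, which matches the two kinds of rejected subjects in the log.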

Expected behavior: The change should be written to Blazegraph.

To reproduce:

  1. Make a change like this one https://portal.mardi4nfdi.de/w/index.php?title=Item:Q1798943&action=history
  2. Investigate the logs of the wdqs-updater container

eloiferrer commented 6 months ago

Also in local development, starting with an empty instance, the updater fails to synchronize when an item is created, returning:

08:39:41.432 [update 0] WARN org.wikidata.query.rdf.tool.Updater - Contained error syncing. Giving up on Q1

org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects: [http://wikibase.svc:80/entity/Q1, http://wikibase.svc:80/wiki/Special:EntityData/Q1]. Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and [http://wikibase.svc/entity/]

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:800)

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:413)

    at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:144)

    at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:436)

    at org.wikidata.query.rdf.tool.Updater.lambda$fetchDataFromWikibaseAndMunge$7(Updater.java:286)

    at java.util.concurrent.FutureTask.run(FutureTask.java:266)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
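In this local case the rejected subjects differ from the expected prefixes only by the explicit default port `:80`. The sketch below shows why a plain string-prefix comparison fails on that, and how dropping a default port would make the two forms match. It is for diagnosis only, not a proposed change to the updater. The helper name `strip_default_port` is hypothetical, and it ignores user-info parts of a URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme; an explicit default port is redundant in a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(uri):
    """Return uri without an explicit default port (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Rebuild the URL with the bare host name as the network location.
        return urlunsplit(
            (parts.scheme, parts.hostname, parts.path, parts.query, parts.fragment)
        )
    return uri


expected_prefix = "http://wikibase.svc/entity/"
subject = "http://wikibase.svc:80/entity/Q1"  # from the log above

print(subject.startswith(expected_prefix))
print(strip_default_port(subject).startswith(expected_prefix))
```

The first comparison fails and the second succeeds, so the RDF output and the updater's expected prefix have to agree on whether the port is written out.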

physikerwelt commented 6 months ago

With the following Wikibase settings file this problem goes away for me:


<?php
## Wikibase

wfLoadExtension( 'WikibaseClient', "$IP/extensions/Wikibase/extension-client.json" );
require_once "$IP/extensions/Wikibase/client/ExampleSettings.php";

$wikibaseHost = getenv('WIKIBASE_HOST');

if ($wikibaseHost === 'localhost') {
    $portalHost = getenv('WIKIBASE_SCHEME') . '://localhost:' . getenv('WIKIBASE_PORT');
} else {
    $portalHost = getenv('WIKIBASE_SCHEME') . '://'. $wikibaseHost;
}

# enable linking between wikibase and content pages
$wgWBRepoSettings['siteLinkGroups'] = [ 'mathematics' ];
$wgWBClientSettings['siteLinkGroups'] = [ 'mathematics' ];
$wgWBClientSettings['siteGlobalID'] = 'mardi';
$wgWBClientSettings['repoUrl'] = $portalHost;
$wgWBClientSettings['repoScriptPath'] = '/w';
$wgWBClientSettings['repoArticlePath'] = '/wiki/$1';
$wgWBClientSettings['entitySources'] = [
        'mardi_source' => [
                'repoDatabase' => 'my_wiki',
                'baseUri' => $portalHost . '/entity',
                'entityNamespaces' => [
                        'item' => 120,
                        'property' => 122,
                ],
                'rdfNodeNamespacePrefix' => 'wd',
                'rdfPredicateNamespacePrefix' => '',
                'interwikiPrefix' => 'mardi',
        ],
];
$wgWBClientSettings['itemAndPropertySourceName'] = 'mardi_source';
// my_wiki is the MaRDI database
$wgLocalDatabases = [ 'wiki_swmath', 'my_wiki' ];

// https://github.com/MaRDI4NFDI/portal-compose/issues/224
$wgNamespacesToBeSearchedDefault[122] = true; // WB_PROPERTY_NAMESPACE===122

if ( $wgDBname !== 'wiki_swmath' ){

    wfLoadExtension( 'WikibaseRepository', "$IP/extensions/Wikibase/extension-repo.json" );
    // from https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/f2bd35609b6bf3f8d38ef8c78d2f340497906706/repo/includes/RepoHooks.php#L170C1-L180C61
    $wgExtraNamespaces[120] = 'Item';
    $wgExtraNamespaces[121] = 'Item_talk';
    $wgExtraNamespaces[122] = 'Property';
    $wgExtraNamespaces[123] = 'Property_talk';
    // do not declare namespaces if that would be done by default https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/933906 https://phabricator.wikimedia.org/T291617
    $wgWBRepoSettings['defaultEntityNamespaces'] = false;
    $wgWBRepoSettings['entitySources'] = [
        'mardi_source' => [
            'repoDatabase' => 'my_wiki',
            'baseUri' => 'http://wikibase.svc/entity/',
            'entityNamespaces' => [
                'item' => 120,
                'property' => 122,
            ],
            'rdfNodeNamespacePrefix' => 'wd',
            'rdfPredicateNamespacePrefix' => '',
            'interwikiPrefix' => 'mardi',
        ],
    ];
    $wgWBRepoSettings['localEntitySourceName'] = 'mardi_source';
    $wgWBRepoSettings['localClientDatabases'] = [
        'mardi' => 'my_wiki',
        'swmath' => 'wiki_swmath'
    ];
    // insert site with
    // php addSite.php --filepath=https://portal.mardi4nfdi.de/w/\$1 --pagepath=https://portal.mardi4nfdi.de/wiki/\$1 --language en --interwiki-id mardi mardi mathematics
    // php addSite.php --filepath=https://staging.swmath.org/w/\$1 --pagepath=https://staging.swmath.org/wiki/\$1 --language en --interwiki-id swmath swmath mathematics

    # Pingback
    $wgWBRepoSettings['wikibasePingback'] = false;

    # Increase string size limits
    $wgWBRepoSettings['string-limits'] = [
        'VT:string' => [
            'length' => 200000,
        ],
        'multilang' => [
            'length' => 2000,
        ],
        'VT:monolingualtext' => [
            'length' => 1000,
        ],
    ];
}
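For reference, here is a small Python sketch of how the host and `baseUri` values in this file compose. It assumes that an entity URI is the base URI followed directly by the entity ID with no separator added. That is an assumption, though it is consistent with the `entityQ1798943` subjects in the log at the top of this issue. The function names and the example environment values are hypothetical.

```python
def portal_host(scheme, host, port):
    """Mirror of the $portalHost logic: only localhost gets an explicit port."""
    if host == "localhost":
        return f"{scheme}://localhost:{port}"
    return f"{scheme}://{host}"


def entity_uri(base_uri, entity_id):
    # Assumption: plain concatenation, so the base URI must end with "/".
    return base_uri + entity_id


# Example values standing in for WIKIBASE_SCHEME, WIKIBASE_HOST, WIKIBASE_PORT.
host = portal_host("https", "portal.mardi4nfdi.de", "443")

without_slash = entity_uri(host + "/entity", "Q1798943")
with_slash = entity_uri(host + "/entity/", "Q1798943")
print(without_slash)
print(with_slash)
```

Under that assumption, a `baseUri` without the trailing slash yields the malformed form seen in the log, while one ending in `/entity/` yields the prefix the updater expects.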

physikerwelt commented 6 months ago

One can get the RDF output from the CLI (bypassing the cache) via:

```
root@f87c36c7a418:/var/www/html/extensions/Wikibase/repo/maintenance# php dumpRdf.php --first-page-id 1809685 --limit 1 --format rdf --no-cache
```

eloiferrer commented 6 months ago

I've tried your version of Wikibase.php. It partially fixes the problem for me, but I now get this error:

08:06:40.459 [main] INFO o.w.q.r.t.change.RecentChangesPoller - Got 1 changes, from Q1@2@20231220080630|3 to Q1@2@20231220080630|3

08:06:40.691 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [http://wikibase.svc:80/wiki/Special:EntityData/Q1] while processing http://wikibase.svc/entity/Q1. Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and [http://wikibase.svc/entity/]

08:06:40.697 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://www.w3.org/1999/02/22-rdf-syntax-ns#type o:http://schema.org/Dataset

08:06:40.697 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://schema.org/about o:http://wikibase.svc/entity/Q1

08:06:40.697 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://creativecommons.org/ns#license o:http://creativecommons.org/publicdomain/zero/1.0/

08:06:40.697 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://schema.org/softwareVersion o:"1.0.0"^^http://www.w3.org/2001/XMLSchema#string

08:06:40.698 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://schema.org/version o:"2"^^http://www.w3.org/2001/XMLSchema#integer

08:06:40.698 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://schema.org/dateModified o:"2023-12-20T08:06:30Z"^^http://www.w3.org/2001/XMLSchema#dateTime

08:06:40.698 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://wikiba.se/ontology#statements o:"0"^^http://www.w3.org/2001/XMLSchema#integer

08:06:40.698 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://wikiba.se/ontology#sitelinks o:"0"^^http://www.w3.org/2001/XMLSchema#integer

08:06:40.698 [update 0] INFO o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized statement: s:http://wikibase.svc:80/wiki/Special:EntityData/Q1 p:http://wikiba.se/ontology#identifiers o:"0"^^http://www.w3.org/2001/XMLSchema#integer

08:06:40.699 [update 0] WARN org.wikidata.query.rdf.tool.Updater - Contained error syncing. Giving up on Q1

org.wikidata.query.rdf.tool.exception.ContainedException: Didn't get a revision id for [(http://wikibase.svc/entity/Q1, http://www.w3.org/2000/01/rdf-schema#label, "test"@en)]

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:818)

    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:413)

    at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:144)

    at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:436)

    at org.wikidata.query.rdf.tool.Updater.lambda$fetchDataFromWikibaseAndMunge$7(Updater.java:286)

    at java.util.concurrent.FutureTask.run(FutureTask.java:266)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

eloiferrer commented 5 months ago

The problem still exists locally when using the latest version of docker-wikibase together with the Wikibase.php shown above. I assume it will reappear in production once we switch from the current frozen state back to the latest version of docker-wikibase.

physikerwelt commented 5 months ago

For me, this was hard to test locally because the problem is related to the URL. I still believe that using all the ports is a lot of effort and creates an environment that is not similar enough to production to draw conclusions. If we could install portainer to function locally, e.g. for http://portal.local, this would be more similar, but the protocol (http) would still differ from production (https). So testing this in a staging environment might be a good solution.

eloiferrer commented 5 months ago

This now works on both staging.mardi4nfdi.org and portal.mardi4nfdi.de. For the local development environment, it has been solved with https://github.com/MaRDI4NFDI/portal-compose/pull/470