MaRDI4NFDI / portal-compose

docker-composer repo for mardi
https://portal.mardi4nfdi.de
GNU General Public License v3.0

How to model links between software and articles #491

Closed: physikerwelt closed this issue 4 months ago

physikerwelt commented 4 months ago

Describe the issue: Should there be a bidirectional connection between software and articles? Should the describing article (standard article) be linked with a different property?

Standard Articles std_sofware.csv

physikerwelt commented 4 months ago

Other links between software and articles: other_software.csv, ca. 500k links (might be too big for WDQS, @Daniel-Mietchen)

physikerwelt commented 4 months ago

For the standard articles, I propose "described by source" (https://github.com/codemeta/codemeta/pull/351), which is equivalent to https://www.wikidata.org/wiki/Property:P1343

physikerwelt commented 4 months ago

@LizzAlice could you please help me get the MaRDI client to upload the large list? My current attempt is

https://github.com/MaRDI4NFDI/mardiclient/compare/main...physikerwelt:mardiclient:csv_playground?expand=1

the argument is

import_csv.py https://github.com/MaRDI4NFDI/portal-compose/files/14303414/other_software.csv

I tried with a bot password for my username and also with a dedicated account with the bot flag.

Whatever I try, after about 1000 successfully created items I see the following error


  File "/usr/local/lib/python3.12/site-packages/wikibaseintegrator/wbi_helpers.py", line 129, in mediawiki_api_call
    raise MWApiError(json_data['error'])
wikibaseintegrator.wbi_exceptions.MWApiError: 'You do not have the "bot" right, so the action could not be completed.'
{'article': 'Q5021044', 'software': 'Q5974590'} 'You do not have the "bot" right, so the action could not be completed.'

for each entry.

In addition, items that have double redirects (authors) cannot be edited. However, this affects only a few items.

physikerwelt commented 4 months ago

By the way, I created a dedicated box using Portainer to run the script independently of my computer. The box is called sofwareLinker. You can log in to that box to investigate the problem. I exported the username and password and then ran the following command


root@7427805896fe:/mardiclient# python import_csv.py https://github.com/MaRDI4NFDI/portal-compose/files/14303414/other_software.csv

physikerwelt commented 4 months ago

PS: The double-redirect error reads


Error while writing to the Wikibase instance
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/wikibaseintegrator/entities/baseentity.py", line 243, in _write
    json_result: dict = edit_entity(data=data, id=entity_id, type=self.type, summary=summary, clear=clear, is_bot=is_bot, allow_anonymous=allow_anonymous,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/wikibaseintegrator/wbi_helpers.py", line 333, in edit_entity
    return mediawiki_api_call_helper(data=params, is_bot=is_bot, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/wikibaseintegrator/wbi_helpers.py", line 215, in mediawiki_api_call_helper
    return mediawiki_api_call('POST', mediawiki_api_url=mediawiki_api_url, session=session, data=data, headers=headers, max_retries=max_retries, retry_after=retry_after, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/wikibaseintegrator/wbi_helpers.py", line 129, in mediawiki_api_call
    raise MWApiError(json_data['error'])
wikibaseintegrator.wbi_exceptions.MWApiError: '[f2d86520fe3ddc77707810ca] Exception caught: Unresolved redirect from Q1760199 to Q878049'
{'article': 'Q5889352', 'software': 'Q22135'} '[f2d86520fe3ddc77707810ca] Exception caught: Unresolved redirect from Q1760199 to Q878049'
^CTraceback (most recent call last):

LizzAlice commented 4 months ago

Yes, the double redirect error is the one I am fixing right now with my redirection script. You should probably wait until that is finished to run yours. As to the bot rights error, I am getting that too and also don't know why. Previously, that might have happened once or twice, but then worked again, but now that problem has not stopped since Friday.

physikerwelt commented 4 months ago

This is now running. When it's done the edit count of the importer should be slightly above 500k edits.

physikerwelt commented 4 months ago

26.2.24 12:10 -> 12 514 (edits)

physikerwelt commented 4 months ago

Failed

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='portal.mardi4nfdi.de', port=443): Max retries exceeded with url: /w/api.php (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f55417e15e0>: Failed to resolve 'portal.mardi4nfdi.de' ([Errno -5] No address associated with hostname)"))

physikerwelt commented 4 months ago

new run 27.2.24 08:23 17 115 (edits)

physikerwelt commented 4 months ago

It seems the error handling in the MaRDI importer is not really suitable for long-running processes, @LizzAlice?

Now I'm getting the following error:



Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.12/site-packages/urllib3/response.py", line 1033, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/response.py", line 953, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/response.py", line 851, in _raw_read
    with self._error_catcher():
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/local/lib/python3.12/site-packages/urllib3/response.py", line 754, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(163218 bytes read, 7758921 more expected)', IncompleteRead(163218 bytes read, 7758921 more expected))

During handling of the above exception, another exception occurred:

LizzAlice commented 4 months ago

I don't know, I never got that error. Also I am doing script-side error handling
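As an illustration of what script-side error handling for a long-running import could look like, here is a minimal sketch of a retry wrapper with exponential backoff. It is not the MaRDI client's actual code: `with_retries` and `flaky_write` are hypothetical names, and which exception types count as transient is an assumption the caller has to make (for the failures in this thread that would presumably be `requests.exceptions.ConnectionError`, which is not a subclass of Python's built-in `ConnectionError`).

```python
import time


def with_retries(action, max_attempts=5, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call action() and retry on the given exception types.

    The wait doubles after every failed attempt (1s, 2s, 4s, ...).
    The last error is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical stand-in for a single write to the wiki:
# it fails twice with a simulated outage, then succeeds.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"


# sleep is replaced by a no-op so the demo does not actually wait.
print(with_retries(flaky_write, sleep=lambda seconds: None))
```

Wrapping each single write this way means a short DNS or network outage costs a few retries instead of aborting the whole run.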

physikerwelt commented 4 months ago

> I don't know, I never got that error. Also I am doing script-side error handling

Thank you for the prompt reply. The import process seems to be too slow to fetch the CSV file from GitHub.

LizzAlice commented 4 months ago

Hm, then maybe you could download it?

physikerwelt commented 4 months ago

Googling for the error message suggests one should use a job queue :-) I'll just load the 30 MB into memory...
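A minimal sketch of the "load it into memory first" idea, using only the standard library. The point is that the HTTP connection is closed before the slow import loop starts, so the server cannot drop it halfway through. The function names are hypothetical, and the column names `article` and `software` are an assumption based on the dictionaries printed in the error messages above, not a confirmed description of the CSV file.

```python
import csv
import io
import urllib.request


def download_text(url, timeout=60):
    """Fetch the whole response body at once and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_links(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Small in-memory sample standing in for the downloaded file
# (header names are assumed, see above).
sample = "article,software\nQ5021044,Q5974590\nQ5889352,Q22135\n"
for link in parse_links(sample):
    print(link["article"], "->", link["software"])
```

In a real run, `parse_links(download_text(url))` would be called once up front, and the import loop would then iterate over the resulting list.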

physikerwelt commented 4 months ago

28.2. 11:43 -> 131,323 edits
28.2. 12:11 -> 135,840 edits

Thus, it will take two days before this is completed.
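The estimate can be checked from the two readings above with a short calculation. This is only a sketch: it assumes a constant edit rate, the year 2024 (taken from the "26.2.24" reading earlier in the thread), and a target of 500,000 edits as a stand-in for the "ca 500k" links; `estimate_completion` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def estimate_completion(t0, edits0, t1, edits1, target):
    """Extrapolate linearly from two (time, edit count) readings.

    Returns the edit rate per hour, the remaining hours, and the
    estimated time at which `target` edits would be reached.
    """
    hours_elapsed = (t1 - t0).total_seconds() / 3600
    rate_per_hour = (edits1 - edits0) / hours_elapsed
    remaining_hours = (target - edits1) / rate_per_hour
    return rate_per_hour, remaining_hours, t1 + timedelta(hours=remaining_hours)


rate, remaining, eta = estimate_completion(
    datetime(2024, 2, 28, 11, 43), 131_323,
    datetime(2024, 2, 28, 12, 11), 135_840,
    target=500_000,  # assumed round figure for "ca 500k"
)
print(f"{rate:.0f} edits/hour, {remaining:.1f} hours left, done around {eta:%d.%m. %H:%M}")
```

Under these assumptions the rate is roughly 9,700 edits per hour, which leaves a bit more than a day and a half of runtime and puts the finish in the early hours of 1.3.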

physikerwelt commented 4 months ago

1.3. 2:15 -> 507,965 edits

I think we are done!