hughrun opened this issue 3 years ago
Hi @hughrun,
Auto-matching isn't a good idea. In some languages the pages are named differently, so if you only change the third-level domain name you get wrong links.
But you can, starting with one URL, automatically get all the others. With a little bit of code you can construct an API call that returns a list of language links.
This Python code transforms a page URL into a URL for calling the MediaWiki API:
from urllib.parse import urlparse, urlunparse

def apiUrlFromPageUrl(pageUrl: str) -> str:
    # Build the API endpoint (https://<lang>.wikipedia.org/w/api.php) from a page URL.
    parts = urlparse(pageUrl)
    return urlunparse((parts.scheme, parts.netloc, "w/api.php", "", "", ""))

def pageNameFromPageUrl(pageUrl: str) -> str:
    # The page title is the last path segment, e.g. "J._R._R._Tolkien".
    path = urlparse(pageUrl).path
    return path.split("/").pop()

def queryLanguageLinksFromPageUrl(pageUrl: str) -> str:
    # See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Blanglinks for the API documentation,
    # e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&titles=J._R._R._Tolkien&redirects=1&lllimit=max
    queryString = "action=query&prop=langlinks&titles={0}&maxlag=200&redirects=&llprop=url&lllimit=max&format=json"
    apiUrl = urlparse(apiUrlFromPageUrl(pageUrl))
    queryUrl = apiUrl._replace(query=queryString.format(pageNameFromPageUrl(pageUrl)))
    return urlunparse(queryUrl)

pageUrl = "https://de.wikipedia.org/wiki/J._R._R._Tolkien"
print(queryLanguageLinksFromPageUrl(pageUrl))
The result is a URL like this: https://de.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=J._R._R._Tolkien&maxlag=200&redirects=&llprop=url&lllimit=max&format=json
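To show what you would actually do with that URL, here is a hedged continuation of the snippet above (the fetchLanguageLinks helper is mine, not from the original comment; it uses only the standard library):

import json
from urllib.request import urlopen

def fetchLanguageLinks(pageUrl: str) -> list:
    # Call the query URL built by queryLanguageLinksFromPageUrl above and
    # flatten the "langlinks" arrays out of the JSON response.
    with urlopen(queryLanguageLinksFromPageUrl(pageUrl)) as response:
        data = json.load(response)
    links = []
    for page in data.get("query", {}).get("pages", {}).values():
        links.extend(page.get("langlinks", []))
    return links

# With llprop=url each entry carries "lang", "url", and the localized title in "*".
for link in fetchLanguageLinks("https://de.wikipedia.org/wiki/J._R._R._Tolkien"):
    print(link["lang"], link["url"])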
My apologies @sebastiansIT - my comment above wasn't very clear.
You are correct that simply swapping out the Wikipedia subdomain wouldn't be very smart. What I'm suggesting is that we can identify the subdomain for any Wikipedia links found in a given author's ISNI record. These links are known to be the correct link in a given language so we don't need to iterate through Wikipedia itself.
To stick with your example, this is the ISNI record for J.R.R. Tolkien.
You can see there are multiple Wikipedia links in the XML, like this:
<externalInformation>
  <information>Wikipedia</information>
  <URI>https://de.wikipedia.org/wiki/J._R._R._Tolkien</URI>
</externalInformation>
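As an illustration, pulling those links out of the record (keyed by the language subdomain) could look like this sketch; the element names come from the snippet above, but the real ISNI response may nest or namespace them differently:

from urllib.parse import urlparse
import xml.etree.ElementTree as ET

def wikipedia_links_from_isni_xml(xml_text: str) -> dict:
    # Collect <URI> values from <externalInformation> blocks whose
    # <information> element says "Wikipedia", keyed by language subdomain.
    links = {}
    root = ET.fromstring(xml_text)
    for info in root.iter("externalInformation"):
        if info.findtext("information") == "Wikipedia":
            uri = info.findtext("URI", "")
            if uri:
                lang = urlparse(uri).netloc.split(".")[0]  # "de" from de.wikipedia.org
                links[lang] = uri
    return links

record = """<record>
  <externalInformation>
    <information>Wikipedia</information>
    <URI>https://de.wikipedia.org/wiki/J._R._R._Tolkien</URI>
  </externalInformation>
</record>"""
print(wikipedia_links_from_isni_xml(record))  # {'de': 'https://de.wikipedia.org/wiki/J._R._R._Tolkien'}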
In #1581 referenced above, we query ISNI when searching for authors as part of the book editing function. There has also been some discussion of and work on ways to refresh data from external sources either on an ad hoc basis or via a regular script. This could be a way to collect all possible Wikipedia links at the same time, and save them to the database in a format that would allow for automatic display of the most relevant link for a given display language.
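As a hedged sketch of what such a format and "most relevant link" lookup might look like (the mapping shape and function name are illustrative, not from the BookWyrm codebase): store one URL per language code and fall back gracefully.

# Hypothetical stored shape: {"en": "https://en.wikipedia.org/wiki/J._R._R._Tolkien",
#                             "de": "https://de.wikipedia.org/wiki/J._R._R._Tolkien", ...}
def best_wikipedia_url(links: dict, display_language: str):
    # Exact match on the interface language first, e.g. "de".
    if display_language in links:
        return links[display_language]
    # Then the primary subtag, e.g. "pt" for "pt-BR".
    primary = display_language.split("-")[0]
    if primary in links:
        return links[primary]
    # Fall back to English, then to any available link.
    return links.get("en") or next(iter(links.values()), None)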
OK, I see what you mean. So there are two situations:
1. We have an ISNI ID for the author.
2. We only have a single Wikipedia URL.
In the first case we can use the ISNI ID to get Wikipedia URLs from the ISNI database, and in the second case we can use the Wikipedia API to get more Wikipedia URLs.
That sounds good to me
Wikipedia uses Wikidata as its multilingual database.
For example, J. R. R. Tolkien's Wikidata item is https://www.wikidata.org/wiki/Q892. There you can see its corresponding Wikipedia article links in English, Japanese, Chinese, and other languages.
It would be easier if BookWyrm could allow users to fill in Wikidata links directly, and then BookWyrm would fetch the Wikipedia links in the various languages.
Wikidata also records the item's corresponding links to Goodreads, LibraryThing, etc. Wikidata is an openly editable and accessible database.
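For reference, fetching those per-language article links from a Wikidata ID can be done with the documented wbgetentities call; the helper below is only a sketch (the function name and the rough sitelink filter are mine):

import json
from urllib.request import urlopen

def wikipedia_links_from_wikidata(qid: str) -> dict:
    # Ask Wikidata for the entity's sitelinks, including full article URLs.
    url = (
        "https://www.wikidata.org/w/api.php"
        "?action=wbgetentities&props=sitelinks/urls"
        f"&ids={qid}&format=json"
    )
    with urlopen(url) as response:
        data = json.load(response)
    sitelinks = data["entities"][qid]["sitelinks"]
    # Sitelink keys look like "enwiki", "dewiki", "jawiki"; map them to
    # language code -> URL. (Rough filter: other sister projects besides
    # "commonswiki" would need excluding in real code.)
    return {
        key[:-4]: link["url"]
        for key, link in sitelinks.items()
        if key.endswith("wiki") and key != "commonswiki"
    }

print(wikipedia_links_from_wikidata("Q892").get("ja"))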
PR #3275 adds a new wikidata field, in case this is needed for the suggested implementation.
While trying to understand how BookWyrm retrieves information from Inventaire to populate author and book fields, I realised something that might seem obvious to most, but not to me because I didn't know of Inventaire until today: Inventaire actually retrieves data from Wikidata, and the Wikidata ID can be extracted from Inventaire's URL. I don't know whether that means the wikidata field is redundant, or that we should try to pull data from Inventaire, if present.
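For illustration (a hedged sketch; Inventaire entity pages for Wikidata-backed items use URIs of the form wd:Q892):

import re

def wikidata_id_from_inventaire_url(url: str):
    # Inventaire entity URLs for Wikidata-backed items look like
    # https://inventaire.io/entity/wd:Q892 -- pull out the Q-identifier.
    match = re.search(r"/entity/wd:(Q\d+)", url)
    return match.group(1) if match else None

print(wikidata_id_from_inventaire_url("https://inventaire.io/entity/wd:Q892"))  # Q892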
Not everything is on Inventaire so I think it makes sense to make it a separate field.
Excellent! In that case, I think PR #3275 is ready for review (albeit a very humble one: it simply creates a text field labelled "wikidata" and displays it on the author's profile, nothing else).
I wouldn't know how to implement the rest of the features in this issue yet.
Is your feature request related to a problem? Please describe.
The wikipedia_url for authors is a single text field. There are, however, many Wikipedias rather than a single one. This means the language version of Wikipedia chosen by the last person to edit an author page is the one displayed to everyone, which may not be all that useful if you do not read in that language.
Describe the solution you'd like
If we make the wikipedia url field a list/array rather than a single varchar field, we could potentially match the link to the interface language dynamically. Wikipedia subdomains are generally aligned with the IANA language codes:
...etc
This means we could match the Wikipedia version to the interface language, where possible, and otherwise provide a fallback. ISNI often provides multiple Wikipedia URLs for a single author, so we could pull this data in from there.
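As a hedged sketch of what the schema change could look like (BookWyrm is a Django app; the existing field name is real, the new field and its shape are illustrative only):

from django.db import models

class Author(models.Model):
    # Existing single-URL field (simplified here).
    wikipedia_url = models.CharField(max_length=255, blank=True)
    # Proposed: one URL per language code,
    # e.g. {"en": "https://en.wikipedia.org/wiki/...", "de": "..."}.
    wikipedia_links = models.JSONField(default=dict, blank=True)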
Describe alternatives you've considered
Keeping the current functionality.
Additional context
Somewhat related to #1581