bookwyrm-social / bookwyrm

Social reading and reviewing, decentralized with ActivityPub
http://joinbookwyrm.com/

Make wikipedia url a list rather than a single value #1585

Open hughrun opened 3 years ago

hughrun commented 3 years ago

Is your feature request related to a problem? Please describe. The wikipedia_url for authors is a single text field. There are, however, many Wikipedias rather than a single one. This means the language version of Wikipedia chosen by the last person to edit an author page is the one displayed to everyone, which may not be very useful if you do not read that language.

Describe the solution you'd like If we make the wikipedia url field a list/array rather than a single varchar field, we could potentially match the link to the interface language dynamically. Wikipedia subdomains are generally aligned with the IANA language codes:

...etc

This means we could match the Wikipedia version to the interface language where possible, and otherwise provide a fallback. ISNI often provides multiple Wikipedia URLs for a single author, so we could pull this data in from there.
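As a rough illustration of that matching (the function and field names here are mine, assuming wikipedia_url becomes a list of URLs and that each subdomain is an IANA language code):

from urllib.parse import urlparse

def best_wikipedia_url(urls: list, interface_language: str, fallback: str = "en"):
    # Index the stored URLs by their subdomain, e.g. "de" for de.wikipedia.org
    by_language = {urlparse(url).hostname.split(".")[0]: url for url in urls}
    # Prefer the interface language, then the fallback, then any stored URL
    return (
        by_language.get(interface_language)
        or by_language.get(fallback)
        or (urls[0] if urls else None)
    )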

Describe alternatives you've considered Keeping the current functionality.

Additional context Somewhat related to #1581

sebastiansIT commented 2 years ago

Hi @hughrun,

automatic matching isn't a good idea. In some languages the pages are named differently, so if you only change the third-level domain name you get broken links.

But starting with one URL, you can automatically get all the others. With a little bit of code you can build an API call that returns a list of language links.

This Python code transforms a page URL into a URL for calling the MediaWiki API:


from urllib.parse import urlparse, urlunparse

def apiUrlFromPageUrl(pageUrl: str):
    # Keep the scheme and host of the page URL, but point at the API endpoint
    parts = urlparse(pageUrl)
    return urlunparse((parts.scheme, parts.netloc, "w/api.php", "", "", ""))

def pageNameFromPageUrl(pageUrl: str):
    # The page name is the last segment of the path, e.g. "J._R._R._Tolkien"
    path = urlparse(pageUrl).path
    return path.split("/").pop()

def queryLanguageLinksFromPageUrl(pageUrl: str):
    # See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Blanglinks for API documentation
    # e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&titles=J._R._R._Tolkien&redirects=1&lllimit=max
    queryString = "action=query&prop=langlinks&titles={0}&maxlag=200&redirects=1&llprop=url&lllimit=max&format=json"
    apiUrl = urlparse(apiUrlFromPageUrl(pageUrl))
    queryUrl = apiUrl._replace(query=queryString.format(pageNameFromPageUrl(pageUrl)))
    return urlunparse(queryUrl)

pageUrl = "https://de.wikipedia.org/wiki/J._R._R._Tolkien"
print(queryLanguageLinksFromPageUrl(pageUrl))

The result is a link like this: J.R.R. Tolkien Language Links
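Continuing that snippet, a minimal sketch (the helper name is mine) of fetching that URL with the standard library and collapsing the response into a language-to-URL mapping:

import json
from urllib.request import urlopen

def languageLinksFromPageUrl(pageUrl: str) -> dict:
    # Fetch the query URL built above; with llprop=url each langlink entry
    # carries a "lang" code and a full "url"
    with urlopen(queryLanguageLinksFromPageUrl(pageUrl)) as response:
        data = json.load(response)
    links = {}
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["url"]
    return links

print(languageLinksFromPageUrl("https://de.wikipedia.org/wiki/J._R._R._Tolkien"))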

hughrun commented 2 years ago

My apologies @sebastiansIT - my comment above wasn't very clear.

You are correct that simply swapping out the Wikipedia subdomain wouldn't be very smart. What I'm suggesting is that we can identify the subdomain for any Wikipedia links found in a given author's ISNI record. These are known to be the correct links for their languages, so we don't need to iterate through Wikipedia itself.

To stick with your example, this is the ISNI record for J.R.R. Tolkien.

You can see there are multiple Wikipedia links in the XML, like this:

<externalInformation>
<information>Wikipedia</information>
<URI>https://de.wikipedia.org/wiki/J._R._R._Tolkien</URI>
</externalInformation>
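A minimal sketch of pulling those links out of a record (assuming the un-namespaced element names shown above; real ISNI responses may use XML namespaces):

import xml.etree.ElementTree as ET

def wikipedia_uris_from_isni_xml(xml_text: str) -> list:
    # Keep every URI whose sibling <information> element says "Wikipedia"
    root = ET.fromstring(xml_text)
    return [
        element.findtext("URI")
        for element in root.iter("externalInformation")
        if element.findtext("information") == "Wikipedia" and element.findtext("URI")
    ]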

In #1581 referenced above, we query ISNI when searching for authors as part of the book editing function. There has also been some discussion of and work on ways to refresh data from external sources either on an ad hoc basis or via a regular script. This could be a way to collect all possible Wikipedia links at the same time, and save them to the database in a format that would allow for automatic display of the most relevant link for a given display language.

sebastiansIT commented 2 years ago

Ok, I see what you mean. So there are two situations:

  1. An ISNI ID is available in the Bookwyrm database.
  2. One Wikipedia URL is available in the Bookwyrm database.

In the first case we can use the ISNI ID to get Wikipedia URLs from the ISNI database, and in the second case we can use the Wikipedia API to get more Wikipedia URLs.

That sounds good to me.
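Sketching that decision with the helpers from earlier in the thread (fetch_isni_record and the author fields are hypothetical stand-ins, not existing Bookwyrm code):

def gather_wikipedia_urls(author):
    # Situation 1: an ISNI ID is stored, so read the URLs from the ISNI record
    if author.isni:
        return wikipedia_uris_from_isni_xml(fetch_isni_record(author.isni))
    # Situation 2: only one Wikipedia URL is stored, so expand it via the API
    if author.wikipedia_url:
        return list(languageLinksFromPageUrl(author.wikipedia_url).values())
    return []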

Guanchishan commented 2 years ago

Wikipedia uses Wikidata as its multilingual database.

For example, J. R. R. Tolkien's Wikidata item is https://www.wikidata.org/wiki/Q892. There you can see its corresponding Wikipedia article links in English, Japanese, Chinese and other languages.

It would be easier if Bookwyrm could allow users to fill in Wikidata links directly, and then Bookwyrm would fetch Wikipedia links in various languages.

Wikidata also records links to Goodreads, LibraryThing, etc. corresponding to the item. Wikidata is an openly editable and accessible database.
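A hedged sketch of that fetch against the wbgetentities endpoint (the helper name is mine; the filter on keys ending in "wiki" is approximate, since sitelinks also cover sister projects):

import json
from urllib.request import urlopen

def wikipedia_links_from_wikidata(qid: str) -> dict:
    # props=sitelinks/urls returns every sitelink for the item with a full URL,
    # keyed by site id such as "enwiki" or "dewiki"
    url = (
        "https://www.wikidata.org/w/api.php"
        "?action=wbgetentities&ids={0}&props=sitelinks/urls&format=json"
    ).format(qid)
    with urlopen(url) as response:
        data = json.load(response)
    sitelinks = data["entities"][qid]["sitelinks"]
    return {
        key[:-4]: link["url"]  # "dewiki" -> "de"
        for key, link in sitelinks.items()
        if key.endswith("wiki") and key != "commonswiki"
    }

print(wikipedia_links_from_wikidata("Q892").get("ja"))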

ccamara commented 10 months ago

PR #3275 adds a new wikidata field, in case this is needed for the suggested implementation.

ccamara commented 10 months ago

While trying to understand how bookwyrm retrieves information from inventaire to populate author and book fields, I realised something that might seem obvious to most, but not to me because I didn't know of inventaire until today: inventaire actually retrieves its data from wikidata, and the wikidata id can be retrieved from inventaire's url. I do not know whether that means the wikidata field is redundant, or that we should try to pull data from inventaire if it is present.
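If pulling from Inventaire were preferred, extracting the Wikidata id from the URL could be as simple as this sketch (assuming entity URLs of the shape https://inventaire.io/entity/wd:Q892, which is how Inventaire links Wikidata-backed entities):

import re

def wikidata_id_from_inventaire_url(url: str):
    # Match the "wd:Q..." segment of an Inventaire entity URL
    match = re.search(r"/entity/wd:(Q\d+)", url)
    return match.group(1) if match else None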

hughrun commented 10 months ago

Not everything is on Inventaire so I think it makes sense to make it a separate field.

ccamara commented 10 months ago

Excellent! In that case, I think PR #3275 is ready for review (albeit a very humble one: it simply creates a text field labelled "wikidata" and displays it on the author's profile. Nothing else.)

I wouldn't know how to implement the rest of the features in this issue yet.