Feature request: add wikisource

dpriskorn commented 3 years ago

* Investigate license
* investigate api for getting wikitext
* Does the wikitext need to be cleaned?
* Is everything in the English Wikisource in English?
* Does it have structured data?
* How to find corresponding qid for a given page in a search result?

dpriskorn commented 3 years ago

The license is compatible.

https://en.wikisource.org/wiki/The_First_Men_in_the_Moon/Chapter_1 has a hit for "sit". How do we find the qid?

* Split the page name by /
* Look up the qid for [0]

How do I do that in the API? https://stackoverflow.com/questions/37024807/how-to-get-wikidata-id-for-an-wikipedia-article-by-api

Now we have the qid. Add it using "stated in".

Should we add more information, for example to pinpoint the chapter? Yes, that's a good idea. Parse the URL and find the chapter number:

* Split by /
* Search for "chapter"
* Extract the number/title (remove the list item with "chapter" and join the rest of the list into a string)
* Add it to https://www.wikidata.org/wiki/Property:P792 (chapter)
* Add the URL as reference URL (https://www.wikidata.org/wiki/Property:P854)
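The URL-splitting steps above could be sketched roughly as follows. The function name `split_wikisource_url` is hypothetical, and the sketch assumes the chapter subpage always starts with the word "Chapter", which may not hold for every work.

```python
from urllib.parse import unquote, urlsplit


def split_wikisource_url(url):
    """Return (work_title, chapter) for a Wikisource page URL.

    chapter is None when no subpage segment starts with "Chapter".
    """
    path = urlsplit(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ page URL: {url}")
    # Subpages are separated by "/" in the page name; decode each part afterwards.
    parts = [unquote(part) for part in path[len(prefix):].split("/")]
    # parts[0] is the work itself, which is what we look up the qid for.
    work_title = parts[0].replace("_", " ")
    chapter = None
    for part in parts[1:]:
        if part.lower().startswith("chapter"):
            # Drop the word "Chapter" and keep the number or title after it.
            chapter = part[len("chapter"):].replace("_", " ").strip() or None
            break
    return work_title, chapter


print(split_wikisource_url(
    "https://en.wikisource.org/wiki/The_First_Men_in_the_Moon/Chapter_1"
))
# ('The First Men in the Moon', '1')
```

The chapter is returned as a string because P792 (chapter) takes a string value, and some chapters may carry a title rather than a number.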

dpriskorn commented 3 years ago

For some reason not all works have a qid. For example: https://en.wikisource.org/wiki/Remarks_by_President_Trump_and_First_Lady_Melania_Trump_to_United_States_Military_Personnel_at_Naval_Air_Station_Sigonella

We need to either fix that or ignore them.
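One way to handle both cases is the approach from the Stack Overflow answer linked above: ask the wiki's own API for `prop=pageprops` with `ppprop=wikibase_item`, and treat a missing `pageprops` entry as "no qid". This is only a sketch: the helper names `pageprops_params` and `qid_from_pageprops` are made up here, and the sample response is hand-written to show the shape, not real API output.

```python
def pageprops_params(title):
    """Query parameters for https://en.wikisource.org/w/api.php."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }


def qid_from_pageprops(response):
    """Return the qid from a pageprops response, or None if the page has none."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return None


# Hand-written response for a page without a Wikidata item: no "pageprops" key.
no_item = {"query": {"pages": {"2280959": {"pageid": 2280959, "ns": 0}}}}
print(qid_from_pageprops(no_item))  # None
```

Returning `None` instead of raising lets the caller decide whether to skip the page or fall back to a more generic reference.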

belett commented 3 years ago

Hi,

Here are some ideas (hopefully helpful):

* Investigate license
  License is not a problem; texts on Wikisource are PD most of the time and sometimes under a free license, but never under copyright.
* Investigate API for getting wikitext
  Did you look at how Ordia does it? See for instance https://ordia.toolforge.org/L69-F1, where Wikisource is queried in SPARQL via the MediaWiki API.
* Does the wikitext need to be cleaned?
  Yes, templates and wikisyntax should be removed to keep the "raw" text (but I guess the API can do it? See above).
* Is everything in the English Wikisource in English?
  Yes, almost. There is some dialectal and time variation (for instance Old English, Middle English), but it's close to English and it's a limited number of texts.
* Does it have structured data?
  Kind of...
* How to find corresponding qid for a given page in a search result?
  With difficulties. A lot of pages indeed don't have a Qid (and some say they are not admissible enough for a Qid per WD:N... :/ ). Probably best to just skip them...

If you have more questions, don't hesitate. I would love to help integrate more Wikisource into Lexemes!

dpriskorn commented 3 years ago

* Investigate API for getting wikitext
  Did you look at how Ordia does it? See for instance https://ordia.toolforge.org/L69-F1, where Wikisource is queried in SPARQL via the MediaWiki API.

Thanks for the tip! That makes my life much easier. @fnielsen is breaking new ground again! Much of what I have done is inspired by Ordia :)

This is the query for English Wikisource:

 SELECT ?title ?titleUrl ?snippet WHERE {
  SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:api "Search" .
      bd:serviceParam wikibase:endpoint "en.wikisource.org" .
      bd:serviceParam mwapi:srsearch "test" .
      bd:serviceParam mwapi:language "en" .
      ?title wikibase:apiOutput mwapi:title .
      ?snippet_ wikibase:apiOutput "@snippet" .
  }
  hint:Prior hint:runFirst "true" .
  BIND(CONCAT("https://en.wikisource.org/wiki/", ENCODE_FOR_URI(?title)) AS ?titleUrl)
  BIND(REPLACE(REPLACE(?snippet_, '</span>', ''), '<span class="searchmatch">', '') AS ?snippet)
}
LIMIT 50
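The two nested REPLACE calls in the query strip the search-match markers from each snippet. If the snippets are instead fetched raw, the same cleanup can be done client-side. A small sketch, where `clean_snippet` is a hypothetical helper and the HTML-entity decoding is an assumption about what the snippets may contain:

```python
import html
import re

# Matches the opening and closing <span> tags the search API wraps around hits.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")


def clean_snippet(snippet):
    """Remove search-match <span> markers and decode HTML entities."""
    return html.unescape(SPAN_TAG.sub("", snippet))


print(clean_snippet('Please <span class="searchmatch">sit</span> down'))
# Please sit down
```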

dpriskorn commented 3 years ago

Here are some ideas (hopefully helpful):

Thanks they are most helpful :)

* Investigate license
  License is not a problem; texts on Wikisource are PD most of the time and sometimes under a free license, but never under copyright.
...

Does the wikitext need to be cleaned? Yes, templates and wikisyntax should be removed to keep the "raw" text (but I guess the API can do it? see above).

This needs to be investigated. I guess in a first POC version we can ignore this and tell the user to clean it manually.
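If some automatic cleaning turns out to be wanted even in the POC, a very crude regex pass might be enough to start with. This is only a sketch with a made-up helper name, `strip_wikitext`. It ignores tables, keeps the text inside `<ref>` tags, and will mangle text containing literal angle brackets, so a real wikitext parser such as mwparserfromhell would likely be more robust.

```python
import re


def strip_wikitext(text):
    """Crudely remove templates, link markup, quote runs and tags."""
    # Remove innermost templates repeatedly so nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold/italic markers are runs of two or more apostrophes.
    text = re.sub(r"'{2,}", "", text)
    # Drop HTML-like tags such as <br/>; their inner text is kept.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse the spaces left behind.
    return re.sub(r"[ \t]+", " ", text).strip()


print(strip_wikitext("{{header|title=X}}I ''sit'' in the [[Moon|moon]]."))
# I sit in the moon.
```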

* Is everything in the English Wikisource in English?
  Yes, almost.
  There is some dialectal and time variation (for instance Old English, Middle English), but it's close to English and it's a limited number of texts.

Okay.

* Does it have structured data?
  Kind of...

* How to find corresponding qid for a given page in a search result?
  With difficulties. A lot of pages indeed don't have a Qid (and some say they are not admissible enough for a Qid per WD:N... :/ ). Probably best to just skip them...

We could look up the title via SPARQL and find them that way :)

If no QID is returned we simply add:

* ref -> stated in -> English Wikisource Q15156406
* reference URL -> https://en.wikisource.org/wiki/Remarks_by_President_Trump_and_First_Lady_Melania_Trump_to_United_States_Military_Personnel_at_Naval_Air_Station_Sigonella

But maybe that is not good enough, because we would want the current revision, wouldn't we? How do we find the latest revision easily?

The latest revision ID can be fetched using: https://en.wikisource.org/w/api.php?action=query&format=json&prop=revisions&titles=Remarks%20by%20President%20Trump%20and%20First%20Lady%20Melania%20Trump%20to%20United%20States%20Military%20Personnel%20at%20Naval%20Air%20Station%20Sigonella&rvprop=ids&rvlimit=1&rvdir=older

From WBI it can be fetched using: https://github.com/LeMyst/WikibaseIntegrator/blob/36ef3ee86dc72761162bc912a4dee9f367598b8e/wikibaseintegrator/wbi_core.py#L938

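from urllib.parse import quote

# Assumed import path: FunctionsEngine as defined in the wbi_core.py linked above
from wikibaseintegrator.wbi_core import FunctionsEngine

title = "Remarks by President Trump and First Lady Melania Trump to United States Military Personnel at Naval Air Station Sigonella"
params = {
    'action': 'query',
    'prop': 'revisions',
    'titles': title,
    'format': 'json',
    'rvprop': 'ids',
    'rvlimit': 1,
    'rvdir': 'older'  # newest revision first
}
data = FunctionsEngine.mediawiki_api_call("GET", mediawiki_api_url="https://en.wikisource.org/w/api.php", data=params)
# "pages" is a dict keyed by page id; we asked for one title, so take the first entry
page = next(iter(data["query"]["pages"].values()))
revid = page["revisions"][0]["revid"]
# build the permanent link to that revision, percent-encoding the title
url = f"https://en.wikisource.org/w/index.php?title={quote(title.replace(' ', '_'))}&oldid={revid}"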

The API call returns:

{"continue":{"rvcontinue":"20170529200826|6837258","continue":"||"},"query":{"pages":{"2280959":{"pageid":2280959,"ns":0,"title":"Remarks by President Trump and First Lady Melania Trump to United States Military Personnel at Naval Air Station Sigonella","revisions":[{"revid":6837263,"parentid":6837258}]}}}}

revid is used to construct the url.

Reference:

* ref -> stated in -> English Wikisource Q15156406
* page -> Remarks by President Trump and First Lady Melania Trump to United States Military Personnel at Naval Air Station Sigonella
* revision -> 6837263
* reference URL -> https://en.wikisource.org/w/index.php?title=Remarks_by_President_Trump_and_First_Lady_Melania_Trump_to_United_States_Military_Personnel_at_Naval_Air_Station_Sigonella&oldid=6837263
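Putting the pieces together, a stdlib-only sketch that can be checked offline against the response shown above. The function names and the plain-dict layout of the reference are illustrative only; how the reference is actually written to Wikidata is left open here.

```python
from urllib.parse import urlencode

ENGLISH_WIKISOURCE_QID = "Q15156406"


def latest_revid(response):
    """Pick the revision id out of a prop=revisions response for one title."""
    # "pages" is a dict keyed by page id, so take its first (only) value.
    page = next(iter(response["query"]["pages"].values()))
    return page["revisions"][0]["revid"]


def permalink(title, revid):
    """Build the index.php URL that pins the page to one revision."""
    query = urlencode({"title": title.replace(" ", "_"), "oldid": revid})
    return f"https://en.wikisource.org/w/index.php?{query}"


def build_reference(title, revid, work_qid=None):
    """Fall back to English Wikisource itself when the work has no qid."""
    return {
        "stated in": work_qid or ENGLISH_WIKISOURCE_QID,
        "page": title,
        "revision": revid,
        "reference URL": permalink(title, revid),
    }


# Trimmed copy of the API response quoted above.
response = {"query": {"pages": {"2280959": {
    "pageid": 2280959,
    "revisions": [{"revid": 6837263, "parentid": 6837258}],
}}}}
print(latest_revid(response))  # 6837263
```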

If you have more questions, don't hesitate. I would love to help integrate more Wikisource into Lexemes!

<3

dpriskorn / LexUse

Feature request: add wikisource #12