CXuesong / WikiClientLibrary

Wiki Client Library is an asynchronous MediaWiki API client library targeting modern .NET platforms
https://github.com/CXuesong/WikiClientLibrary/wiki
Apache License 2.0

Get page by id - simplified #32

Closed: HarelM closed this issue 6 years ago

HarelM commented 7 years ago

I was browsing through the code but could not find a way to get a page by its page ID. This can probably be achieved with a query generator, but that seems a bit too complicated for something I believe should be part of the WikiPage constructor/factory. Am I missing something?

CXuesong commented 7 years ago

That has really been a problem. In the latest pre-release (v0.6.0-int10), I added WikiPageStub.FromPageIds, which allows you to fetch a batch of WikiPageStubs by their IDs. You can then use WikiPageStub.Title to find out the page titles, and WikiPageStub.IsMissing to determine whether a page is missing.

Note that to install the int10 pre-release, you need to select the version manually. I was surprised to find that NuGet treats -int10 as an earlier version than -int2: SemVer compares alphanumeric pre-release identifiers lexically, so "int10" sorts before "int2". Perhaps I will need to name the next pre-release intX1 instead.

While WikiPage might later support fetching page information directly from a sequence of page IDs, I think it is better to keep the "fetch page by ID" logic explicitly separate from WikiPage (in this case, in the WikiPageStub.FromPageIds method), because it could cause confusion if we refreshed a WikiPage whose Id and Title are inconsistent. (E.g. when the page with that title happens to have been moved away, the page with that title will be missing, but the page with the old ID still exists. Currently, WikiPage always refreshes page information by its Title property.)

HarelM commented 7 years ago

I see. Well, I ended up realizing that I need to run the following query:

https://he.wikipedia.org/w/api.php?format=json&action=query&pageids={pageid}&prop=extracts|pageimages|coordinates&explaintext=true&exintro=true&exsentences=1

and parse the response. Is there a way to do it with this library?
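(Editorial note: for reference, the raw query above can be issued with a plain HttpClient while evaluating the library. A minimal sketch; the class and method names here are illustrative, not part of any library:)

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class WikipediaQuery
{
    static readonly HttpClient http = new HttpClient();

    // Rebuilds the query URL shown above for an arbitrary page ID.
    public static string BuildExtractUrl(long pageId) =>
        "https://he.wikipedia.org/w/api.php?format=json&action=query" +
        $"&pageids={pageId}" +
        "&prop=extracts|pageimages|coordinates" +
        "&explaintext=true&exintro=true&exsentences=1";

    // Issues the request and returns the raw JSON payload to parse.
    public static Task<string> FetchRawAsync(long pageId) =>
        http.GetStringAsync(BuildExtractUrl(pageId));
}
```

The response JSON still has to be parsed by hand (e.g. with System.Text.Json), which is exactly the boilerplate the library is meant to remove.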

CXuesong commented 7 years ago

Released v0.6-intX1.

Added two flags in PageQueryOptions. You can specify them when querying for the page. See the examples here. Still, you will need WikiPageStub.FromPageIds to fetch the titles.

And… oops, I forgot to implement pageimages. I will do it in the next pre-release; you might need a placeholder image in the meantime.

HarelM commented 7 years ago

Not sure I fully understand your comment. More examples in the readme.md file would probably serve me and others :-) I'm currently using a plain HttpClient for the Wikipedia-related work until all my requirements are met. The following is another query I use, to fetch the locations of all the points I want to show (the previous query is for presenting more info on a single point). Not sure how I can do it with this client library...

https://he.wikipedia.org/w/api.php?format=json&action=query&list=geosearch&gsradius=10000&gscoord=32|35&gslimit=1000

CXuesong commented 7 years ago

First of all, I'm not sure why my replies appear ahead of yours, Harel. I cannot move them to the bottom…

> More examples in the readme.md file would probably serve me and others :-)

Sorry about that… I agree with you on the aspect of examples.

However, it might wear me out to work on a complete set of examples (it would drain much, much energy from me 🌚), especially while the library is not stable enough at this point. I wished that demonstrating the usage of the library via unit tests were enough, but I doubt it is. Perhaps I will try to

> Not sure I fully understand your comment.

As for your current problem, I meant that, given a page ID, you can fetch its geolocation and extract like this:

var stub = await WikiPageStub.FromPageIds(site, new[] {123}).First();
var page = new WikiPage(site, stub.Title);
await page.RefreshAsync(PageQueryOptions.FetchExtract | PageQueryOptions.FetchGeoCoordinate);

Console.WriteLine(page.Extract);
Console.WriteLine(page.PrimaryCoordinate);

If you are working with a sequence of page IDs:

var stubs = await WikiPageStub.FromPageIds(site, new[] {123, 456, 789}).ToList();
var pages = stubs.Select(s => WikiPage.FromTitle(site, s.Title)).ToList();
await pages.RefreshAsync(PageQueryOptions.FetchExtract | PageQueryOptions.FetchGeoCoordinate);

If you are doing geosearch, you can use GeoSearchGenerator:

var gen = new GeoSearchGenerator(site) {TargetCoordinate = new GeoCoordinate(47.01, 2), Radius = 2000, PaginationSize = 20};
// Gets a sequence of `WikiPageStub`s together with their `GeoCoordinate` and distance
var results1 = await gen.EnumItemsAsync().Take(20).ToList();
// Gets a sequence of `WikiPage`s from the search result, with basic page information, but without geolocation information
var results2 = await gen.EnumPagesAsync().Take(20).ToList();
// Gets a sequence of `WikiPage`s with fetched geolocation and extract
var results3 = await gen.EnumPagesAsync(PageQueryOptions.FetchGeoCoordinate | PageQueryOptions.FetchExtract).Take(20).ToList();

// Print something out
Console.WriteLine(results3[0].PrimaryCoordinate);

Hope that helps…

HarelM commented 7 years ago

Thanks for the detailed examples and code explanations! I'll take a look at it later tonight and let you know if it suits my needs.

CXuesong commented 6 years ago

Released v0.6-intX2.

I have also made a GeoSearch example, you can open this workbook with Xamarin Workbook, or simply paste the code in C# interactive window to test it out.

However, v0.6-intX2 involves some API changes that will break your code; for example, file uploads now need to be performed via an extension method:

using WikiClientLibrary.Files;
WikiSite site;
await site.UploadAsync("Title.jpg", ......);

You can also take a look at the example here.

HarelM commented 6 years ago

Thanks for the update! I'm using the following code to sweep over Israel and fetch all the points: https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.API/Services/Poi/WikipediaPointsOfInterestAdapter.cs#L78

I'll try to migrate the relevant code to the library with the latest changes, but bear in mind that when I ran my code with 1000 concurrent tasks, I got a timeout exception from HttpClient that I had to make sure I catch, because it is not a real timeout issue. See this code: https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.DataAccess/WikipediaGateway.cs#L90

CXuesong commented 6 years ago

I think you might have initiated too many concurrent requests... The Wikipedia server might notice you and throttle your requests. Why not queue up the tasks, or, more precisely, use something like SemaphoreSlim to implement a producer-consumer pattern and limit the concurrency?

Though it may take some more time, of course 🌚
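(Editorial note: the throttling suggestion above can be sketched with SemaphoreSlim. This is a generic .NET pattern, not WikiClientLibrary API; the `Throttled` helper name is illustrative:)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Throttled
{
    // Runs `action` over all items, but allows at most `maxConcurrency`
    // operations in flight at once. Each task acquires the semaphore
    // before starting the real work and releases it when done.
    public static async Task<TResult[]> ForEachAsync<T, TResult>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task<TResult>> action)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();
            try { return await action(item); }
            finally { gate.Release(); }
        });
        return await Task.WhenAll(tasks.ToArray());
    }
}
```

For example, `await Throttled.ForEachAsync(pageIds, 10, id => FetchPageAsync(id))` would keep at most 10 requests in flight, where FetchPageAsync stands for whatever per-page fetch you already have.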

HarelM commented 6 years ago

The providers are a really neat way of getting the extra data dynamically, kudos on the idea! I was able to migrate my code and remove all the JSON object files, thanks again!