That has really been a problem. In the latest pre-release (v0.6.0-int10), I added WikiPageStub.FromPageIds, which allows you to fetch a batch of WikiPageStubs via their IDs. You can later use WikiPageStub.Title to find out the page titles, and WikiPageStub.IsMissing to determine whether a page is missing.
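For example, a minimal sketch of the intended usage (assuming `site` is an initialized WikiSite and the same async-enumerable extensions used in the examples below):

```csharp
// Fetch stubs for a batch of page IDs in one go.
var stubs = await WikiPageStub.FromPageIds(site, new[] { 123, 456 }).ToList();
foreach (var stub in stubs)
{
    // Title is resolved from the ID; IsMissing indicates the ID did not match a page.
    Console.WriteLine(stub.IsMissing ? "(missing)" : stub.Title);
}
```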
Note that to install the int10 pre-release, you need to select the version manually. I was surprised to find that NuGet treats -int10 as an earlier version than -int2 (pre-release tags are compared lexically, so "int10" sorts before "int2"). Perhaps I will need to name the next pre-release intX1 instead.
While WikiPage might later support fetching page information directly from a sequence of page IDs, I think it is better to keep the "fetch page via ID" logic explicitly separated from WikiPage (in this case, in the WikiPageStub.FromPageIds method), because it may cause confusion if we refresh a WikiPage whose Id and Title are inconsistent. (E.g. when the page with that title happens to have been moved away, the page with that title will be missing, but the page with the old ID still exists. Currently, WikiPage always refreshes the page information by its Title property.)
I see. Well, I ended up understanding that I need to run the following query:

https://he.wikipedia.org/w/api.php?format=json&action=query&pageids={pageid}&prop=extracts|pageimages|coordinates&explaintext=true&exintro=true&exsentences=1

and parse the response. Is there a way to do it with this library?
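Today I do this with a plain HttpClient, roughly like the following sketch (assuming Newtonsoft.Json for parsing, with a hypothetical hard-coded page ID for illustration):

```csharp
using System;
using System.Net.Http;
using Newtonsoft.Json.Linq;

var client = new HttpClient();
var pageId = 12345; // hypothetical page ID
var url = "https://he.wikipedia.org/w/api.php?format=json&action=query"
          + $"&pageids={pageId}&prop=extracts|pageimages|coordinates"
          + "&explaintext=true&exintro=true&exsentences=1";
var json = JObject.Parse(await client.GetStringAsync(url));
// The response nests each page under query.pages.<pageid>.
var page = json["query"]["pages"][pageId.ToString()];
Console.WriteLine(page["title"]);
Console.WriteLine(page["extract"]);
```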
Released v0.6-intX1.
Added two flags in PageQueryOptions. You can specify them when querying for the page. See the examples here. Still, you will need WikiPageStub.FromPageIds to fetch the titles.
And… oops, I forgot to implement pageimages. I will do it in the next pre-release. You might need a placeholder image in the meantime.
Not sure I fully understand your comment. More examples in the readme.md file would probably serve me and others :-) I'm currently using a simple HttpClient to do the work related to Wikipedia until all my requirements are met.

The following is also a query I use, to fetch the locations of all the points I want to show (the previous query is for when I want to present more info on a single point). Not sure how I can do it with this client library…

https://he.wikipedia.org/w/api.php?format=json&action=query&list=geosearch&gsradius=10000&gscoord=32|35&gslimit=1000
First of all, I'm not sure why my replies are ahead of yours, Harel. I cannot put them at the bottom…

> More examples in the readme.md file would probably serve me and others :-)

Sorry about that… I agree with you on the aspect of examples. However, it might wear me out to work on a complete set of examples (it would drain much, much energy from me 🌚), especially when the library is not stable enough at this point. I wish demonstrating the usage of the library via unit tests were enough, but I doubt it is. Perhaps I will try to add more examples to ConsoleTestApplication.

> Not sure I fully understand your comment.
As for your current problem, I meant: given a page ID, you can fetch its geolocation and extract using

```csharp
var stub = await WikiPageStub.FromPageIds(site, new[] {123}).First();
var page = new WikiPage(site, stub.Title);
await page.RefreshAsync(PageQueryOptions.FetchExtract | PageQueryOptions.FetchGeoCoordinate);
Console.WriteLine(page.Extract);
Console.WriteLine(page.PrimaryCoordinate);
```
If you are working with a sequence of page IDs:

```csharp
var stubs = await WikiPageStub.FromPageIds(site, new[] {123, 456, 789}).ToList();
var pages = stubs.Select(s => WikiPage.FromTitle(site, s.Title)).ToList();
await pages.RefreshAsync(PageQueryOptions.FetchExtract | PageQueryOptions.FetchGeoCoordinate);
```
If you are doing geosearch, you can use GeoSearchGenerator:

```csharp
var gen = new GeoSearchGenerator(site) { TargetCoordinate = new GeoCoordinate(47.01, 2), Radius = 2000, PaginationSize = 20 };
// Gets a sequence of `WikiPageStub`s in combination with `GeoCoordinate` and distance
var results1 = await gen.EnumItemsAsync().Take(20).ToList();
// Gets a sequence of `WikiPage`s from the search result, with basic page information, but without geolocation information
var results2 = await gen.EnumPagesAsync().Take(20).ToList();
// Gets a sequence of `WikiPage`s with fetched geolocation and extract
var results3 = await gen.EnumPagesAsync(PageQueryOptions.FetchGeoCoordinate | PageQueryOptions.FetchExtract).Take(20).ToList();
// Print something out
Console.WriteLine(results3[0].PrimaryCoordinate);
```
Hope that helps…
Thanks for the detailed examples and code explanations! I'll take a look at it later tonight and let you know if it suits my needs.
Released v0.6-intX2.

- Added a WikiPage(WikiSite site, int id) constructor that allows you to initialize a page from its page ID. Don't forget to call WikiPage.RefreshAsync if you want to fetch other information from the server.
- Added PageImagesPropertyProvider and ExtractsPropertyProvider to fetch page images and extracts. These classes have properties that allow you to configure the query (e.g. how many sentences you'd like to fetch, how large the thumbnails should be, etc.); see the sketch below.
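For instance, a minimal sketch of how these pieces might fit together (this assumes a provider-based RefreshAsync overload and GetPropertyGroup accessors along the lines of the WikiPageQueryProvider design; the exact names in this pre-release may differ, and `site` is an initialized WikiSite):

```csharp
using System;
using WikiClientLibrary.Pages;
using WikiClientLibrary.Pages.Queries;
using WikiClientLibrary.Pages.Queries.Properties;

var provider = new WikiPageQueryProvider
{
    Properties =
    {
        // One plain-text intro sentence, like exintro=&explaintext=&exsentences=1.
        new ExtractsPropertyProvider { AsPlainText = true, IntroductionOnly = true, MaxSentences = 1 },
        // A thumbnail of the page image rather than the original file.
        new PageImagesPropertyProvider { QueryOriginalImage = false, ThumbnailSize = 160 },
    }
};
var page = new WikiPage(site, 12345);   // initialize from a page ID
await page.RefreshAsync(provider);
Console.WriteLine(page.GetPropertyGroup<ExtractsPropertyGroup>().Extract);
Console.WriteLine(page.GetPropertyGroup<PageImagesPropertyGroup>().ThumbnailImage?.Url);
```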
I have also made a GeoSearch example; you can open this workbook with Xamarin Workbooks, or simply paste the code into the C# Interactive window to test it out.
However, v0.6-intX2 involves some API changes that would break your code; for example, file uploads now need to be performed via an extension method:

```csharp
using WikiClientLibrary.Files;

WikiSite site;
await site.UploadAsync("Title.jpg", ......);
```
You can also take a look at the example here.
Thanks for the update! I'm using the following code to sweep over Israel and get all the points: https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.API/Services/Poi/WikipediaPointsOfInterestAdapter.cs#L78

I'll try to migrate the relevant code to use the library with the latest changes, but bear in mind that when I ran my code with 1000 concurrent tasks, I got a timeout exception from HttpClient, which I needed to make sure I catch because it isn't a real timeout issue. See this code: https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.DataAccess/WikipediaGateway.cs#L90
I think you might have initiated too many concurrent requests… The Wikipedia server might notice you and throttle your requests. Why not queue up the tasks, or more precisely, use something like SemaphoreSlim to implement a producer-consumer pattern and limit the concurrency? Though it may take some more time, of course :new_moon_with_face:
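Something like this, for example (a sketch; `pageIds` is the ID collection you sweep over, and FetchPageAsync is a hypothetical stand-in for whatever per-page request you make):

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var throttler = new SemaphoreSlim(8);   // allow at most 8 requests in flight
var tasks = pageIds.Select(async id =>
{
    // Each task waits for a slot before issuing its request.
    await throttler.WaitAsync();
    try
    {
        return await FetchPageAsync(id);  // hypothetical per-page request
    }
    finally
    {
        throttler.Release();              // free the slot for the next task
    }
});
var results = await Task.WhenAll(tasks);
```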
The providers are a really neat way of getting the extra data dynamically, kudos on the idea! I was able to migrate my code and remove all the JSON object files, thanks again!
I was browsing through the code but could not find a way to get a page by its page ID. This can probably be achieved with a query generator, but that seems a bit too complicated for something I believe should be part of the WikiPage constructor/factory. Am I missing something?