HarelM opened this issue 4 years ago
My use case: I'm using geosearch to get all the points in a certain area. For each point I get the extended data I need and store it in the database (a mirroring of sorts, only with the data I truly need - pages with a geo location). Later on I would like to know which items were updated or added from a specific point in time. I'm not sure if there's an easy API to know what was added and what was updated given a specific date, and then I'll need to test which pages have a geo location, or get the revisions list of the geosearch results. In any case, I need to do an incremental database update given a specific date. Any advice would be welcome :-)

I haven't found an option to add more props to the geosearch generator. Here's an example query:

/w/api.php?action=query&format=json&prop=coordinates%7Cpageimages%7Crevisions&generator=geosearch&ggscoord=37.7891838%7C-122.4033522

The same query in the API sandbox: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=coordinates%7Cpageimages%7Crevisions&generator=geosearch&ggscoord=37.7891838%7C-122.4033522

I'm not sure this is the right solution though...
For updated pages, you can try the recentchanges generator, i.e. RecentChangesGenerator. You can use RecentChangesGenerator.TypeFilters to only include created pages, and StartTime and EndTime to specify the time range of interest. Then you can use EnumPagesAsync or its overload to retrieve a list of WikiPages. Since you want to fetch the geolocation, make sure to pass in a customized WikiPageQueryProvider with a GeoCoordinatesPropertyProvider (see the wiki); then you should have the geolocation for the enumerated pages.
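Something like this (untested) sketch, where site is your WikiSite instance; the exact filter enum name (RecentChangesFilterTypes.Create) and the time-range property names should be double-checked against the library version you use:

```csharp
using System;
using System.Linq;
using WikiClientLibrary.Generators;
using WikiClientLibrary.Pages.Queries;
using WikiClientLibrary.Pages.Queries.Properties;

// Untested sketch: enumerate pages created in the last day, with geolocation attached.
var rcGenerator = new RecentChangesGenerator(site)
{
    StartTime = DateTime.UtcNow.AddDays(-1),        // time range of interest
    EndTime = DateTime.UtcNow,
    TypeFilters = RecentChangesFilterTypes.Create,  // only include created pages
};
var provider = new WikiPageQueryProvider
{
    // Ask for geolocation in the same generator query.
    Properties = { new GeoCoordinatesPropertyProvider() }
};
var pages = await rcGenerator.EnumPagesAsync(provider).ToListAsync();
```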
However, I'm not sure what will happen if a page has been updated multiple times during your specified time range when using EnumPagesAsync. Alternatively, you may also try using RecentChangesGenerator.EnumItemsAsync, remove the duplicates, and then fetch the WikiPages manually (see "batch fetching" on the wiki).
Thanks, I'm not worried about multiple updates since I only need the pages' IDs. From there I send a request to get a page with all the fields I need in order to mirror it. I guess I'll start with the "dumb" approach - get all the page IDs that changed from a specific date and go over each ID to get the full page. If the page doesn't have a geolocation I'll skip it. If this doesn't give good results I'll see if I can optimize it using the stuff you wrote above...
Ok, so basically I need to choose which one to run first, i.e.:

1. Run a changes query and add coordinates to know whether I'm within a bounding box.
2. Run a BBox query and check when the last modification was.

Both options require two steps as far as I understand, in terms of getting the data and then filtering it. Since the BBox I need to query is relatively small, I think the second option is faster. I just tried to get all the changes in the Israel BBox in the last day on the en wikipedia, and it took 6 minutes just to get the list of pages, which is a long time, I think... The he wikipedia takes around 30 seconds to get the list of pages.

> Run a changes query and add coordinates to know whether I'm within a bounding box.

This surely would be slow, as there are a lot of changes taking place on WP every minute (I didn't check the actual number).

> Run a BBox query and check when the last modification was.

I think this approach is better, too. You just need to keep track of the coordinates of the pages since your last visit, so you can discover whether there are pages that have moved out of your BBox.
True, I'll need to track deleted pages and pages that moved out of the BBox.
Is there a way to add properties to the main query of GeoSearchGenerator, e.g. prop=revisions? Do I need to use EnumPagesAsync for this? Will it then create another query for each page, or will it use the main query and just add properties?
Thanks again for all the help and the super quick response!
The intention of EnumPagesAsync is to provide a way for you to leverage MW "generators", i.e. retrieving page objects (like action=query&title=... responses) from MW lists in a single API request, instead of retrieving page titles/ids from MW lists and then sending another request (with action=query&title=...) to retrieve the page objects.

It's up to you to decide whether to leverage this method. Sometimes it may be worthwhile to use EnumItemsAsync to fetch page titles and ids only (and maybe some other list-specific properties), do some pre-processing on the list items (e.g. remove dups), and then use a separate call (the RefreshAsync extension method) to fetch the pages, as sketched below.
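For example, with the recentchanges list, the two-step pattern might look roughly like this (a sketch only; the exact item properties and RefreshAsync overloads should be checked against the library):

```csharp
using System.Linq;
using WikiClientLibrary.Generators;
using WikiClientLibrary.Pages;

// Step 1: fetch only the list items (titles/ids and list-specific properties).
var items = await rcGenerator.EnumItemsAsync().ToListAsync();

// Step 2: pre-process, e.g. de-duplicate pages that were edited several times...
var titles = items.Select(rc => rc.Title).Distinct().ToList();

// ...then batch-fetch the page objects in a separate call.
var pages = titles.Select(t => new WikiPage(site, t)).ToList();
await pages.RefreshAsync();
```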
> Is there a way to add properties to the main query of GeoSearchGenerator, e.g. prop=revisions?
If you use EnumPagesAsync, I suppose you should already have basic revision information for the latest revision by default, excluding the revision content. Do you need other information? Generally, you can pass a WikiPageQueryProvider instance to the EnumPagesAsync method, and the wiki page has an example of how to construct a WikiPageQueryProvider.
The plot thickens :-)
When using a GeoSearchGenerator and EnumItemsAsync I'm getting only 500 items for a specific area (related to #64) in the "he" wikipedia:

```csharp
var delta = 0.15;
var results = _gateway.GetByBoundingBox(new Coordinate(34.75, 32), new Coordinate(34.75 + delta, 32 + delta), "he").Result;
```

When using almost the same code but with EnumPagesAsync I'm getting 25,000 results, most of which do not have coordinates - this means I can't really use EnumPagesAsync with GeoSearchGenerator :-(
See here:
```csharp
var geoSearchGenerator = new GeoSearchGenerator(_wikiSites[language])
{
    BoundingRectangle = GeoCoordinateRectangle.FromBoundingCoordinates(southWest.X, southWest.Y, northEast.X, northEast.Y),
    PaginationSize = 500,
};
var results = await geoSearchGenerator//.EnumItemsAsync().ToListAsync();
    .EnumPagesAsync(new WikiPageQueryProvider
    {
        Properties =
        {
            new ExtractsPropertyProvider { AsPlainText = true, IntroductionOnly = true, MaxSentences = 1 },
            new PageImagesPropertyProvider { QueryOriginalImage = true },
            new GeoCoordinatesPropertyProvider { QueryPrimaryCoordinate = true },
            new RevisionsPropertyProvider { FetchContent = false }
        }
    }).ToListAsync();
```
OK, I went a step further and checked the requests using Fiddler. The following two requests unfortunately give different results - one uses a generator and the other uses a list. The one using a list (the first one) returns only geo-tagged pages, while the other one also returns other, probably unrelated pages... :-(

https://he.wikipedia.org/w/api.php?format=json&action=query&maxlag=5&list=geosearch&gsradius=10&gsprimary=primary&gslimit=500&gsbbox=32.15%7C34.75%7C32%7C34.9
It seems like coordinates are added to only 10 pages of the query result - when the query page size is 10 (the default) it works as expected, which is how the API sandbox shows the results, but when setting it to 500 it doesn't :-( This is not surprising, as it seems that geosearch is not maintained well...

https://stackoverflow.com/questions/35826469/how-to-combine-two-wikipedia-api-calls-into-one/35830161
https://stackoverflow.com/questions/24529853/how-to-get-more-info-within-only-one-geosearch-call-via-wikipedia-api/32916451
> It seems like coordinates are added to only 10 pages of the query result
prop=coordinates also has a pagination setting, colimit, with 10 as the default value. This default is used when you are using WikiPageGenerator, as colimit is not specified at all. This means you will have at most 10 coordinates per request, and there will be a continuation (for the coordinates list) in the MW API response. Though you may have more than 10 pages in the page results (generator=geosearch&ggslimit=500), the pages beyond the first 10 will have an empty coordinates property, awaiting you to continue the query. Actually, there are 2 sets of continuation tokens (dual continuation) in the MW API response, and frankly the WikiPageGenerator in my library cannot handle this case very well. However, you can avoid this case by using EnumItemsAsync and RefreshPagesAsync.
```json
{
  "continue": {
    "excontinue": 20,
    "picontinue": 128475,
    "cocontinue": "8670|13334822",
    "continue": "||revisions"
  },
  "query": {
    "pages": {
      "1225": {
```
As I've mentioned in #69, there is some basic logic in RefreshPagesAsync to merge the prop list when some props need pagination (such as prop=coordinates). Thus I think the best you can do for now is to use GeoSearchGenerator.EnumItemsAsync, so you have a list of page ids/titles, then construct an IEnumerable<WikiPage> sequence and call RefreshPagesAsync on it.
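In code, that workaround could look roughly like this (a sketch based on the snippets above; the result-item shape and the name of the refresh extension may differ between library versions):

```csharp
using System.Linq;
using WikiClientLibrary.Generators;
using WikiClientLibrary.Pages;
using WikiClientLibrary.Pages.Queries;
using WikiClientLibrary.Pages.Queries.Properties;

// Step 1: list=geosearch returns only geo-tagged pages, and there is no
// prop continuation to worry about at this stage.
var items = await geoSearchGenerator.EnumItemsAsync().ToListAsync();

// Step 2: build WikiPage objects from the returned titles and refresh them in
// batch; the refresh call (RefreshAsync / RefreshPagesAsync, depending on the
// library version) merges the prop list and follows the prop pagination.
var pages = items.Select(i => new WikiPage(site, i.Page.Title)).ToList();
await pages.RefreshAsync(new WikiPageQueryProvider
{
    Properties =
    {
        new GeoCoordinatesPropertyProvider { QueryPrimaryCoordinate = true },
        new RevisionsPropertyProvider { FetchContent = false }
    }
});
```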
Yeah, I figured it out yesterday after digging into your code and seeing in Fiddler that the number of pages scrolled per request is about 5 when doing a RefreshPagesAsync for the properties I needed.

I have managed to reduce the time the mirroring process takes to around 2 minutes, which is very good from my point of view.
Code can be seen here:
https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.API/Services/Poi/WikipediaPointsOfInterestAdapter.cs#L81-L96
I basically wrapped the refresh-pages call with a parallel loop, since refreshing pages does its job sequentially - i.e. when sending a lot of pages to be fetched and the page scroll is 10 or 5, it takes a long time to fetch all the pages (around 12K in my case).

It might be worth adding an option to parallelize the refresh process in cases of a low scroll value and a high number of pages. I'm not sure how it fits into the architecture of this project, but I found myself doing just that in the above code - something like the sketch below.
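For reference, a rough sketch of that batching idea (not the exact code from the linked repository; the batch size and degree of parallelism are made-up numbers):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using WikiClientLibrary.Pages;

// Split the pages into fixed-size batches and refresh a few batches concurrently,
// so a low per-request scroll value doesn't serialize the whole fetch.
const int batchSize = 50;    // pages per refresh call (assumption)
const int maxParallel = 4;   // concurrent refresh calls (assumption)

List<List<WikiPage>> batches = pages
    .Select((page, index) => (page, index))
    .GroupBy(x => x.index / batchSize, x => x.page)
    .Select(g => g.ToList())
    .ToList();

using var throttler = new SemaphoreSlim(maxParallel);
await Task.WhenAll(batches.Select(async batch =>
{
    await throttler.WaitAsync();
    try { await batch.RefreshAsync(); }   // or RefreshPagesAsync, per version
    finally { throttler.Release(); }
}));
```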
Thanks again for all the explanations and great library! Feel free to close this issue if you feel there's nothing to be done in this case.