crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Add method to get URL Status (returns an URLItem) #92

Closed klockla closed 1 week ago

klockla commented 1 month ago

Add a new API method to retrieve information about a URL

 /** Get the status of a particular URL.
     This does not take URL scheduling into account.
     Used to check the current status of a URL within the frontier.
 **/
 rpc GetURLStatus(URLStatusRequest) returns (URLItem) {}

Implemented only for MemoryFrontier and RocksDb (may partially fulfill https://github.com/crawler-commons/url-frontier/issues/57).

Unfortunately, the internal storage doesn't make a distinction between Discovered URLs and Known URLs which have to be refetched (or I have missed the point).

So all scheduled items will be returned as a KnownURLItem (with a refetch date equal to 0 for completed items). If the URL is not in URLFrontier, the method will return io.grpc.Status.NOT_FOUND.asRuntimeException().
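The lookup semantics described above can be sketched without any gRPC dependency. This is only an illustration under stated assumptions: `UrlStatusSketch`, `KnownItem`, `put` and `getStatus` are hypothetical names, not classes from the project, and a plain `NoSuchElementException` stands in for the `NOT_FOUND` status that the real method reports.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in types; none of these are the real URL Frontier classes.
class UrlStatusSketch {

    /** Simplified stand-in for a KnownURLItem: a URL plus its refetch date. */
    static final class KnownItem {
        final String url;
        final long refetchableFromDate; // 0 means the URL is completed

        KnownItem(String url, long refetchableFromDate) {
            this.url = url;
            this.refetchableFromDate = refetchableFromDate;
        }

        @Override
        public String toString() {
            return url + " refetchable from " + refetchableFromDate;
        }
    }

    // URL -> refetch date; Discovered and Known URLs are not distinguished here.
    private final Map<String, Long> store = new HashMap<>();

    void put(String url, long refetchableFromDate) {
        store.put(url, refetchableFromDate);
    }

    /**
     * Returns the stored item whether or not it is currently due for fetching,
     * i.e. scheduling is ignored.
     */
    KnownItem getStatus(String url) {
        Long date = store.get(url);
        if (date == null) {
            // The PR reports this case as io.grpc.Status.NOT_FOUND; a plain
            // exception keeps this sketch free of external dependencies.
            throw new NoSuchElementException("URL not in frontier: " + url);
        }
        return new KnownItem(url, date);
    }

    public static void main(String[] args) {
        UrlStatusSketch frontier = new UrlStatusSketch();
        frontier.put("https://example.com/done", 0L);
        frontier.put("https://example.com/later", 1_900_000_000L);

        System.out.println(frontier.getStatus("https://example.com/done"));
        System.out.println(frontier.getStatus("https://example.com/later"));
        try {
            frontier.getStatus("https://example.com/missing");
        } catch (NoSuchElementException e) {
            System.out.println("not found: " + e.getMessage());
        }
    }
}
```

A caller of the real endpoint would presumably distinguish the "unknown URL" case by inspecting the gRPC status code rather than catching `NoSuchElementException`.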

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

jnioche commented 2 weeks ago

Thanks @klockla. Looks good at this stage, but I think it needs an addition to the client so that we can query the new endpoint and display the status of a URL.

klockla commented 1 week ago

See the comment in the conversation regarding the client side.

Added the method in client.

jnioche commented 1 week ago

Thanks a lot @klockla. I gave it a try and it seems to work fine. Let me know what you think of my comments and suggestions above.

jnioche commented 1 week ago

Tested, works great! Thanks @klockla, this is a great contribution to the project.