crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Add method ListURLs to list all URLs known in the frontier with their next fetch date #93

Open klockla opened 1 month ago

klockla commented 1 month ago

Added a ListURLs method to the API and client to list all URLs known in the frontier with their next fetch date (RocksDB & MemoryFrontier implementations only).
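In gRPC terms this is a server-streaming call. Below is a minimal sketch of that shape; the message names, field numbers and return type are placeholders I chose for illustration, not the exact definitions in the PR's proto changes.

```proto
// Illustrative sketch only -- names and field numbers are assumptions,
// not the definitions actually added by this PR.
syntax = "proto3";

// Minimal paging parameters for walking the frontier.
message ListUrlParams {
  uint32 start = 1; // position of the first result in the list
  uint32 size = 2;  // maximum number of results to return
}

// One known URL together with the date (epoch seconds) at which the
// frontier will next consider it fetchable.
message ListedURL {
  string url = 1;
  uint64 next_fetch_date = 2;
}

service URLFrontier {
  // Server-streaming RPC: the frontier emits the URLs it knows about,
  // page by page, instead of building one large response message.
  rpc ListURLs(ListUrlParams) returns (stream ListedURL) {}
}
```

Returning a stream rather than a single message keeps memory bounded on both sides, which matters once the frontier holds millions of URLs.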

jnioche commented 2 weeks ago

I see that this contains the changes from #92; it would be better to make it independent if possible (I appreciate that this is still in draft status).

klockla commented 1 week ago

I rebased and updated this PR; I think it's ready to be reviewed now.

jnioche commented 1 week ago

I had misread what this does; I thought it would return all the URLs within a single queue, which would be useful for debugging. Instead, this streams the entire content of the frontier in pages of e.g. 100 URLs. What do you want to achieve with this? Backups? Debugging? Should we consider an export endpoint instead? If so, what format would it use to represent all the data, including the metadata? Should the service implementations be responsible for writing the data and communicating the location of the file to the client?

klockla commented 1 week ago

Its main purpose was debugging, but we will probably also use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would also imply an import feature).

jnioche commented 1 week ago

> Its main purpose was debugging, but we will probably also use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would also imply an import feature).

Thanks for the explanation @klockla. In a sense, PutURLs is already the import feature, but we could add one where a file is made available to the server locally and ingested as a batch. This would be quicker than streaming from the client; I think @michaeldinzinger did something similar with the OpenSearch backend he uses at OpenWebSearch.

Going back to our primary topic, would it be OK to list the URLs per queue only? From a client's perspective you can list the queues separately and get the URLs for each one. This would be equivalent to paging in a sense.

Doing so would make more sense to me as in most cases you'd want to debug per queue. What do you think?

klockla commented 1 week ago

What do you think of the following:

We keep the pagination and add the possibility to restrict to a given queue (if none specified, we will go over all of them).

Parameters would look something like:

```proto
message ListUrlParams {
  // position of the first result in the list; defaults to 0
  uint32 start = 1;
  // max number of values; defaults to 100
  uint32 size = 2;
  // ID for the queue
  string key = 3;
  // crawl ID
  string crawlID = 4;
  // only for the current local instance
  bool local = 5;
}
```
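To spell out how I'd expect it to be used: a client pages by calling repeatedly with an increasing start until fewer than size results come back; an empty key keeps the current behaviour of going over all queues, while setting key restricts the listing to a single queue. That should cover both the debugging and the UI browsing cases.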

jnioche commented 1 week ago

> What do you think of the following:
>
> We keep the pagination and add the possibility to restrict to a given queue (if none specified, we will go over all of them). Parameters would look something like the ListUrlParams message above.

Yes, that would work I think.

klockla commented 6 days ago

Updated the PR to include the queue parameter.