klockla opened 1 month ago
I see that this contains the changes from #92; it would be better to make it independent if possible (I appreciate that this has draft status).
I rebased and updated this PR, think it's ready to be reviewed now.
I had misread what this does: I thought it would return all the URLs within a queue, which would be useful for debugging.
This will stream the entire content of the frontier in pages of e.g. 100 URLs. What do you want to be able to achieve with this? Backups? Debugging?
Should we consider an export endpoint instead? If so, what format would it use to represent all the data, including the metadata? Should the service implementations be responsible for writing the data and communicating the location of the file to the client?
Its main purpose was debugging, but we will also probably use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would mean we would also need an import feature).
Thanks for the explanation @klockla. In a sense, PutURLs is the import feature, but we could have one where a file is made available to the server locally and it could ingest it as a batch. This would be quicker than streaming from the client; I think @michaeldinzinger did something similar with the OpenSearch backend he uses at OpenWebSearch.
Going back to our primary topic, would it be OK to list the URLs per queue only? From a client's perspective you can list the queues separately and get the URLs for each one. This would be equivalent to paging in a sense.
Doing so would make more sense to me, as in most cases you'd want to debug per queue. What do you think?
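To make the per-queue idea concrete, here is a minimal in-memory sketch of the pagination semantics being discussed (a `start` offset, a page `size`, and an optional queue restriction). The names `list_urls`, `start`, `size`, and `key` are illustrative assumptions, not the actual url-frontier API:

```python
# Illustration only: an in-memory model of the proposed listing semantics.
frontier = {
    "queue-a": ["http://a.example/1", "http://a.example/2", "http://a.example/3"],
    "queue-b": ["http://b.example/1"],
}

def list_urls(start=0, size=100, key=None):
    """Return one page of URLs, optionally restricted to queue `key`."""
    if key is not None:
        urls = frontier.get(key, [])
    else:
        # No queue specified: go over all of them.
        urls = [u for queue in frontier.values() for u in queue]
    return urls[start:start + size]
```

Listing the queues first and then calling this per queue would be equivalent to paging over the whole frontier, which is the client-side pattern suggested above.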
What do you think of the following:
We keep the pagination and add the possibility of restricting to a given queue (if none is specified, we will go over all of them).
Parameters would look something like:
```protobuf
message ListUrlParams {
  // position of the first result in the list; defaults to 0
  uint32 start = 1;
  // max number of values; defaults to 100
  uint32 size = 2;
  // ID for the queue
  string key = 3;
  // crawl ID
  string crawlID = 4;
  // only for the current local instance
  bool local = 5;
}
```
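For context, the matching RPC on the service could look something like the sketch below. This is only an illustration: the service name, the `URLItem` response type, and the streaming shape are assumptions based on the discussion, not the final definition from the PR:

```protobuf
service URLFrontier {
  // Hypothetical sketch: stream the URLs known to the frontier,
  // optionally restricted to a single queue via ListUrlParams.key.
  rpc ListURLs (ListUrlParams) returns (stream URLItem) {}
}
```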
yes that would work I think
Updated the PR to include the queue parameter.
Added method ListURLs in API and client to list all URLs known in the frontier with their next fetch date (RocksDB & MemoryFrontier implementations only)