crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Update API definition to follow protobuf conventions and best practices #110

Open jdpedrie opened 2 days ago

jdpedrie commented 2 days ago

Hello,

I'd like to propose that in the next major version of this project, the API definition be updated to follow the protocol buffer conventions established by the AIPs (https://google.aip.dev) and checked by protolint.

Some of the changes I made:

  1. All RPCs have corresponding Request and Response messages. This follows the convention of Google's Cloud APIs, which (nearly) always have dedicated message pairs for RPC requests/responses. It lets each RPC evolve and gain fields independently without breaking changes. For the same reason, I avoided google.protobuf.Empty.
  2. Enum values are prefixed with the enum name, and the zero value is UNSPECIFIED (https://google.aip.dev/126).
  3. Fixed formatting and normalized the comment style.
  4. Replaced UNIX timestamps with google.protobuf.Timestamp (see the Java sketch after this list).
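
To make the last point concrete, here is a rough sketch of how a Java client could populate the proposed BlockQueueUntilRequest (defined in the sample below) with a google.protobuf.Timestamp instead of a raw epoch value. The Timestamps helper comes from protobuf-java-util; the flat import assumes java_multiple_files is enabled for codegen, and the key and crawl id values are just placeholders.

// Hypothetical client-side sketch; generated class names depend on codegen options.
import com.google.protobuf.Timestamp;
import com.google.protobuf.util.Timestamps;
import crawlercommons.urlfrontier.BlockQueueUntilRequest; // assumes java_multiple_files = true
import java.time.Instant;

public class BlockQueueExample {
  public static void main(String[] args) {
    // Block the queue for 60 seconds, e.g. after a server returned Retry-After: 60.
    Timestamp until = Timestamps.fromMillis(Instant.now().plusSeconds(60).toEpochMilli());

    BlockQueueUntilRequest request = BlockQueueUntilRequest.newBuilder()
        .setKey("example.com")   // queue key (placeholder)
        .setTime(until)          // unblock time as google.protobuf.Timestamp
        .setCrawlId("default")   // crawl id (placeholder)
        .build();

    // Timestamps.toString() renders the RFC 3339 form, which is easier to read
    // in logs than a bare UNIX epoch value.
    System.out.println(Timestamps.toString(request.getTime()));
  }
}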

Thanks for providing this API and the reference implementation!

A sample:

/**
 * Licensed to Crawler-Commons under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * DigitalPebble licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

syntax = "proto3";

package urlfrontier;

import "google/protobuf/timestamp.proto";
import "google/protobuf/wrappers.proto";

option java_package = "crawlercommons.urlfrontier";
option go_package = "github.com/crawlercommons/url-frontier/v3";

service URLFrontier {
  // Return the list of nodes forming the cluster the current node belongs to.
  rpc ListNodes(ListNodesRequest) returns (ListNodesResponse);

  // Return the list of crawls handled by the frontier(s).
  rpc ListCrawls(ListCrawlsRequest) returns (ListCrawlsResponse);

  // Delete an entire crawl, returns the number of URLs removed this way.
  rpc DeleteCrawl(DeleteCrawlRequest) returns (DeleteCrawlResponse);

  // Return a list of queues for a specific crawl. Can choose whether to include
  // inactive queues (a queue is active if it has URLs due for fetching);
  // by default the service will return up to 100 results from offset 0 and
  // exclude inactive queues.
  rpc ListQueues(ListQueuesRequest) returns (ListQueuesResponse);

  // Stream URLs due for fetching from M queues with up to N items per queue.
  rpc GetURLs(GetURLsRequest) returns (stream GetURLsResponse);

  // Push URL items to the server; they get created (if they don't already
  // exist) in the case of DiscoveredURLItems or updated if KnownURLItems.
  rpc PutURLs(stream PutURLsRequest) returns (stream PutURLsResponse);

  // Return stats for a specific queue or an entire crawl. Does not aggregate
  // the stats across different crawlids.
  rpc GetStats(GetStatsRequest) returns (GetStatsResponse);

  // Delete the queue based on the key in parameter, returns the number of URLs
  // removed this way.
  rpc DeleteQueue(DeleteQueueRequest) returns (DeleteQueueResponse);

  // Block a queue from sending URLs until after the given timestamp. A
  // timestamp in the past will unblock the queue. The block will get removed
  // once the time indicated in argument is reached. This is useful for cases
  // where a server returns a Retry-After for instance.
  rpc BlockQueueUntil(BlockQueueUntilRequest) returns (BlockQueueUntilResponse);

  // Activate or deactivate the crawl. GetURLs will not return anything until
  // SetActive is set to true. PutURLs will still accept incoming data.
  rpc SetActive(SetActiveRequest) returns (SetActiveResponse);

  // Returns true if the crawl is active, false if it has been deactivated
  // with SetActive.
  rpc GetActive(GetActiveRequest) returns (GetActiveResponse);

  // Set a delay for a given queue. No URLs will be obtained via GetURLs for
  // this queue until the number of seconds specified has elapsed since the last
  // time URLs were retrieved. Usually informed by the delay setting of
  // robots.txt.
  rpc SetDelay(SetDelayRequest) returns (SetDelayResponse);

  // Overrides the log level for a given package.
  rpc SetLogLevel(SetLogLevelRequest) returns (SetLogLevelResponse);

  // Sets crawl limit for domain.
  rpc SetCrawlLimit(SetCrawlLimitRequest) returns (SetCrawlLimitResponse);

  // Get status of a particular URL. This does not take into account URL
  // scheduling. Used to check the current status of a URL within the frontier.
  rpc GetURLStatus(GetURLStatusRequest) returns (GetURLStatusResponse);

  // List all URLs currently in the frontier. This does not take into account
  // URL scheduling. Used to check current status of all URLs within the
  // frontier.
  rpc ListURLs(ListURLsRequest) returns (stream ListURLsResponse);
}

// The request message for ListNodes.
message ListNodesRequest {}

// The response message for ListNodes.
message ListNodesResponse {
  repeated string values = 1;
}

// The request message for ListCrawls.
message ListCrawlsRequest {}

// The response message for ListCrawls.
message ListCrawlsResponse {
  repeated string values = 1;
}

// The request message for DeleteCrawl.
message DeleteCrawlRequest {
  string crawl_id = 1;
  bool local = 2;
}

// The response message for DeleteCrawl.
message DeleteCrawlResponse {
  google.protobuf.Int64Value urls_removed = 1;
}

// The request message for ListQueues.
message ListQueuesRequest {
  uint32 start = 1;
  uint32 size = 2;
  bool include_inactive = 3;
  string crawl_id = 4;
  bool local = 5;
}

// The response message for ListQueues.
message ListQueuesResponse {
  repeated string values = 1;
  uint64 total = 2;
  uint32 start = 3;
  uint32 size = 4;
  string crawl_id = 5;
}

// The request message for GetURLs.
message GetURLsRequest {
  uint32 max_urls_per_queue = 1;
  uint32 max_queues = 2;
  string key = 3;
  uint32 delay_requestable = 4;
  oneof crawl {
    bool any_crawl_id = 5;
    string crawl_id = 6;
  }
}

// The response message for GetURLs.
message GetURLsResponse {
  URLInfo url = 1;
}

// The request message for PutURLs.
message PutURLsRequest {
  URLInfo url = 1;
}

// The response message for PutURLs.
message PutURLsResponse {
  AckMessage ack = 1;
}

// The request message for GetStats.
message GetStatsRequest {
  string key = 1;
  string crawl_id = 2;
  bool local = 3;
}

// The response message for GetStats.
message GetStatsResponse {
  uint64 size = 1;
  uint32 in_process = 2;
  map<string, uint64> counts = 3;
  uint64 number_of_queues = 4;
  string crawl_id = 5;
}

// The request message for DeleteQueue.
message DeleteQueueRequest {
  string key = 1;
  string crawl_id = 2;
  bool local = 3;
}

// The response message for DeleteQueue.
message DeleteQueueResponse {
  google.protobuf.Int64Value urls_removed = 1;
}

// The request message for BlockQueueUntil.
message BlockQueueUntilRequest {
  string key = 1;
  google.protobuf.Timestamp time = 2;
  string crawl_id = 3;
  bool local = 4;
}

// The response message for BlockQueueUntil.
message BlockQueueUntilResponse {
  // empty
}

// The request message for SetActive.
message SetActiveRequest {
  bool state = 1;
  bool local = 2;
}

// The response message for SetActive.
message SetActiveResponse {
  // empty
}

// The request message for GetActive.
message GetActiveRequest {
  bool local = 1;
}

// The response message for GetActive.
message GetActiveResponse {
  google.protobuf.BoolValue state = 1;
}

// The request message for SetDelay.
message SetDelayRequest {
  string key = 1;
  uint32 delay_requestable = 2;
  string crawl_id = 3;
  bool local = 4;
}

// The response message for SetDelay.
message SetDelayResponse {
  // empty
}

// The request message for SetLogLevel.
message SetLogLevelRequest {
  string package = 1;
  Level level = 2;
  bool local = 3;

  enum Level {
    LEVEL_UNSPECIFIED = 0;
    LEVEL_TRACE = 1;
    LEVEL_DEBUG = 2;
    LEVEL_INFO = 3;
    LEVEL_WARN = 4;
    LEVEL_ERROR = 5;
  }
}

// The response message for SetLogLevel.
message SetLogLevelResponse {
  // empty
}

// The request message for SetCrawlLimit.
message SetCrawlLimitRequest {
  string key = 1;
  uint32 limit = 2;
  string crawl_id = 3;
}

// The response message for SetCrawlLimit.
message SetCrawlLimitResponse {
  // empty
}

// The request message for GetURLStatus.
message GetURLStatusRequest {
  string url = 1;
  string key = 2;
  string crawl_id = 3;
}

// The response message for GetURLStatus.
message GetURLStatusResponse {
  URLItem url = 1;
}

// The request message for ListURLs.
message ListURLsRequest {
  uint32 start = 1;
  uint32 size = 2;
  string key = 3;
  string crawl_id = 4;
  bool local = 5;
}

// The response message for ListURLs.
message ListURLsResponse {
  URLItem url = 1;
}

message URLItem {
  oneof item {
    DiscoveredURLItem discovered = 1;
    KnownURLItem known = 2;
  }
  string id = 3;
}

message AckMessage {
  string id = 1;
  Status status = 2;

  enum Status {
    STATUS_UNSPECIFIED = 0;
    STATUS_OK = 1;
    STATUS_SKIPPED = 2;
    STATUS_FAIL = 3;
  }
}

message URLInfo {
  string url = 1;
  string key = 2;
  map<string, StringList> metadata = 3;
  string crawl_id = 4;
}

message KnownURLItem {
  URLInfo info = 1;
  google.protobuf.Timestamp refetchable_from_date = 2;
}

message DiscoveredURLItem {
  URLInfo info = 1;
}

message StringList {
  repeated string values = 1;
}
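
For context, here is a rough sketch of how a crawler loop might consume the server-streaming GetURLs RPC with grpc-java. URLFrontierGrpc is the stub class grpc-java would generate for the service above; the host, port and crawl id are placeholder assumptions, and the imports again assume java_multiple_files.

import crawlercommons.urlfrontier.GetURLsRequest;   // assumes java_multiple_files = true
import crawlercommons.urlfrontier.GetURLsResponse;
import crawlercommons.urlfrontier.URLFrontierGrpc;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Iterator;

public class GetUrlsExample {
  public static void main(String[] args) {
    // Placeholder endpoint; adjust host/port to wherever the frontier runs.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 7071).usePlaintext().build();

    // Blocking stub generated by grpc-java for the URLFrontier service.
    URLFrontierGrpc.URLFrontierBlockingStub stub = URLFrontierGrpc.newBlockingStub(channel);

    GetURLsRequest request = GetURLsRequest.newBuilder()
        .setMaxUrlsPerQueue(10)   // up to N items per queue
        .setMaxQueues(5)          // from up to M queues
        .setCrawlId("default")    // placeholder crawl id
        .build();

    // GetURLs is server-streaming, so the blocking stub returns an iterator
    // of GetURLsResponse messages, each carrying one URLInfo.
    Iterator<GetURLsResponse> responses = stub.getURLs(request);
    while (responses.hasNext()) {
      System.out.println(responses.next().getUrl().getUrl());
    }

    channel.shutdownNow();
  }
}
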
jnioche commented 1 day ago

Thanks @jdpedrie, definitely worth considering. Would you be interested in submitting a PR? This would make it easier to compare with what we currently have. BTW, do you use URLFrontier? Would be great to hear about it if possible.

jdpedrie commented 1 day ago

Ah, that's a good idea. Did that in #111.

I'm currently evaluating URLFrontier for use in a new project. We've used StormCrawler in production for a number of years, but we're still on a version that predates this project.

Is the reference implementation suitable for production use?

jnioche commented 22 hours ago

I'm currently evaluating URLFrontier for use in a new project. We've used StormCrawler in production for a number of years, but we're still on a version that predates this project.

Great to hear you use StormCrawler! (even more curious about what you use it for, scale etc.). What backend do you currently have with it?

Is the reference implementation suitable for production use?

I haven't used it in production. I know that OpenWebSearch uses URLFrontier but with a different implementation. @klockla uses the RocksDB one. Not sure about @zaibacu

jdpedrie commented 14 hours ago

We use it to power our news search feature in Freespoke. It's backed by Elasticsearch.