apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Cannot retrieve the correct task list from the Overlord Task API with the createdTimeInterval parameter #11589

Open humit0 opened 3 years ago

humit0 commented 3 years ago

Affected Version

0.21.1

Description

[docs](https://druid.apache.org/docs/0.21.1/operations/api-reference.html#get-14)

According to the documentation, the /druid/indexer/v1/tasks API accepts a createdTimeInterval parameter.

If I call the API without the createdTimeInterval parameter, I get 1 task whose createdTime is 2021-08-12T23:51:23.151Z.

curl -X GET "http://{OVERLORD_HOST}:8090/druid/indexer/v1/tasks?datasource=new-data-source"
[
    {
        "id": "index_parallel_new-data-source_mfchdpon_2021-08-12T23:51:23.142Z",
        "groupId": "index_parallel_new-data-source_mfchdpon_2021-08-12T23:51:23.142Z",
        "type": "index_parallel",
        "createdTime": "2021-08-12T23:51:23.151Z",
        "queueInsertionTime": "1970-01-01T00:00:00.000Z",
        "statusCode": "SUCCESS",
        "status": "SUCCESS",
        "runnerStatusCode": "NONE",
        "duration": 25546,
        "location": {
            "host": "{MIDDLE_MANAGER_HOST}",
            "port": 8100,
            "tlsPort": -1
        },
        "dataSource": "new-data-source",
        "errorMsg": null
    }
]

The createdTime of this task is 2021-08-12T23:51:23.151Z, so the createdTimeInterval "2021-08-12T23:50:00.000Z_2021-08-13T00:00:00.000Z" should include it. However, the API returns an empty list.

curl -X GET "http://{OVERLORD_HOST}:8090/druid/indexer/v1/tasks?datasource=new-data-source&createdTimeInterval=2021-08-12T23:50:00.000Z_2021-08-13T00:00:00.000Z"
[]

However, when I specify the createdTimeInterval "2021-01-01T00:00:00.000Z_2021-01-02T00:00:10.000Z", which does not contain the task's createdTime, the API returns the task whose createdTime is 2021-08-12T23:51:23.151Z.

curl -X GET "http://{OVERLORD_HOST}:8090/druid/indexer/v1/tasks?datasource=new-data-source&createdTimeInterval=2021-01-01T00:00:00.000Z_2021-01-02T00:00:10.000Z"
[
    {
        "id": "index_parallel_new-data-source_mfchdpon_2021-08-12T23:51:23.142Z",
        "groupId": "index_parallel_new-data-source_mfchdpon_2021-08-12T23:51:23.142Z",
        "type": "index_parallel",
        "createdTime": "2021-08-12T23:51:23.151Z",
        "queueInsertionTime": "1970-01-01T00:00:00.000Z",
        "statusCode": "SUCCESS",
        "status": "SUCCESS",
        "runnerStatusCode": "NONE",
        "duration": 25546,
        "location": {
            "host": "{MIDDLE_MANAGER_HOST}",
            "port": 8100,
            "tlsPort": -1
        },
        "dataSource": "new-data-source",
        "errorMsg": null
    }
]

Code investigation

When the task API is called with the createdTimeInterval parameter, the code below executes. It converts the time interval into a duration, which discards the interval's start and end instants. https://github.com/apache/druid/blob/druid-0.21.1/indexing-service/src/main/java/org/apache/druid/indexing/overlord/http/OverlordResource.java#L616

      Duration createdTimeDuration = null;
      if (createdTimeInterval != null) {
        final Interval theInterval = Intervals.of(StringUtils.replace(createdTimeInterval, "_", "/"));
        createdTimeDuration = theInterval.toDuration();
      }
      final List<TaskInfo<Task, TaskStatus>> taskInfoList =
          taskStorageQueryAdapter.getCompletedTaskInfoByCreatedTimeDuration(maxCompletedTasks, createdTimeDuration, dataSource);

getCompletedTaskInfoByCreatedTimeDuration then calls the getRecentlyCreatedAlreadyFinishedTaskInfo method.

https://github.com/apache/druid/blob/druid-0.21.1/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskStorageQueryAdapter.java#L61

  public List<TaskInfo<Task, TaskStatus>> getCompletedTaskInfoByCreatedTimeDuration(
      @Nullable Integer maxTaskStatuses,
      @Nullable Duration duration,
      @Nullable String dataSource
  )
  {
    return storage.getRecentlyCreatedAlreadyFinishedTaskInfo(maxTaskStatuses, duration, dataSource);
  }

The getRecentlyCreatedAlreadyFinishedTaskInfo method returns a copy of the list of completed tasks whose createdTime falls between (now - duration) and now. https://github.com/apache/druid/blob/druid-0.21.1/indexing-service/src/main/java/org/apache/druid/indexing/overlord/MetadataTaskStorage.java#L223

  @Override
  public List<TaskInfo<Task, TaskStatus>> getRecentlyCreatedAlreadyFinishedTaskInfo(
      @Nullable Integer maxTaskStatuses,
      @Nullable Duration durationBeforeNow,
      @Nullable String datasource
  )
  {
    return ImmutableList.copyOf(
        handler.getCompletedTaskInfo(
            DateTimes.nowUtc()
                     .minus(durationBeforeNow == null ? config.getRecentlyFinishedThreshold() : durationBeforeNow),
            maxTaskStatuses,
            datasource
        )
    );
  }
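
The two observations above follow from this cutoff logic. Here is a minimal sketch that reproduces them with java.time instead of Druid's Joda-Time types. The class name, the method name, and the value of "now" (2021-08-13T01:00:00Z) are assumptions made for illustration, and the cutoff is treated as an inclusive lower bound.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical demo class; not part of Druid.
final class DurationCutoffDemo {
    // Mimics the 0.21.1 behavior: only the length of the interval is used,
    // and the lower bound is computed relative to the current time.
    static boolean matchesDurationBeforeNow(Instant createdTime, Instant start, Instant end, Instant now) {
        Duration duration = Duration.between(start, end);
        Instant cutoff = now.minus(duration);
        // Assumption: tasks created at or after the cutoff are returned.
        return !createdTime.isBefore(cutoff);
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2021-08-12T23:51:23.151Z");
        Instant now = Instant.parse("2021-08-13T01:00:00Z"); // assumed request time

        // 10-minute interval that contains createdTime: cutoff is now - 10 minutes.
        System.out.println(matchesDurationBeforeNow(
            created,
            Instant.parse("2021-08-12T23:50:00.000Z"),
            Instant.parse("2021-08-13T00:00:00.000Z"),
            now));

        // Interval in January lasting 1 day and 10 seconds: cutoff is now - (1 day + 10 s).
        System.out.println(matchesDurationBeforeNow(
            created,
            Instant.parse("2021-01-01T00:00:00.000Z"),
            Instant.parse("2021-01-02T00:00:10.000Z"),
            now));
    }
}
```

Under the assumed request time, the first call yields false (the task is older than 10 minutes) and the second yields true (the task is younger than 1 day and 10 seconds), which matches the empty and non-empty responses shown earlier.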

So I suggest adding a getCompletedTaskInfoByCreatedTimeInterval method to the TaskStorageQueryAdapter class that takes maxTaskStatuses, interval, and dataSource as arguments.
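
As a rough illustration of the interval semantics proposed here, the sketch below checks a createdTime against both ends of the interval. It is not Druid code: the class and method names are hypothetical, it uses java.time rather than Joda-Time, and it assumes a half-open interval (start inclusive, end exclusive).

```java
import java.time.Instant;

// Hypothetical helper; a real fix would live in the Overlord storage layer.
final class CreatedTimeIntervalFilter {
    // Parses "start_end" (the format used by the createdTimeInterval parameter)
    // into a two-element array {start, end}.
    static Instant[] parseInterval(String createdTimeInterval) {
        String[] parts = createdTimeInterval.split("_", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected start_end, got: " + createdTimeInterval);
        }
        Instant start = Instant.parse(parts[0]);
        Instant end = Instant.parse(parts[1]);
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("Interval end is before start: " + createdTimeInterval);
        }
        return new Instant[] {start, end};
    }

    // True when start <= createdTime < end.
    static boolean contains(String createdTimeInterval, Instant createdTime) {
        Instant[] bounds = parseInterval(createdTimeInterval);
        return !createdTime.isBefore(bounds[0]) && createdTime.isBefore(bounds[1]);
    }
}
```

With this check, the interval 2021-08-12T23:50:00.000Z_2021-08-13T00:00:00.000Z would include the task from this report, and the January interval would not, regardless of when the request is made.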

FrankChen021 commented 3 years ago

The original PR (#5801) that introduced the 'createdTimeInterval' parameter treated the interval as a duration.

From the user's side, I think 'interval' is the right semantics rather than 'duration', because it would help us find the right tasks in a given time range.

cc @jihoonson