IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

CSM BDS python scripts don't work with --starttime or --endtime #992

Open thanh-lam opened 3 years ago

thanh-lam commented 3 years ago

Describe the bug: Regardless of the date entered with --starttime or --endtime, a script queries and returns all job or allocation records. The script findUserJobs.py is used here as an example; all other scripts that support the --starttime and --endtime options have the same issue.

To Reproduce: Steps to reproduce the behavior:

  1. Go to /opt/ibm/csm/bigdata/python/
  2. Run one of the scripts like ./findUserJobs.py --starttime 2021-01-12 to search for jobs started on that date.
  3. Observe that the output lists all jobs, including jobs with a different start time.
  4. No error is reported.

Expected behavior: These scripts should return only the jobs within the time range specified with --starttime and/or --endtime.

Screenshots

[/opt/ibm/csm/bigdata/python]> export CAST_ELASTIC="...:9200"
[/opt/ibm/csm/bigdata/python]> ./findUserJobs.py -u tlam --starttime 2021-01-12
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
    failed |    12 |      112 | 0        | 2020-09-23 15:33:19.868194 | 2020-09-23 15:34:16.817299
    failed |    13 |      113 | 0        | 2020-09-23 15:40:02.624777 | 2020-09-23 15:40:02.845004
    failed |    14 |      114 | 0        | 2020-09-23 15:44:55.221521 | 2020-09-23 15:44:55.431779
...
  complete |    96 |      195 | 0        | 2020-09-28 11:43:35.973273 | 2020-09-28 11:44:28.118919
  complete |    97 |      196 | 0        | 2020-09-28 12:42:20.720108 | 2020-09-28 12:42:21.852965
  complete |    98 |      197 | 0        | 2020-09-28 12:43:17.145406 | 2020-09-28 12:43:21.830506
  complete |    99 |      198 | 0        | 2020-09-28 12:44:37.372068 | 2020-09-28 12:44:38.463995
    failed |   100 |      299 | 0        | 2020-10-08 09:52:26.881603 | 2020-10-08 09:54:54.950164

I added some debug prints in the script to show the time range it builds from the --starttime input.

Target list:  [{'range': {'data.begin_time': {'format': 'epoch_millis', 'lte': '1610487082000'}}}, {'range': {'data.history.end_time': {'format': 'epoch_millis', 'gte': '1610427600000'}}}, {'bool': {'must_not': {'exists': {'field': 'data.history.end_time'}}}}] Match min:  2

And here is the query the script sends to elasticsearch:

Query body:  {'query': {'bool': {'must': [{'match': {'data.user_name': 'tlam'}}], 'should': [{'range': {'data.begin_time': {'format': 'epoch_millis', 'lte': '1610487082000'}}}, {'range': {'data.history.end_time': {'format': 'epoch_millis', 'gte': '1610427600000'}}}, {'bool': {'must_not': {'exists': {'field': 'data.history.end_time'}}}}], 'minimum_should_match': 2}}, '_source': ['data.primary_job_id', 'data.secondary_job_id', 'data.allocation_id', 'data.user_name', 'data.begin_time', 'data.history.end_time', 'data.state'], 'size': 1000}

The time ranges and the query look correct. However, the problem is with the two fields data.begin_time and data.history.end_time, which are mapped as type "text" (string) rather than "date". For example:

GET /cast-allocation/_mapping/field/data.begin_time
{
  "cast-allocation" : {
    "mappings" : {
      "data.begin_time" : {
        "full_name" : "data.begin_time",
        "mapping" : {
          "begin_time" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}
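
For reference, the same mapping can be fetched from Python with the elasticsearch-py client the scripts already use (a sketch; the host is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # the CAST_ELASTIC host in practice
# Mirrors the GET /cast-allocation/_mapping/field/data.begin_time above.
print(es.indices.get_field_mapping(fields="data.begin_time", index="cast-allocation"))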

Issue Source: CSM regression tests

thanh-lam commented 3 years ago

As a test, I changed the query to replace data.begin_time and data.history.end_time with @timestamp, which is a field of type date. This proved that the time-range search works on a date field but not on a text field.

Query body:  {'query': {'bool': {'must': [{'match': {'data.user_name': 'tlam'}}], 'should': [{'range': {'@timestamp': {'format': 'epoch_millis', 'lte': '1610487082000'}}}, {'range': {'@timestamp': {'format': 'epoch_millis', 'gte': '1610427600000'}}}, {'bool': {'must_not': {'exists': {'field': 'data.history.end_time'}}}}], 'minimum_should_match': 2}}, '_source': ['data.primary_job_id', 'data.secondary_job_id', 'data.allocation_id', 'data.user_name', 'data.begin_time', 'data.history.end_time', 'data.state'], 'size': 1000}

However, @timestamp is not the same thing as data.begin_time and data.history.end_time, so it cannot simply be substituted.

A possible workaround is to create a new index with the same fields as cast-allocation, except that the two fields data.begin_time and data.history.end_time have type date. An elasticsearch reindex can then copy the data from cast-allocation into the new index (a sketch follows).
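
A minimal sketch of that workaround, assuming the elasticsearch-py client, 7.x-style typeless mappings, and a hypothetical target index name cast-allocation-test (not the actual fix):

# Sketch only: create a copy of cast-allocation whose two time fields are
# mapped as "date", then reindex into it.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])  # the CAST_ELASTIC host in practice

# Target index: identical documents, but the two time fields are type "date".
es.indices.create(index="cast-allocation-test", body={
    "mappings": {
        "properties": {
            "data": {
                "properties": {
                    "begin_time": {"type": "date"},
                    "history": {"properties": {"end_time": {"type": "date"}}}
                }
            }
        }
    }
})

# Copy all documents from the old index; the same parse failures shown
# below would surface here too, since the source timestamps use a space.
helpers.reindex(es, source_index="cast-allocation", target_index="cast-allocation-test")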

But the new index has a different name. The current cast-allocation index would therefore have to be deleted and recreated with the corrected mapping, and the data then copied back from the temporary index into the new cast-allocation index. That takes some work, but it is possible.

Nevertheless, the reindex process also hit a problem copying the text strings into the date-typed fields:

POST _reindex
{
  "source": {
    "index": "cast-allocation"
  },
  "dest": {
    "index": "cast-allocation-test"
  }
}

That caused these exceptions:

{
  "took": 273,
  "timed_out": false,
  "total": 252,
  "updated": 0,
  "created": 0,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "cast-allocation-test",
      "type": "_doc",
      "id": "193",
      "cause": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [data.begin_time] of type [date] in document with id '193'. Preview of field's value: '2020-11-02 13:04:22.413672'",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "failed to parse date field [2020-11-02 13:04:22.413672] with format [YYYY-MM-dd HH:mm:ss.000000]",
          "caused_by": {
            "type": "date_time_parse_exception",
            "reason": "Text '2020-11-02 13:04:22.413672' could not be parsed at index 20"
          }
        }
      },
      "status": 400
    },
...

Pending further investigation.

thanh-lam commented 3 years ago

For debugging purposes, we deleted the cast-allocation index in elasticsearch. Then a new cast-allocation index was created in which the two fields (data.begin_time, data.history.end_time) have type "date".

With this new index, the incoming data also needs the correct date format. The data source for cast-allocation is the CSM transaction log, which can be manually edited to put the two fields in the correct format. For example:

{"type":"allocation","traceid":531084125,"uid":1,"timestamp":"2020-09-17T16:06:01Z","data":{"allocation_id":1,"begin_time":"2020-09-17 12:06:01.183794","primary_job_id":526,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":1,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm","job_type":"batch","user_name":"root","user_id":0,"user_group_id":0,"user_script":"1 -q normal sleep 3333333333","account":"","comment":"","job_name":"","job_submit_time":"2020-09-17 12:06:57","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":0,"compute_nodes":["f10n16"]} }
{"type":"allocation","traceid":231640295,"uid":2,"timestamp":"2020-09-17T16:06:02Z","data":{"allocation_id":2,"begin_time":"2020-09-17 12:06:02.483204","primary_job_id":527,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":1,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm","job_type":"batch","user_name":"root","user_id":0,"user_group_id":0,"user_script":"1 -q normal sleep 3333333333","account":"","comment":"","job_name":"","job_submit_time":"2020-09-17 12:06:59","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":0,"compute_nodes":["f10n15"]} }
{"type":"allocation","traceid":231640295,"uid":2,"timestamp":"2020-09-17T16:06:55Z","data":{"running-start-timestamp":"2020-09-17 12:06:55.93243"}}

The letter T has to be added in place of the space between the date and the time:

"begin_time":"2020-09-17 12:06:01.183794" ==> "begin_time":"2020-09-17T12:06:01.183794"

As a result, the following simple script shows that the time-range query works:

=================================
$ cat timezone-query.sh
curl -X GET "XX.X.X.XX:9200/cast-allocation/_search?pretty" -H 'Content-Type: application/json' -d'
{
   "query" : {
      "range" : {
         "data.begin_time" : {
         "time_zone": "-05:00",
         "gte" : "2021-02-05T14:11:28.578Z",
         "lte" : "now",
         "relation": "within"
         }
      }
   }
}
'
======================================
thanh-lam commented 3 years ago

Another script that mimics how the python scripts query for "--starttime" and "--endtime" also works.

$ cat startend-epochmil-query.sh
curl -X GET "XX.X.X.XX:9200/cast-allocation/_search?pretty" -H 'Content-Type: application/json' -d'
{"query":
   {"bool":
      {"must":
         [{"match":
             {"data.user_name":"wcmorris"}
         }],
       "should":
         [{"range":
             {"data.begin_time":
                 {"format":"epoch_millis",
                  "lte":1612562710000
                 }
             }
          },
          {"range":
             {"data.history.end_time":
                 {"format":"epoch_millis",
                  "gte":1612450800000
                 }
             }
          },
          {"bool":
             {"must_not":
                 {"exists":
                    {"field":"data.history.end_time"}
                 }
             }
          }],
       "minimum_should_match":2
      }
   },
   "_source":
      ["data.primary_job_id",
       "data.secondary_job_id",
       "data.allocation_id",
       "data.user_name",
       "data.begin_time",
       "data.history.end_time",
       "data.state"
      ],
   "size":1000
}
'
thanh-lam commented 3 years ago

A note on how the python scripts query with "--starttime" and "--endtime", as in the following example:

$ ./findUserJobs.py -u <user> --starttime A --endtime B

A and B need the same date-and-time format as the transaction log, which turns out to be the elasticsearch date format "strict_date_optional_time_nanos". A and B form the two edges of a window, and all allocations within that window are returned.

                  starttime               endtime
begin_time:           |<------------------------B
end_time:              A------------------------>|

The recommendation is: always input both A and B. Leaving B out results in the following query:

                  starttime                endtime
begin_time:           |<------------------------"now"
end_time:              A------------------------>|

In other words, the query will return all allocations from time A to "now". It is important to note that there is no left-edge limit on begin_time and no right-edge limit on end_time. The following two conditions are combined with the "AND" operator: any allocation with end_time >= A is returned regardless of its begin_time, and any allocation with begin_time <= B is returned regardless of its end_time.
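
Restated as plain Python (a hypothetical predicate, not code from cast_helper.py), the window logic is:

# An allocation matches when BOTH conditions hold; a still-running
# allocation (no end_time yet) is kept as long as it began before B.
def in_window(begin_time, end_time, A, B):
    began_before_B = begin_time <= B                   # no lower bound on begin_time
    ended_after_A = end_time is None or end_time >= A  # no upper bound on end_time
    return began_before_B and ended_after_A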

thanh-lam commented 3 years ago

We need to find out where or how the allocation records are written into the transaction log so that the data for "begin_time" and "history.end_time" can be changed to the correct date format, as discussed earlier.

The CSM transaction log is the data source from which filebeat retrieves the allocation records that are shipped to the elasticsearch indexes.

thanh-lam commented 3 years ago

Bill and I did some digging in the sources, and it seems two programs write allocation records to the transaction log.

We just have to replace the ' ' with 'T' in allocation->begin_time.

Another observation: timestamp is also part of the allocation record, and it already has the correct date format (with the T). Why was allocation->begin_time not written that way? The difference in the current mapping is that timestamp has type date while begin_time has type string.

thanh-lam commented 3 years ago

I found a way to load a new index mapping for cast-allocation on the fly. With the manually modified date format for begin_time and end_time, the python scripts seem to work, at least with dates. For example, the following command returned two allocation records in January:

[root@c685f4n07 python]# ./findUserJobs.py -u tlam --starttime "2021-01-06" --endtime "2021-01-12"
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |   251 |     1038 | 0        | 2021-01-06T14:36:49.244459 | 2021-01-06T14:37:54.391456
  complete |   252 |     1039 | 0        | 2021-01-11T11:27:11.463397 | 2021-01-11T11:27:31.771474

Moving the starttime later excluded the first record and returned only the second:

[root@c685f4n07 python]# ./findUserJobs.py -u tlam --starttime "2021-01-11" --endtime "2021-01-12"
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |   252 |     1039 | 0        | 2021-01-11T11:27:11.463397 | 2021-01-11T11:27:31.771474
thanh-lam commented 3 years ago

Searching for the CSM code that writes allocation records to the transaction log is a daunting task that requires understanding the design and structure of CSM sources.

It also took quite a lot of searching, reading, and trial and error with the elasticsearch online docs. But I finally made a breakthrough with a fix: adding a mutate filter to the logstash config file:

/etc/logstash/conf.d/logstash.conf

The mutate block should be added inside the filter section (new lines marked with ">>>"):

else if "transaction" in [tags] {
    mutate { remove_field => [ "beat", "host", "source", "offset", "prospector"] }

### Use logstash gsub filter plugin to process input data before sending to elastic index.
### For the two fields: data.begin_time and data.history.end_time, the blank is replaced
### with a letter T. Note: these fields have type "date" that requires the valid format.
### Also note: No change to csm_transaction.log with this process.

>>>    mutate {
>>>        gsub => [
>>>           "[data][begin_time]", " ", "T",
>>>           "[data][history][end_time]", " ", "T"
>>>        ]
>>>    }

}
else if "ras" in [tags] and "csm" in [tags] {
    mutate{
        gsub => [ "time_stamp", ".(\d{3})\d{3}", ".\1" ]
        add_field => { "type" => "log-ras"  }
    }

    date {
        match => ["time_stamp", "ISO8601","YYYY-MM-dd HH:mm:ss.SSS" ]
        target => "time_stamp"
    }
}

This is definitely a better solution than having CSM write the new date format into the transaction log: it requires no changes to the CSM sources or log files.

thanh-lam commented 3 years ago

For verification, after inserting the new mutate filter into logstash.conf and restarting logstash, create an allocation. Then check transaction.log.0 for the new allocation record and its begin_time and history.end_time date formats:

{"type":"allocation","traceid":151499658,"uid":272,"timestamp":"2021-03-02T21:35:58Z","data":{"allocation_id":272,"begin_time":"2021-03-02 16:35:58.805708","primary_job_id":1361,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":2,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm-cgroup-step","job_type":"batch","user_name":"tlam","user_id":11745,"user_group_id":100,"user_script":"user script not known by LSF","account":"","comment":"","job_name":"","job_submit_time":"2021-03-02 16:34:47","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":2,"compute_nodes":["f10n16","f10n15"]} }
{"type":"allocation","traceid":151499658,"uid":272,"timestamp":"2021-03-02T21:35:59Z","data":{"running-start-timestamp":"2021-03-02 16:35:58.990888"}}
{"type":"allocation","traceid":841396274,"uid":272,"timestamp":"2021-03-02T21:36:18Z","data":{"state":"complete","history":{"end_time":"2021-03-02 16:36:18.830969"},"running-end-timestamp":"2021-03-02 16:36:18.830969"}}

Notice that the date+time format in the log remains the same, with a blank between date and time. However, a python script that searches for user jobs displays the date+time with the letter T between date and time, which means the fix wrote the correct format into the elasticsearch index cast-allocation:

     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |   267 |     1356 | 0        | 2021-03-02T16:30:46.674554 | 2021-03-02T16:31:06.911078
  complete |   268 |     1357 | 0        | 2021-03-02T16:34:39.823075 | 2021-03-02T16:35:00.121053
  complete |   269 |     1358 | 0        | 2021-03-02T16:35:01.72255  | 2021-03-02T16:35:18.721014
  complete |   270 |     1359 | 0        | 2021-03-02T16:35:20.41316  | 2021-03-02T16:35:40.431049
  complete |   271 |     1360 | 0        | 2021-03-02T16:35:41.092173 | 2021-03-02T16:35:58.160964
  complete |   272 |     1361 | 0        | 2021-03-02T16:35:58.805708 | 2021-03-02T16:36:18.830969

Before the fix, those date+time values would have had a blank in place of the T.

thanh-lam commented 3 years ago

As discovered in a comment earlier:

A and B need the same date-and-time format as the transaction log, which turns out to be the elasticsearch date format "strict_date_optional_time_nanos".

The date+time values in csm_transaction.log.0 carry sub-second (nanos-range) precision. That is one of the reasons why time-range queries using epoch_millis don't work; the script findUserJobs.py is one example.

I found the time format defined in cast_helper.py. Changing it to the nanos format makes findUserJobs.py work with time-range queries.

[root@c650f01p22 python]# grep TIME_SEARCH *.py
#cast_helper.py:#TIME_SEARCH_FORMAT = 'yyyy-MM-dd HH:mm:ss'
#cast_helper.py:#TIME_SEARCH_FORMAT = "epoch_millis"
cast_helper.py:TIME_SEARCH_FORMAT = "strict_date_optional_time_nanos"

This change eliminates the calls to convert_timestamp(), which converted times to epoch_millis; those calls were in build_time_range() and build_timestamp_range().

[root@c650fXXpXX python]# grep convert_time *.py
cast_helper.py:def convert_timestamp( timestamp ):
cast_helper.py:    #start_time = convert_timestamp(start_time)
cast_helper.py:    #end_time = convert_timestamp(end_time)
cast_helper.py:    #start_time = convert_timestamp(start_time)
cast_helper.py:    #end_time   = convert_timestamp(end_time)
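
Based on the query bodies shown earlier, the effect of the change is roughly the following (a sketch, not the literal cast_helper.py code):

TIME_SEARCH_FORMAT = "strict_date_optional_time_nanos"

def build_range(field, gte=None, lte=None):
    # The user's ISO-style timestamp is passed through unchanged instead
    # of being converted to epoch_millis first.
    body = {"format": TIME_SEARCH_FORMAT}
    if gte is not None:
        body["gte"] = gte   # e.g. "2021-03-11T10:02:11.000000"
    if lte is not None:
        body["lte"] = lte
    return {"range": {field: body}}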

Before the change:

[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-01
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |     1 |        1 | 0        | 2021-03-11T10:01:20.626411 | 2021-03-11T10:01:25.266256
  complete |     2 |        1 | 0        | 2021-03-11T10:01:26.803619 | 2021-03-11T10:01:29.501622
  complete |     3 |        1 | 0        | 2021-03-11T10:02:11.384713 | 2021-03-11T10:02:12.296012

[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-11T10:02:11.000000 --endtime 2021-03-11T10:02:12.999999
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time 

After the change to use nanos:

[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-11T10:02:11.000000 --endtime 2021-03-11T10:02:12.999999
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |     3 |        1 | 0        | 2021-03-11T10:02:11.384713 | 2021-03-11T10:02:12.296012
thanh-lam commented 3 years ago

@williammorrison2 found another script that has a problem with the time format ("epoch_millis" vs. nanos).

[root@c650fXXpXX python]# ./findJobsRunning.py -T 2021-03-01
A RequestError was recieved:
  type: search_phase_execution_exception
  reason: all shards failed
  root_0_type: parse_exception
  root_0_reason: failed to parse date field [1614574800] with format [strict_date_optional_time_nanos]: [failed to parse date field [1614574800] with format [strict_date_optional_time_nanos]]

But this script builds its time string differently, not via convert_timestamp().

date_print_format='%s'

That should be changed to the following format to be compatible with nanos:

    # Parse the user's date.
    date_format='(\d{4})-(\d{1,2})-(\d{1,2})[ \.T]*(\d{0,2}):{0,1}(\d{0,2}):{0,1}(\d{0,2})'
    date_print_format='%Y-%m-%dT%H:%M:%S'
    #date_print_format='%s'
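
A small self-contained illustration of the difference (mirroring the snippet above, not the full script): '%s' emits epoch seconds, which the error above shows strict_date_optional_time_nanos rejecting, while '%Y-%m-%dT%H:%M:%S' emits an ISO-style string it accepts.

import re
from datetime import datetime

date_format = r'(\d{4})-(\d{1,2})-(\d{1,2})[ \.T]*(\d{0,2}):{0,1}(\d{0,2}):{0,1}(\d{0,2})'
date_print_format = '%Y-%m-%dT%H:%M:%S'

# Parse "2021-03-01" the same way the snippet does; missing pieces become 0.
match = re.search(date_format, "2021-03-01")
parts = [int(g) if g else 0 for g in match.groups()]
print(datetime(*parts).strftime(date_print_format))  # -> 2021-03-01T00:00:00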