thanh-lam opened this issue 3 years ago
A change was made in the query to replace data.begin_time and data.history.end_time with @timestamp, which is a field of type date. This proved that the time range search works on a date field but not on a text field.
Query body:
{'query': {'bool': {'must': [{'match': {'data.user_name': 'tlam'}}],
                    'should': [{'range': {'@timestamp': {'format': 'epoch_millis', 'lte': '1610487082000'}}},
                               {'range': {'@timestamp': {'format': 'epoch_millis', 'gte': '1610427600000'}}},
                               {'bool': {'must_not': {'exists': {'field': 'data.history.end_time'}}}}],
                    'minimum_should_match': 2}},
 '_source': ['data.primary_job_id', 'data.secondary_job_id', 'data.allocation_id', 'data.user_name',
             'data.begin_time', 'data.history.end_time', 'data.state'],
 'size': 1000}
However, @timestamp is not exactly data.begin_time and data.history.end_time.
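One quick way to confirm how the two fields are actually mapped is to pull the index mapping. The sketch below uses the elasticsearch-py client (as the CSM python scripts do); the host is a placeholder and the response path assumes a 7.x-style typeless mapping:
from elasticsearch import Elasticsearch

es = Elasticsearch(["XX.X.X.XX:9200"])

# Fetch the mapping and look at the two problem fields.
mapping = es.indices.get_mapping(index="cast-allocation")
data_props = mapping["cast-allocation"]["mappings"]["properties"]["data"]["properties"]
print(data_props["begin_time"])                         # expected: a text/keyword mapping, not date
print(data_props["history"]["properties"]["end_time"])  # expected: a text/keyword mapping, not date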
A possible workaround for the problem is to create a new index that has the same fields as cast-allocation, only with the two fields data.begin_time and data.history.end_time having type date. Then an elasticsearch reindex can be run to copy the data from cast-allocation to the new index. But the new index has a different name, so the current cast-allocation index may have to be deleted and recreated, after which the data can be copied from the new index back into the recreated cast-allocation index. That will take some work, but it is possible.
Nevertheless, the reindex process also hit a problem with copying the text string into the date type.
POST _reindex
{
"source": {
"index": "cast-allocation"
},
"dest": {
"index": "cast-allocation-test"
}
}
That caused exceptions:
{
"took": 273,
"timed_out": false,
"total": 252,
"updated": 0,
"created": 0,
"deleted": 0,
"batches": 1,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1,
"throttled_until_millis": 0,
"failures": [
{
"index": "cast-allocation-test",
"type": "_doc",
"id": "193",
"cause": {
"type": "mapper_parsing_exception",
"reason": "failed to parse field [data.begin_time] of type [date] in document with id '193'. Preview of field's value: '2020-11-02 13:04:22.413672'",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "failed to parse date field [2020-11-02 13:04:22.413672] with format [YYYY-MM-dd HH:mm:ss.000000]",
"caused_by": {
"type": "date_time_parse_exception",
"reason": "Text '2020-11-02 13:04:22.413672' could not be parsed at index 20"
}
}
},
"status": 400
},
...
Pending further investigation.
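For the record, a reindex could also rewrite the two fields on the fly with a Painless script. This is only a sketch of an alternative that was not pursued here, again assuming the elasticsearch-py client:
from elasticsearch import Elasticsearch

es = Elasticsearch(["XX.X.X.XX:9200"])

# Copy cast-allocation into cast-allocation-test, replacing the blank between
# date and time with 'T' so the date mapping can parse the values.
es.reindex(body={
    "source": {"index": "cast-allocation"},
    "dest": {"index": "cast-allocation-test"},
    "script": {
        "lang": "painless",
        "source": """
          if (ctx._source.data != null && ctx._source.data.begin_time != null) {
            ctx._source.data.begin_time = ctx._source.data.begin_time.replace(" ", "T");
          }
          if (ctx._source.data != null && ctx._source.data.history != null
              && ctx._source.data.history.end_time != null) {
            ctx._source.data.history.end_time = ctx._source.data.history.end_time.replace(" ", "T");
          }
        """
    }
}, wait_for_completion=True)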
For debugging purposes, we deleted the cast-allocation index in elasticsearch. Then a new cast-allocation was created with the two fields (data.begin_time, data.history.end_time) now having type "date".
With this new index, the incoming data also needs to have the correct date format. The data source for cast-allocation is the CSM transaction log, which can be manually edited to put the two fields into the correct format. For example:
{"type":"allocation","traceid":531084125,"uid":1,"timestamp":"2020-09-17T16:06:01Z","data":{"allocation_id":1,"begin_time":"2020-09-17 12:06:01.183794","primary_job_id":526,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":1,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm","job_type":"batch","user_name":"root","user_id":0,"user_group_id":0,"user_script":"1 -q normal sleep 3333333333","account":"","comment":"","job_name":"","job_submit_time":"2020-09-17 12:06:57","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":0,"compute_nodes":["f10n16"]} }
{"type":"allocation","traceid":231640295,"uid":2,"timestamp":"2020-09-17T16:06:02Z","data":{"allocation_id":2,"begin_time":"2020-09-17 12:06:02.483204","primary_job_id":527,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":1,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm","job_type":"batch","user_name":"root","user_id":0,"user_group_id":0,"user_script":"1 -q normal sleep 3333333333","account":"","comment":"","job_name":"","job_submit_time":"2020-09-17 12:06:59","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":0,"compute_nodes":["f10n15"]} }
{"type":"allocation","traceid":231640295,"uid":2,"timestamp":"2020-09-17T16:06:55Z","data":{"running-start-timestamp":"2020-09-17 12:06:55.93243"}}
The letter T has to be added in place of the space between date and time:
"begin_time":"2020-09-17 12:06:01.183794" ==> "begin_time":"2020-09-17T12:06:01.183794"
As a result, the following simple script shows that the time range query works:
=================================
$ cat timezone-query.sh
curl -X GET "XX.X.X.XX:9200/cast-allocation/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "range" : {
      "data.begin_time" : {
        "time_zone": "-05:00",
        "gte" : "2021-02-05T14:11:28.578Z",
        "lte" : "now",
        "relation": "within"
      }
    }
  }
}
'
======================================
Another script that mimics how the python scripts query for "--starttime" and "--endtime" also works.
$ cat startend-epochmil-query.sh
curl -X GET "XX.X.X.XX:9200/cast-allocation/_search?pretty" -H 'Content-Type: application/json' -d'
{"query":
{"bool":
{"must":
[{"match":
{"data.user_name":"wcmorris"}
}],
"should":
[{"range":
{"data.begin_time":
{"format":"epoch_millis",
"lte":1612562710000
}
}
},
{"range":
{"data.history.end_time":
{"format":"epoch_millis",
"gte":1612450800000
}
}
},
{"bool":
{"must_not":
{"exists":
{"field":"data.history.end_time"}
}
}
}],
"minimum_should_match":2
}
},
"_source":
["data.primary_job_id",
"data.secondary_job_id",
"data.allocation_id",
"data.user_name",
"data.begin_time",
"data.history.end_time",
"data.state"
],
"size":1000
}
'
A note on how the python scripts query for "--starttime" and "--endtime", as in the following example:
$ ./findUserJobs.py -u <user> --starttime A --endtime B
A and B need to have the correct date and time format, like the one in the transaction log. That turns out to be an elasticsearch date format: "strict_date_optional_time_nanos". A and B form the two edges of the window; all allocations within it will be returned.
starttime endtime
begin_time: |<------------------------B
end_time: A------------------------>|
The recommendation is: always input both A and B. Leaving B out will result in the following query:
starttime endtime
begin_time: |<------------------------"now"
end_time: A------------------------>|
In other words, the query will return all allocations from time A to "now". It's important to note that there is no left edge limit for begin_time nor a right edge limit for end_time. The following two conditions are combined with the "AND" operator:
* Any allocation that has end_time >= A will be returned regardless of its begin_time.
* Any allocation that has begin_time <= B will be returned regardless of its end_time.
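Restated in plain Python, the window check behind that query (the two range clauses plus the must_not/exists clause, with minimum_should_match of 2) amounts to the following. This is only an illustration, not code from the scripts:
def allocation_in_window(alloc, A, B):
    # begin_time only has a right edge: it must be <= B, no matter how early it is.
    begin_ok = alloc["begin_time"] <= B
    # end_time only has a left edge: it must be >= A, or be missing entirely
    # (the allocation is still running, which the must_not/exists clause covers).
    end_time = alloc.get("end_time")
    end_ok = end_time is None or end_time >= A
    return begin_ok and end_ok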
We need to find out where or how the allocation records are written into the transaction log, so that the data for "begin_time" and "history.end_time" can be changed to the correct date format discussed earlier. The CSM transaction log is the data source from which filebeat retrieves the allocation records that are shipped to the elasticsearch indexes.
Bill and I did some digging in the sources, and it seems the following two programs write allocation records to the transaction log:
CSMIAllocationCreate.cc
logging.cc
As in the following example for begin_time:
if ( allocation->begin_time ) free(allocation->begin_time);
allocation->begin_time = strdup( tuples[0]->data[1] );
We just have to replace the ' ' with 'T' in allocation->begin_time.
Another observation: timestamp is also part of the allocation record, and it already has the correct date format (with the T). Why was allocation->begin_time not written that way? The difference in the current data type mapping is that timestamp has type date but begin_time has type string.
I found a way to load a new index mapping for cast-allocation on-the-fly. With the manually modified date format for begin_time and end_time, the python scripts seem to work, at least with the date. For example, the following command returned two allocation records in January:
[root@c685f4n07 python]# ./findUserJobs.py -u tlam --starttime "2021-01-06" --endtime "2021-01-12"
State | AID | P Job ID | S Job ID | Begin Time | End Time
complete | 251 | 1038 | 0 | 2021-01-06T14:36:49.244459 | 2021-01-06T14:37:54.391456
complete | 252 | 1039 | 0 | 2021-01-11T11:27:11.463397 | 2021-01-11T11:27:31.771474
Increasing the starttime excluded the first record and returned only the second record:
[root@c685f4n07 python]# ./findUserJobs.py -u tlam --starttime "2021-01-11" --endtime "2021-01-12"
State | AID | P Job ID | S Job ID | Begin Time | End Time
complete | 252 | 1039 | 0 | 2021-01-11T11:27:11.463397 | 2021-01-11T11:27:31.771474
Searching for the CSM code that writes allocation records to the transaction log is a daunting task that requires understanding the design and structure of the CSM sources. It also took quite a bit of searching, reading, and trial and error with the elasticsearch online docs. But I finally made a breakthrough with a fix: adding a mutate filter to the logstash config file /etc/logstash/conf.d/logstash.conf.
The mutate block should be added to the filter section (lines marked with ">>>"):
else if "transaction" in [tags] {
mutate { remove_field => [ "beat", "host", "source", "offset", "prospector"] }
### Use logstash gsub filter plugin to process input data before sending to elastic index.
### For the two fields: data.begin_time and data.history.end_time, the blank is replaced
### with a letter T. Note: these fields have type "date" that requires the valid format.
### Also note: No change to csm_transaction.log with this process.
>>> mutate {
>>> gsub => [
>>> "[data][begin_time]", " ", "T",
>>> "[data][history][end_time]", " ", "T"
>>> ]
>>> }
}
else if "ras" in [tags] and "csm" in [tags] {
mutate{
gsub => [ "time_stamp", ".(\d{3})\d{3}", ".\1" ]
add_field => { "type" => "log-ras" }
}
date {
match => ["time_stamp", "ISO8601","YYYY-MM-dd HH:mm:ss.SSS" ]
target => "time_stamp"
}
}
This is definitely a better solution than having CSM write the new date format into the transaction log. It doesn't require changes to CSM sources or log files.
For verification, after inserting the new mutate filter into logstash.conf and restarting logstash, try creating an allocation. Then check transaction.log.0 for the new allocation record's begin_time and history.end_time date format:
{"type":"allocation","traceid":151499658,"uid":272,"timestamp":"2021-03-02T21:35:58Z","data":{"allocation_id":272,"begin_time":"2021-03-02 16:35:58.805708","primary_job_id":1361,"secondary_job_id":0,"ssd_file_system_name":"","launch_node_name":"f5n06","user_flags":"","system_flags":"","ssd_min":0,"ssd_max":0,"num_nodes":2,"num_processors":0,"num_gpus":0,"projected_memory":0,"state":"running","type":"jsm-cgroup-step","job_type":"batch","user_name":"tlam","user_id":11745,"user_group_id":100,"user_script":"user script not known by LSF","account":"","comment":"","job_name":"","job_submit_time":"2021-03-02 16:34:47","queue":"normal","requeue":"","time_limit":2147483647,"wc_key":"","isolated_cores":2,"compute_nodes":["f10n16","f10n15"]} }
{"type":"allocation","traceid":151499658,"uid":272,"timestamp":"2021-03-02T21:35:59Z","data":{"running-start-timestamp":"2021-03-02 16:35:58.990888"}}
{"type":"allocation","traceid":841396274,"uid":272,"timestamp":"2021-03-02T21:36:18Z","data":{"state":"complete","history":{"end_time":"2021-03-02 16:36:18.830969"},"running-end-timestamp":"2021-03-02 16:36:18.830969"}}
Notice that the date+time format in the log remains the same, with a blank between date and time. However, a python script that searches for user jobs displays the date+time with the letter T between date and time. That means the fix has written the correct format into the Elasticsearch index cast-allocation.
State | AID | P Job ID | S Job ID | Begin Time | End Time
complete | 267 | 1356 | 0 | 2021-03-02T16:30:46.674554 | 2021-03-02T16:31:06.911078
complete | 268 | 1357 | 0 | 2021-03-02T16:34:39.823075 | 2021-03-02T16:35:00.121053
complete | 269 | 1358 | 0 | 2021-03-02T16:35:01.72255 | 2021-03-02T16:35:18.721014
complete | 270 | 1359 | 0 | 2021-03-02T16:35:20.41316 | 2021-03-02T16:35:40.431049
complete | 271 | 1360 | 0 | 2021-03-02T16:35:41.092173 | 2021-03-02T16:35:58.160964
complete | 272 | 1361 | 0 | 2021-03-02T16:35:58.805708 | 2021-03-02T16:36:18.830969
Before the fix, those date+time values would have had a blank in place of the T.
As discovered in an earlier comment:
A and B need to have the correct date and time format like the one in the transaction log. That turns out to be an elasticsearch date format: "strict_date_optional_time_nanos".
The date+time format in csm_transaction.log.0 is in nanos units. That's one of the reasons why time range queries with epoch_millis time don't work. The script findUserJobs.py is one example.
Found the time format defined in cast_helper.py. Changing that to the nanos format gets findUserJobs.py working with time range queries.
[root@c650f01p22 python]# grep TIME_SEARCH *.py
#cast_helper.py:#TIME_SEARCH_FORMAT = 'yyyy-MM-dd HH:mm:ss'
#cast_helper.py:#TIME_SEARCH_FORMAT = "epoch_millis"
cast_helper.py:TIME_SEARCH_FORMAT = "strict_date_optional_time_nanos"
This change eliminates the calls to convert_time_range(), which converts times to epoch_millis. These calls are in build_time_range() and build_timestamp_range().
[root@c650fXXpXX python]# grep convert_time *.py
cast_helper.py:def convert_timestamp( timestamp ):
cast_helper.py: #start_time = convert_timestamp(start_time)
cast_helper.py: #end_time = convert_timestamp(end_time)
cast_helper.py: #start_time = convert_timestamp(start_time)
cast_helper.py: #end_time = convert_timestamp(end_time)
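For illustration, a range clause built with the nanos format keeps the ISO-style strings instead of converting them to epoch_millis integers. The helper below is only a sketch, not the exact code in cast_helper.py:
TIME_SEARCH_FORMAT = "strict_date_optional_time_nanos"

def build_begin_time_range(start_time=None, end_time=None):
    # Values such as "2021-03-11T10:02:11.000000" are passed through unchanged.
    bounds = {"format": TIME_SEARCH_FORMAT}
    if start_time:
        bounds["gte"] = start_time
    if end_time:
        bounds["lte"] = end_time
    return {"range": {"data.begin_time": bounds}}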
Before the change:
[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-01
State | AID | P Job ID | S Job ID | Begin Time | End Time
complete | 1 | 1 | 0 | 2021-03-11T10:01:20.626411 | 2021-03-11T10:01:25.266256
complete | 2 | 1 | 0 | 2021-03-11T10:01:26.803619 | 2021-03-11T10:01:29.501622
complete | 3 | 1 | 0 | 2021-03-11T10:02:11.384713 | 2021-03-11T10:02:12.296012
[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-11T10:02:11.000000 --endtime 2021-03-11T10:02:12.999999
State | AID | P Job ID | S Job ID | Begin Time | End Time
After the change to use nanos:
[root@c650fXXpXX python]# ./findUserJobs.py -u root --starttime 2021-03-11T10:02:11.000000 --endtime 2021-03-11T10:02:12.999999
State | AID | P Job ID | S Job ID | Begin Time | End Time
complete | 3 | 1 | 0 | 2021-03-11T10:02:11.384713 | 2021-03-11T10:02:12.296012
@williammorrison2 found another script that has a problem with the time format "epoch_millis | nanos".
[root@c650fXXpXX python]# ./findJobsRunning.py -T 2021-03-01
A RequestError was recieved:
type: search_phase_execution_exception
reason: all shards failed
root_0_type: parse_exception
root_0_reason: failed to parse date field [1614574800] with format [strict_date_optional_time_nanos]: [failed to parse date field [1614574800] with format [strict_date_optional_time_nanos]]
But it is defined in a different way, not via convert_time_range():
date_print_format='%s'
That should be changed to the following format to be compatible with nanos:
# Parse the user's date.
date_format='(\d{4})-(\d{1,2})-(\d{1,2})[ \.T]*(\d{0,2}):{0,1}(\d{0,2}):{0,1}(\d{0,2})'
date_print_format='%Y-%m-%dT%H:%M:%S'
#date_print_format='%s'
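A quick sketch of what that change does to the -T argument (illustrative only; the two variables mirror the snippet above):
import re
from datetime import datetime

date_format = r'(\d{4})-(\d{1,2})-(\d{1,2})[ \.T]*(\d{0,2}):{0,1}(\d{0,2}):{0,1}(\d{0,2})'
date_print_format = '%Y-%m-%dT%H:%M:%S'   # was '%s' (epoch seconds), which the nanos format rejects

groups = re.search(date_format, "2021-03-01").groups()
y, mo, d, h, mi, s = (int(g) if g else 0 for g in groups)
# Missing time components default to 0, then the date is re-printed in ISO form:
print(datetime(y, mo, d, h, mi, s).strftime(date_print_format))   # -> 2021-03-01T00:00:00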
Describe the bug
Regardless of the date entered with --starttime or --endtime, a script queries and returns all job or allocation records. The script findUserJobs.py is used here as an example. All other scripts that support the --starttime and --endtime options have a similar issue.

To Reproduce
Steps to reproduce the behavior: ./findUserJobs.py --starttime 2021-01-12 to search for jobs started on that date.

Expected behavior
For these scripts to work correctly, they should only return the jobs within the time ranges specified with --starttime and|or --endtime.

Screenshots
I added some debug prints in the script to show the time range that it created from the --starttime input, and the query the script will send to elasticsearch. The time ranges and the query look correct. However, the problem is with the two fields data.begin_time and data.history.end_time, which have type "text" or "string".

Additional context
Issue Source: CSM regression tests