IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

CSM BDS: python scripts TypeError with --state reverted #990

Open thanh-lam opened 3 years ago

thanh-lam commented 3 years ago

Describe the bug To query allocation data, CSM provides Python scripts in /opt/ibm/csm/bigdata/python/. One example is "findUserJobs.py", which lists allocation info such as "state" for a user's jobs. It produces the following error when run with --state reverted. Other states (running, failed, complete) are listed with no error.

# ./findUserJobs.py -u tlam --state reverted
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
Traceback (most recent call last):
  File "./findUserJobs.py", line 167, in <module>
    sys.exit(main(sys.argv))
  File "./findUserJobs.py", line 135, in main
    data.get("state")))
TypeError: unsupported format string passed to NoneType.__format__

To Reproduce Steps to reproduce the behavior:

  1. Login to CSM master or BDS node as root.
  2. Change to /opt/ibm/csm/bigdata/python/ then run the command:
    # ./findUserJobs.py -u <userid> --state reverted
  3. See error.

Expected behavior The command should not produce the error, which looks like an internal condition that is not handled for the reverted state. Example of good command output:

# ./findUserJobs.py -u root
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time                  
  complete |     1 |      526 | 0        | 2020-09-17 12:06:01.183794 | 2020-09-17 12:06:58.382907
  complete |     2 |      527 | 0        | 2020-09-17 12:06:02.483204 | 2020-09-17 12:06:56.800503
  complete |     3 |      528 | 0        | 2020-09-17 12:10:25.346929 | 2020-09-17 12:10:26.374556
  complete |     4 |      530 | 0        | 2020-09-17 12:12:52.830307 | 2020-09-17 12:12:53.838649
  complete |     5 |      531 | 0        | 2020-09-17 12:18:24.195737 | 2020-09-17 12:18:25.275275
  complete |     6 |      532 | 0        | 2020-09-17 12:20:02.431532 | 2020-09-17 12:20:27.420373
  complete |     7 |      533 | 0        | 2020-09-17 12:22:12.25542  | 2020-09-17 12:22:33.375314
  complete |     8 |      534 | 0        | 2020-09-17 12:27:51.261522 | 2020-09-17 12:28:11.101704
  complete |     9 |      535 | 0        | 2020-09-17 12:28:01.331114 | 2020-09-17 12:28:11.447308
  complete |    10 |        1 | 0        | 2020-09-17 12:30:36.073652 | 2020-09-17 14:55:39.055421
  complete |    11 |        2 | 0        | 2020-09-17 12:30:36.137917 | 2020-09-17 14:55:40.508644
    failed |    24 |        1 | 0        | 2020-09-23 16:41:58.746747 | 2020-09-23 16:41:58.951312
  complete |   168 |      557 | 0        | 2020-10-28 10:35:17.335562 | 2020-10-28 10:41:13.049669
  complete |   169 |      661 | 0        | 2020-10-28 10:51:58.674377 | 2020-10-28 11:52:02.579073

Environment (please complete the following information):

Additional context The TypeError could be caused by an empty field in the data record of a job in the reverted state.
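To illustrate how an empty field can lead to this exact message, here is a minimal sketch that is independent of the CSM scripts. The format string `row_fmt` and the helper `try_format` are made up for this example and are not taken from findUserJobs.py. The point is only that Python 3 rejects a width or alignment spec applied to None, while a bare `{}` accepts it.

```python
# Hypothetical format string with width specs; NOT the script's real print_fmt.
row_fmt = "{:>10} | {:<26}"


def try_format(state, end_time):
    """Return the formatted row, or the TypeError text if formatting fails."""
    try:
        return row_fmt.format(state, end_time)
    except TypeError as err:
        return "TypeError: {}".format(err)


# A record with a real end time formats normally.
ok_row = try_format("complete", "2020-09-17 12:06:58.382907")

# A record whose end time is None fails, because None does not
# support a non-empty format spec such as "<26".
bad_row = try_format("reverted", None)

# Without a format spec, None is simply rendered as the text "None".
plain = "{}".format(None)

print(ok_row)
print(bad_row)
print(plain)
```

If the record of a reverted job carries None in any field that is printed with a width spec, the script would fail in the same way.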

Issue Source: CSM regression tests.

thanh-lam commented 3 years ago

The script prints the list of user jobs fine until it hits the TypeError, which happens when a job has state = reverted. Bill found out from the database or indices that "reverted" jobs have an empty "end_time". Python 3 raises a TypeError when it tries to format that empty value while printing the job record, as in this print statement:

            print( print_fmt.format(
                data.get("allocation_id"), data.get("primary_job_id"), data.get("secondary_job_id"),
                data.get("begin_time"), cast.deep_get(data,"history","end_time"),
                data.get("state")))

To fix that, we need to check the field 'cast.deep_get(data,"history","end_time")' and print a blank if it's empty. This is the most direct fix we could find, and it works as intended.

            condition = cast.deep_get(data, "history","end_time")
            print( print_fmt.format(
                data.get("allocation_id"), data.get("primary_job_id"), data.get("secondary_job_id"),
                data.get("begin_time"), cast.deep_get(data,"history","end_time") if (condition!=None) else " ",
                data.get("state")))

The line "condition = ..." is added to make the "if ... else ..." check on the field more readable.
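For anyone who wants to try the check outside the CSM environment, the following is a self-contained sketch of the same idea. It makes several assumptions: `deep_get` below is a hypothetical stand-in for cast.deep_get (only its call shape is taken from the snippet above), `print_fmt` is a guessed layout based on the column header, and the sample record is invented. It also compares with `is not None`, which is the usual Python idiom for this test.

```python
def deep_get(obj, *keys):
    """Hypothetical stand-in for cast.deep_get: walk nested dicts and
    return None as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


# Assumed layout; the script's real print_fmt is not shown in this issue.
print_fmt = "{5:>10} | {0:>5} | {1:>8} | {2:<8} | {3:<26} | {4:<26}"


def format_row(data):
    """Format one allocation record, printing a blank for a missing end time."""
    end_time = deep_get(data, "history", "end_time")
    return print_fmt.format(
        data.get("allocation_id"), data.get("primary_job_id"),
        data.get("secondary_job_id"), data.get("begin_time"),
        end_time if end_time is not None else " ",
        data.get("state"))


# Invented record for a reverted job: there is no "history" entry at all.
reverted = {
    "allocation_id": 6, "primary_job_id": 1, "secondary_job_id": 0,
    "begin_time": "2021-02-23 14:01:39.697209", "state": "reverted",
}

row = format_row(reverted)
print(row)
```

Storing the looked-up value once and reusing it avoids calling the lookup twice, but the behavior is the same as in the snippet above.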

thanh-lam commented 3 years ago

A similar fix can also be applied to another script, "findJobsRunning.py".

                condition = cast.deep_get(data, "history","end_time")
                print(print_fmt.format(
                    data.get("allocation_id"), data.get("primary_job_id"), data.get("secondary_job_id"),
                    data.get("begin_time"), cast.deep_get(data, "history","end_time") if (condition!=None) else " "))

williammorrison2 commented 3 years ago

Thanks @thanh-lam for working with me and writing this up. I'm in the process of reviewing some of the other scripts to ensure we catch similar cases. I will add the details to this issue.
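As one possible way to cover similar cases in several scripts at once, a small helper could blank out every None value before formatting instead of guarding a single field. This is only a sketch: `blank_none`, `row_fmt`, and the sample record are hypothetical and do not exist in the CSM scripts.

```python
def blank_none(*values, blank=" "):
    """Return the values as a tuple, with every None replaced by a
    printable blank so that width specs do not raise TypeError."""
    return tuple(blank if value is None else value for value in values)


# Assumed four-column layout, for illustration only.
row_fmt = "{:>5} | {:>8} | {:<26} | {:<26}"

# Invented record with no end time.
record = {
    "allocation_id": 13,
    "primary_job_id": 1,
    "begin_time": "2021-02-23 14:26:50.28166",
}

row = row_fmt.format(*blank_none(
    record.get("allocation_id"), record.get("primary_job_id"),
    record.get("begin_time"), record.get("end_time")))
print(row)
```

The trade-off is that a blanket helper also hides missing values in fields that are expected to be present, so a per-field check may still be preferable where an empty value should be noticed.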

williammorrison2 commented 3 years ago

A similar fix can also be applied to another script, findJobsInRange.py.

            if data:
                condition = cast.deep_get(data, "history","end_time")
                print(print_fmt.format(
                    data.get("allocation_id"), data.get("primary_job_id"), data.get("secondary_job_id"),
                    data.get("begin_time"), cast.deep_get(data, "history","end_time") if (condition!=None) else " ",
                    data.get("user_name")))

williammorrison2 commented 3 years ago

@thanh-lam These are some examples of the query after the fix was implemented.

[root@c650f99p06 python]# ./findUserJobs.py -u tlam --state reverted
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time
[root@c650f99p06 python]# ./findUserJobs.py -u wcmorris --state reverted
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time
[root@c650f99p06 python]# ./findUserJobs.py -u root --state reverted
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time
  reverted |     6 |        1 | 0        | 2021-02-23 14:01:39.697209 |

[root@c650f99p06 python]# ./findUserJobs.py -u root
     State |   AID | P Job ID | S Job ID | Begin Time                 | End Time
  complete |     1 |        1 | 0        | 2021-02-23 12:04:34.828635 | 2021-02-23 12:04:39.513245
  complete |     2 |        1 | 0        | 2021-02-23 12:04:40.983847 | 2021-02-23 12:04:43.556549
  complete |     3 |        1 | 0        | 2021-02-23 12:05:01.829019 | 2021-02-23 12:05:02.492537
  complete |     4 |        1 | 0        | 2021-02-23 13:48:52.624415 | 2021-02-23 13:48:53.528137
  complete |     5 |        1 | 0        | 2021-02-23 14:00:14.318896 | 2021-02-23 14:03:32.978141
  reverted |     6 |        1 | 0        | 2021-02-23 14:01:39.697209 |
  complete |     7 |        1 | 0        | 2021-02-23 14:05:37.494822 | 2021-02-23 14:05:38.328102
  complete |     8 |        1 | 0        | 2021-02-23 14:08:06.726752 | 2021-02-23 14:08:07.399833
  complete |     9 |        1 | 0        | 2021-02-23 14:09:41.859691 | 2021-02-23 14:09:42.559594
  complete |    10 |        1 | 0        | 2021-02-23 14:16:08.829438 | 2021-02-23 14:16:09.533021
  complete |    11 |        1 | 0        | 2021-02-23 14:17:05.743261 | 2021-02-23 14:17:06.379795
  complete |    12 |        1 | 0        | 2021-02-23 14:18:37.053626 | 2021-02-23 14:18:37.73513
   running |    13 |        1 | 0        | 2021-02-23 14:26:50.28166  | 2021-02-23 14:26:50.970676
  complete |    14 |        1 | 0        | 2021-02-23 14:28:38.323487 | 2021-02-23 14:28:38.998807
  complete |    15 |        1 | 0        | 2021-02-23 14:35:32.508862 | 2021-02-23 14:35:33.167389

besawn commented 3 years ago

Fixed by PR #994.

besawn commented 3 years ago

@thanh-lam I'm going to leave this issue open until you have a chance to verify the fix in the next CAST build.