dosaboy / searchkit

Apache License 2.0

searchkit/constraints: rewrite of binary search algorithm #10

Closed. mustafakemalgilor closed this 1 year ago.

mustafakemalgilor commented 1 year ago

Implemented a new binary search algorithm that no longer needs filemarkers or prior knowledge of the lines, which reduces the time spent applying a SearchConstraintSearchSince to a file, especially when the file is large.
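The core idea can be sketched as follows. This is an illustrative stand-alone sketch, not the PR's actual implementation: it bisects raw byte offsets so no file markers or pre-counted lines are needed, and it assumes every line begins with a sortable ISO date and that dates are non-decreasing (the real code also handles undated lines).

```python
import re

# Hypothetical sketch, not searchkit's actual code.
DATE = re.compile(rb'^(\d{4}-\d{2}-\d{2})')

def line_matches(f, offset, since):
    """True if the first complete line at/after `offset` is dated >= since."""
    f.seek(offset)
    if offset:
        f.readline()              # skip a possibly partial line
    line = f.readline()
    if not line:
        return True               # past EOF: vacuously true, shrink the range
    m = DATE.match(line)
    return m is not None and m.group(1).decode() >= since

def first_offset_since(path, since):
    """Byte offset of the first line dated >= `since` (ISO date string)."""
    with open(path, 'rb') as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()
        # Invariant: the smallest offset whose predicate holds is in [lo, hi].
        while lo < hi:
            mid = (lo + hi) // 2
            if line_matches(f, mid, since):
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()          # realign to the start of the matched line
        return f.tell()           # equals the file size if nothing matched
```

Only O(log n) seeks and short reads are needed, regardless of how many lines the file has.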

Removed the following classes, which are no longer necessary:

Removed the test_logs_since_junk_not_allow_unverifiable test case, since we're no longer parsing all lines in the file.

Removed the following functions from BinarySeekSearchBase:

Introduced the following new classes:

mustafakemalgilor commented 1 year ago

@dosaboy pinging you just in case you might have missed this one

dosaboy commented 1 year ago

@mustafakemalgilor apologies, I had missed this. I will try to dedicate some time to testing it.

dosaboy commented 1 year ago

Some initial observations. I definitely prefer the approach of bisecting without having to rely on markers, as that will save a lot of time and resources. I ran it against a sosreport (using hotsos) with a lot of log history and I hit some exceptions, e.g.

Traceback (most recent call last):
  File "/home/user1/git/dosaboy/searchkit/searchkit/search.py", line 1039, in execute
    fd.read(1)
  File "/usr/lib/python3.10/gzip.py", line 301, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.10/gzip.py", line 488, in read
    if not self._read_gzip_header():
  File "/usr/lib/python3.10/gzip.py", line 436, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'20')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/git/dosaboy/searchkit/searchkit/search.py", line 1044, in execute
    stats = self._run_search(fd)
  File "/home/user1/git/dosaboy/searchkit/searchkit/search.py", line 976, in _run_search
    offset = self.constraints_manager.apply_global(search_ids, fd)
  File "/home/user1/git/dosaboy/searchkit/searchkit/search.py", line 1140, in apply_global
    _offset = c.apply_to_file(fd)
  File "/home/user1/git/canonical/hotsos/hotsos/core/search.py", line 63, in apply_to_file
    return super().apply_to_file(*args, **kwargs)
  File "/home/user1/git/dosaboy/searchkit/searchkit/constraints.py", line 705, in apply_to_file
    result = seeker.run()
  File "/home/user1/git/dosaboy/searchkit/searchkit/constraints.py", line 550, in run
    bisect.bisect_left(self, self.constraint._since_date)
  File "/home/user1/git/dosaboy/searchkit/searchkit/constraints.py", line 496, in __getitem__
    ) = self.try_find_line_w_date(item)
  File "/home/user1/git/dosaboy/searchkit/searchkit/constraints.py", line 370, in try_find_line_w_date
    raise ValueError("Could not find ending line feed offset")
ValueError: Could not find ending line feed offset

I didn't go through them all yet, but I suspect they were the result of invalid characters being hit during searches (I remember having to do some .decode('unicode_escape') to get around this kind of thing).
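One common way to sidestep decode failures on lines containing binary junk (illustrative only, not necessarily what searchkit ends up doing) is to decode candidate lines with a non-raising error handler before applying the date pattern:

```python
# Hypothetical helper: 'backslashreplace' never raises, so undecodable
# bytes become \xNN escapes and the rest of the line stays matchable.
def safe_decode(raw: bytes) -> str:
    return raw.decode('utf-8', errors='backslashreplace')
```

Note the BadGzipFile in the first traceback is a separate issue: it comes from attempting a gzip read on a file that turned out to be plain text.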

Another thing I notice is that this implementation will always result in a full bisect, even if the first line in the file is valid or the last line is invalid. The existing code does a pre-check of the first and last lines to avoid this, which would save further time.

Regarding the code changes themselves, it is quite hard to debug due to the heavy use of abbreviated variable names, for which it isn't immediately obvious what they refer to. That combined with the very complex returns from methods makes it even harder. I think if you can implement a class to represent the state returned by try_find_line_w_date() and pass that around as an object, it would really help. Nice work though!
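For illustration, such a state object could look roughly like this. The field names are hypothetical guesses, not searchkit's actual API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LineSearchResult:
    """Illustrative state for one probed line (field names are guesses)."""
    line_start: int              # byte offset where the line begins
    line_end: int                # byte offset of the line's trailing newline
    date: Optional[datetime]     # parsed timestamp, None if none was found

    @property
    def has_date(self) -> bool:
        return self.date is not None
```

Passing one named object around instead of a multi-element tuple makes each call site self-documenting.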

mustafakemalgilor commented 1 year ago

Hi Ed, thanks for the review, much appreciated!

> Some initial observations. I definitely prefer the approach of bisecting without having to rely on markers, as that will save a lot of time and resources. I ran it against a sosreport (using hotsos) with a lot of log history and I hit some exceptions, e.g.

Can you send me an example sosreport? hotsos's test suite is all green with the latest revision of this branch, so I'd like to add a test for whatever is making this fail.

> Another thing I notice is that this implementation will always result in a full bisect, even if the first line in the file is valid or the last line is invalid. The existing code does a pre-check of the first and last lines to avoid this, which would save further time.

Well, I'm not sure what "valid" stands for here, but the new algorithm will find the nearest valid line if it lands on an "invalid" one, e.g.:

01.01.1970 foo
01.01.1970 bar
nodate

Assuming that the lookup hits the line "nodate", the algorithm will yield the previous valid line, "01.01.1970 bar". This is needed because some log entries span multiple lines, for example:

2019-04-04T14:47:35.023075Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,0256ddd2,2)
memb {
    0256ddd2,0
    947dd30b,0
    }
joined {
    }
left {
    }
partitioned {
    }
)
2019-04-04T14:47:35.023099Z 0 [Note] WSREP: Save the discovered primary-component to disk
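The fallback described above can be sketched like this. It is an in-memory toy (the real code operates on byte offsets), and the names are illustrative:

```python
import re

DATE = re.compile(r'^\d{4}-\d{2}-\d{2}')

# Illustrative sketch: if the probed line has no date (e.g. a continuation
# line of a multi-line entry), walk backwards until a dated line is found.
def nearest_dated_line(lines, idx):
    while idx >= 0 and not DATE.match(lines[idx]):
        idx -= 1
    return idx   # -1 if no dated line precedes idx
```

Probing any of the `memb {` / `joined {` continuation lines in the WSREP example above would thus resolve to the dated line that starts the entry.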

> Regarding the code changes themselves, it is quite hard to debug due to the heavy use of abbreviated variable names, for which it isn't immediately obvious what they refer to. That combined with the very complex returns from methods makes it even harder. I think if you can implement a class to represent the state returned by try_find_line_w_date() and pass that around as an object, it would really help.

I agree. I refactored and simplified the code quite a bit; it should be easier to reason about now :)

dosaboy commented 1 year ago

Awesome, nice work Mustafa, I will get to this ASAP. Regarding the comment about "valid": what I mean is, if you have a constraint saying "give me logs after 2023-01-10" and a log file with 50M lines where the last line is dated 2023-01-09, then you only want to read that last line to know that the rest of the file does not match the constraint. Similarly, if the first line in the file is dated 2023-01-10, then you only want to read that first line to know that the whole file matches the constraint, i.e. is "valid".
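That pre-check can be sketched as follows (an illustrative helper, not searchkit's actual API; dates compared as ISO strings so lexicographic order matches chronological order):

```python
# Hypothetical sketch of the first/last-line shortcut: read at most two
# lines before deciding whether a bisect is needed at all.
def plan_search(first_line_date, last_line_date, since):
    if last_line_date < since:
        return 'skip-file'       # nothing in the file can match
    if first_line_date >= since:
        return 'whole-file'      # every line matches, no bisect needed
    return 'bisect'              # the boundary lies somewhere in the middle
```

For the 50M-line example above, the constraint is resolved after reading a single line in both shortcut cases.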

dosaboy commented 1 year ago

I have done some testing on sosreports with large log files, and this seems to work well, even showing a small performance improvement. I have proposed some style consistency tweaks at https://github.com/mustafakemalgilor/searchkit/pull/2 but on the whole I think it's good!

dosaboy commented 1 year ago

I think this is good now. I have a branch with some tweaks and cleanups that I will apply once this has landed.