CodyCBakerPhD opened 3 months ago
Overall, the idea leans towards a unified API for performing the S3 log parsing process itself, so that alternative versions are easier to drop in or swap out.

A base class might follow the rough strategy here:
```python
from pydantic import DirectoryPath, FilePath

class S3LogParser:
    def __init__(
        self,
        parsed_folder_path: DirectoryPath,
        s3_log_file_path: FilePath | None = None,
        s3_log_folder_path: DirectoryPath | None = None,
    ):
        # Assert XOR on the path options: a single file is parsed according to
        # the rules of this class; a folder's structure is iterated likewise
        ...

    def _parse_line(self, line: str) -> "FullLog | None":
        # Parse a single line of a single log file
        ...

    def _parse_lines(self, lines: list[str]) -> "list[FullLog]":
        # Read in and parse all lines (in buffered style) from a single log file
        ...

    def _reduce_elements(
        self,
        elements: list[str] = ["timestamps", "asset_id", "remote_ip", "bytes_sent"],
        # though really a constrained Literal over all 20+ possible fields
    ) -> "list[ReducedLog]":
        # What constitutes a 'reduced log' type might change from class to class;
        # probably controlled via __init__: which subfields of an S3 log we wish
        # our parsed output to contain
        ...

    def _iterate_directory(self, s3_log_folder_path: DirectoryPath):
        # The rules for iterating directories; might need some inference on whether
        # it's a base/year/month level (natsort did not work out of the box on the base)
        ...
```
There are several parsing methods that I might test or adjust over time; it might be nice to allow selecting which one to use, trading off accuracy against speed/memory performance.