catalystneuro / dandi_s3_log_parser

S3 log parsing for the DANDI Archive.
BSD 3-Clause "New" or "Revised" License

[Performance & Accuracy Idea] Abstract parsing method #9

Open — CodyCBakerPhD opened this issue 3 months ago

CodyCBakerPhD commented 3 months ago

There are several parsing methods that I might test or adjust over time; it might be nice to allow selecting which one to use, trading accuracy against speed/memory performance.

CodyCBakerPhD commented 3 months ago

Overall, the idea leans toward a unified API for performing the S3 log parsing process itself, so that alternative implementations are easier to drop in or swap out.
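
As a rough sketch of what the selection mechanism could look like (the registry and the `register_parser`/`get_parser` helpers here are hypothetical, not existing package API):

PARSER_REGISTRY: dict[str, type] = {}

def register_parser(name: str):
    # Decorator recording a parser class under a human-readable method name
    def decorator(cls: type) -> type:
        PARSER_REGISTRY[name] = cls
        return cls
    return decorator

def get_parser(name: str) -> type:
    # Resolve a parsing method by name, e.g. get_parser("regex")
    if name not in PARSER_REGISTRY:
        raise ValueError(f"Unknown parsing method {name!r}; available: {sorted(PARSER_REGISTRY)}")
    return PARSER_REGISTRY[name]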

CodyCBakerPhD commented 3 months ago

A base class might follow this rough strategy:

from collections.abc import Iterable

from pydantic import DirectoryPath, FilePath

# Placeholder types for illustration; the real log representations are not yet fixed
FullLog = list[str]
ReducedLog = list[str]


class S3LogParser:
    def __init__(
        self,
        parsed_folder_path: DirectoryPath,
        s3_log_file_path: FilePath | None = None,
        s3_log_folder_path: DirectoryPath | None = None,
    ):
        # Assert XOR on the path options: exactly one of file or folder must be provided
        if (s3_log_file_path is None) == (s3_log_folder_path is None):
            raise ValueError("Provide exactly one of `s3_log_file_path` or `s3_log_folder_path`.")

        # If a file, parse that single file according to the rules of this class
        # If a folder, iterate the directory structure according to the rules of this class

    def _parse_line(self, line: str) -> FullLog | None:
        # Parse a single line of a single log file
        pass

    def _parse_lines(self, lines: Iterable[str]) -> list[FullLog]:
        # Read in and parse all lines (in buffered style) from a single log file
        pass

    def _reduce_elements(
        self,
        elements: list[str] = ["timestamps", "asset_id", "remote_ip", "bytes_sent"],
        # Though actually a constrained Literal over all 20+ possible fields
    ) -> list[ReducedLog]:
        # Though what constitutes a 'reduced log' type might change from class to class...
        # Probably via __init__, control which subfields of an S3 log we wish the parsed output to contain
        pass

    def _iterate_directory(self, s3_log_folder_path: DirectoryPath):
        # The rules for iterating directories; might need some inference on whether it is at the base/year/month level
        # Natsort did not work out of the box on the base
        pass
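
To make the drop-in idea concrete, a minimal subclass might look like the following, reusing the hypothetical registry sketched in the earlier comment; the class name and regex are illustrative only, not the package's actual parsing rules:

import re
from pathlib import Path

@register_parser("regex")
class RegexS3LogParser(S3LogParser):
    # S3 access log fields are space-separated, but timestamps are bracketed
    # and some fields are quoted, so a naive split on spaces is not enough
    _FIELD_PATTERN = re.compile(r'"[^"]*"|\[[^\]]*\]|\S+')

    def _parse_line(self, line: str) -> FullLog | None:
        fields = self._FIELD_PATTERN.findall(line)
        # A well-formed S3 access log line has 20+ fields; skip anything shorter
        return fields if len(fields) >= 20 else None

parser = get_parser("regex")(
    parsed_folder_path=Path("parsed_logs"),   # hypothetical output folder
    s3_log_file_path=Path("2024/01/01.log"),  # hypothetical single log file
)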