catalystneuro / dandi_s3_log_parser

S3 log parsing for the DANDI Archive.
BSD 3-Clause "New" or "Revised" License

[Performance Idea] Exclude first X characters related to bucket #8

Closed CodyCBakerPhD closed 2 months ago

CodyCBakerPhD commented 4 months ago

Every line likely starts with `8787a3c41bf7ce0d54359d9348ad5b08e16bd5bb8ae5aa4e1508b435773a066e dandiarchive` for this bucket. Since we're excluding this info from the final result anyway, we could probably speed up parsing by scrubbing it from all lines (and from the regex) shortly after reading from the buffer.

Just need to set a static offset index for all raw lines
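A minimal sketch of that static-offset idea, not the package's actual API: slice the constant `<owner-id> <bucket-name> ` prefix off each raw line before the (then reduced) regex sees it. The offset of 78 is an assumption here (a 64-character owner ID, a space, the 12-character bucket name `dandiarchive`, and a trailing space).

```python
PREFIX_OFFSET = 78  # assumed static width of the "<owner-id> <bucket-name> " prefix

def iter_trimmed_lines(file_path: str):
    """Yield raw S3 log lines with the constant bucket prefix sliced off."""
    with open(file_path, "r") as io:
        for raw_line in io:
            yield raw_line[PREFIX_OFFSET:]
```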

CodyCBakerPhD commented 4 months ago

(We will still have to load them into RAM in order to do this; I just wonder if doing it as a pre-step would offer a global speedup to the other operations.)

CodyCBakerPhD commented 4 months ago

The question, though, is whether the extra time it takes to strip the prefix via something like `[line[30:] for line in io.readlines()]` exceeds the time saved by matching a reduced regex.
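A rough micro-benchmark sketch for answering that question; `example.log`, the patterns, and the offset of 78 are all illustrative assumptions, not names from the package.

```python
import re
import timeit

example_log_path = "example.log"  # assumed path to a plain-text S3 access log

# Illustrative patterns: the full one matches owner + bucket + timestamp,
# the reduced one assumes the prefix has already been sliced off.
full_pattern = re.compile(r"^\S+ \S+ \[(?P<timestamp>[^\]]+)\]")
reduced_pattern = re.compile(r"^\[(?P<timestamp>[^\]]+)\]")

with open(example_log_path) as io:
    lines = io.readlines()

def parse_full():
    return [full_pattern.match(line) for line in lines]

def parse_sliced():
    return [reduced_pattern.match(line[78:]) for line in lines]

print("full regex:   ", timeit.timeit(parse_full, number=10))
print("slice + regex:", timeit.timeit(parse_sliced, number=10))
```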

CodyCBakerPhD commented 3 months ago

Might also want to try reading the first X characters of each line into some kind of numpy array (static contiguous block) for vectorization
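A minimal sketch of what that could look like, assuming every line is at least `PREFIX_WIDTH` ASCII characters long; the names and the width are hypothetical.

```python
import numpy as np

PREFIX_WIDTH = 78  # assumed constant width of the owner ID + bucket name + spaces

def prefix_block(lines: list[str]) -> np.ndarray:
    """Stack the first PREFIX_WIDTH characters of each line into a contiguous (n, PREFIX_WIDTH) uint8 array."""
    raw = b"".join(line[:PREFIX_WIDTH].encode("ascii") for line in lines)
    return np.frombuffer(raw, dtype=np.uint8).reshape(len(lines), PREFIX_WIDTH)

# The block could then be compared against an expected prefix in one
# vectorized operation, e.g. (prefix_block(lines) == expected_row).all()
```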

CodyCBakerPhD commented 2 months ago

Vectorization or other contiguous binary forms are out the window; the per-line structure is far too irregular in size for that.

Since the strings are already loaded from I/O into RAM, copying a substring via slicing would also add cost and likely would not give any net performance gain after that.
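A quick illustration of where that slicing cost comes from in CPython: string slicing allocates a new object rather than a view, so every trimmed line is a second copy on top of the one already in memory.

```python
import sys

line = "8787a3c4" + "x" * 400  # stand-in for a raw log line already held in RAM
trimmed = line[78:]

print(sys.getsizeof(line))     # size of the original string
print(sys.getsizeof(trimmed))  # nearly as large: an independent allocation, not a view
print(trimmed is line)         # False
```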