(We will still have to load them into RAM to do this; I just wonder whether doing it as a pre-step would offer a global speedup to the other operations.)
The question is, though: does the extra time it takes to trim the prefix from each line via something like `[line[30:] for line in io.readlines()]` exceed the time savings from parsing with a reduced regex?
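A quick way to settle that would be a timing sketch along these lines (the file name, prefix width, and patterns below are all placeholders, not the actual parser):

```python
import re
import timeit

PREFIX_LENGTH = 78  # assumption: 64-char owner hash + " " + "dandiarchive" + " "

full_pattern = re.compile(r"\S+ \S+ (\[[^\]]*\])")  # hypothetical full-line pattern
reduced_pattern = re.compile(r"(\[[^\]]*\])")       # same capture, prefix assumed stripped

with open("access_log.txt") as io:  # hypothetical log file
    lines = io.readlines()

t_full = timeit.timeit(lambda: [full_pattern.match(line) for line in lines], number=10)
t_sliced = timeit.timeit(
    lambda: [reduced_pattern.match(line[PREFIX_LENGTH:]) for line in lines],
    number=10,
)
print(f"full regex: {t_full:.3f}s vs slice + reduced regex: {t_sliced:.3f}s")
```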
Might also want to try reading the first X characters of each line into some kind of numpy array (static contiguous block) for vectorization
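A rough sketch of that idea (prefix width and file name are assumptions, and it presumes every line is at least that long): copy the fixed-width head of each line into one contiguous block, then operate on it with vectorized numpy ops.

```python
import numpy as np

PREFIX_LENGTH = 78  # assumed fixed prefix width

with open("access_log.txt", "rb") as io:  # hypothetical log file
    heads = b"".join(line[:PREFIX_LENGTH] for line in io)

# One row of raw bytes per line; e.g. a vectorized check that every prefix
# is identical. (Lines shorter than PREFIX_LENGTH would break the reshape.)
prefix_block = np.frombuffer(heads, dtype=np.uint8).reshape(-1, PREFIX_LENGTH)
print(bool((prefix_block == prefix_block[0]).all()))
```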
Vectorization or other contiguous binary forms are out the window; the per-line structure is far too irregular in size for that.
Since the strings are already loaded from I/O into RAM, copying a substring via slicing would add cost of its own and likely would not yield any net performance gain.
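For what it's worth, one way to skip the prefix without that copy (just a thought, not something proposed above) is the `pos` argument on a compiled pattern, which starts matching at an offset into the original string:

```python
import re

PREFIX_LENGTH = 78  # assumed static offset
pattern = re.compile(r"\[[^\]]*\]")  # hypothetical reduced pattern (no ^ anchor; match() anchors at pos)

def parse(line: str):
    # Starts matching at the given offset; no intermediate substring is created.
    return pattern.match(line, PREFIX_LENGTH)
```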
Every line likely starts with

```
8787a3c41bf7ce0d54359d9348ad5b08e16bd5bb8ae5aa4e1508b435773a066e dandiarchive
```

for this bucket. Since we're excluding this info from the final result anyway, we could probably speed up parsing by scrubbing it from all lines (and from the regex) shortly after reading from the buffer. Just need to set a static offset index for all raw lines.
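A minimal sketch of that pre-step, assuming the prefix really is identical on every line (the expected prefix is taken from the example above; the offset is just its length):

```python
EXPECTED_PREFIX = (
    "8787a3c41bf7ce0d54359d9348ad5b08e16bd5bb8ae5aa4e1508b435773a066e dandiarchive "
)
STATIC_OFFSET = len(EXPECTED_PREFIX)

def scrub(raw_lines):
    # Verify the assumption once, then strip the static prefix everywhere.
    if raw_lines and not raw_lines[0].startswith(EXPECTED_PREFIX):
        raise ValueError("unexpected line prefix; static offset would corrupt parsing")
    return [line[STATIC_OFFSET:] for line in raw_lines]
```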