mrocklin closed 9 years ago
This could be improved with a more rigorous system to escape names for the file system. Does anyone know of a common solution here?
One of my potential improvement plans was to remap the names to an iteration (something like part_1). Since we have the partitions stored in meta, the partitions series is then just pd.Series(sorted(glob('part_*')), index=partitions). Simple, consistent (and valid) filenames.
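For illustration, a rough sketch of the remapping idea using only the standard library. The helper names (part_name, write_parts, partition_files) and the six-digit zero padding are assumptions made for this example, not anything in the codebase. The padding matters because sorted() orders filenames lexicographically, so an unpadded part_10 would sort before part_2 and break the pairing with the partition keys.

```python
import os
import tempfile
from glob import glob


def part_name(i, width=6):
    # Zero-pad so that lexicographic order matches numeric order
    # (otherwise 'part_10' sorts before 'part_2').
    return 'part_%0*d' % (width, i)


def write_parts(directory, partitions):
    # Hypothetical helper: create one (empty) file per partition, in order.
    for i, _ in enumerate(partitions):
        with open(os.path.join(directory, part_name(i)), 'w'):
            pass


def partition_files(directory, partitions):
    # Pair each partition key with its file, relying on sorted glob order.
    paths = sorted(glob(os.path.join(directory, 'part_*')))
    return dict(zip(partitions, paths))
```

The dict stands in for the series here; with pandas available, pd.Series(paths, index=partitions) would give the same mapping. The partition keys never appear in a filename, so their content no longer matters to the file system.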
Whoops, just realized this is a problem with columns, not the index. This Stack Overflow question has some ideas, but they're all pretty shaky. I'm not sure there is a rigorous way to do this cross-platform.
Eventually I'd like to remove separate partition files altogether and have just one file per column with an index of byte locations. I suspect that this would improve large read times.
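To make the single-file idea concrete, here is a minimal sketch, assuming each partition is already serialized to bytes and the index is a plain dict of key -> (offset, length). The function names and the in-memory index are hypothetical; a real version would need to persist the index and handle compression and concurrent writers.

```python
import os
import tempfile


def append_block(path, data, index, key):
    # Append one partition's bytes to the column file and record
    # where they landed as (offset, length).
    with open(path, 'ab') as f:
        f.seek(0, os.SEEK_END)  # be explicit about the write position
        offset = f.tell()
        f.write(data)
    index[key] = (offset, len(data))


def read_block(path, index, key):
    # Seek straight to the recorded byte range for one partition.
    offset, length = index[key]
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(length)
```

Reading a whole column then becomes one sequential read of a single file instead of opening a file per partition, which is where the hoped-for speedup on large reads would come from.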
I'm now using a whitelist and hashing. Merging soon if there are no comments.
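Roughly, the whitelist-plus-hash approach could look like the sketch below. This is an illustration under assumptions, not the code being merged: the allowed character set, the 'h_' prefix, the choice of SHA-1, and the name escape_name are all made up for the example.

```python
import hashlib
import string

# Assumed whitelist: characters that are safe in filenames on common platforms.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + '_-')
HASH_PREFIX = 'h_'


def escape_name(name):
    text = str(name)
    is_safe = bool(text) and all(c in SAFE_CHARS for c in text)
    # Names that already start with the hash prefix are hashed too, so a
    # literal column name cannot collide with a generated one.
    if is_safe and not text.startswith(HASH_PREFIX):
        return text
    digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
    return HASH_PREFIX + digest
```

Names made only of whitelisted characters stay readable on disk, and everything else maps to a fixed-length hex name. This sketch does not handle case-insensitive file systems (where 'A' and 'a' would collide) or reserved names such as CON on Windows.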
Fixes #41 cc @enku