blaze / castra

Partitioned storage system based on blosc. **No longer actively maintained.**
BSD 3-Clause "New" or "Revised" License

escape directory names #42

Closed mrocklin closed 9 years ago

mrocklin commented 9 years ago

Fixes #41 cc @enku

mrocklin commented 9 years ago

This could be improved with a more rigorous system to escape names for the file system. Does anyone know of a common solution here?

jcrist commented 9 years ago

One of my planned improvements was to remap the names to a running counter (something like part_1). Since the partitions are stored in meta, the partitions series is just pd.Series(sorted(glob('part_*')), index=partitions). Simple, consistent (and valid) filenames.
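A rough sketch of that remapping idea, using only the standard library. The helper names part_name and partition_files are hypothetical and not part of castra. The names are zero-padded here (an assumption beyond the comment above) so that sorting the glob results lexically matches numeric order; with unpadded names, part_10 would sort before part_2.

```python
import os
import tempfile
from glob import glob


def part_name(i, width=6):
    # Hypothetical helper: zero-pad so lexical order equals numeric order.
    return f"part_{i:0{width}d}"


def partition_files(directory, partitions):
    """Map each partition label to its part_* name, in sorted order."""
    names = sorted(os.path.basename(p)
                   for p in glob(os.path.join(directory, "part_*")))
    if len(names) != len(partitions):
        raise ValueError("partition count does not match files on disk")
    return dict(zip(partitions, names))


# Usage: labels that would be awkward as directory names.
partitions = ["2015/01/01", "2015/01/02", "con"]
with tempfile.TemporaryDirectory() as root:
    for i in range(len(partitions)):
        os.mkdir(os.path.join(root, part_name(i)))
    mapping = partition_files(root, partitions)
```

With pandas available, the same names could be wrapped as pd.Series(names, index=partitions), as in the comment above.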

jcrist commented 9 years ago

Whoops, just realized this is a problem with columns, not the index. This Stack Overflow thread has some ideas, but they're all pretty shaky. I'm not sure there is a rigorous way to do this cross-platform.

mrocklin commented 9 years ago

Eventually I'd like to remove separate partition files altogether and have just one file per column with an index of byte locations. I suspect that this would improve large read times.
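To make the single-file idea concrete, here is a minimal sketch under stated assumptions: each partition is an opaque bytes payload (for example a compressed chunk), and the index is a plain list of (offset, length) pairs. The functions append_partition and read_partition are hypothetical names, not castra API.

```python
import os
import tempfile


def append_partition(path, payload, index):
    """Append one partition's bytes to the column file and record where it went."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)   # be explicit about starting at the end
        offset = f.tell()
        f.write(payload)
    index.append((offset, len(payload)))


def read_partition(path, index, i):
    """Read partition i with a single seek, using the byte-location index."""
    offset, length = index[i]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Usage: two partitions of one column stored in a single file.
index = []
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "price.col")
    append_partition(path, b"abc", index)
    append_partition(path, b"defgh", index)
    second = read_partition(path, index, 1)
```

Reading a range of adjacent partitions then becomes one seek and one contiguous read rather than opening many small files, which is the kind of saving the comment above speculates about.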

mrocklin commented 9 years ago

I'm now using a whitelist and hashing. Merging soon if there are no comments.
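For reference, one way a whitelist-plus-hash scheme could look. This is a sketch, not the code merged in this PR: escape_name, SAFE_CHARS, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib
import string

# Assumed whitelist: characters that are safe in file names on common platforms.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "_-.")


def escape_name(name):
    """Return name unchanged if it is whitelisted, otherwise a hex digest of it."""
    text = str(name)
    if text and text not in (".", "..") and all(c in SAFE_CHARS for c in text):
        return text
    # Hex digests use only 0-9 and a-f, so the result is always a valid name.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Usage
plain = escape_name("price")      # whitelisted, kept as-is
hashed = escape_name("a/b")       # contains '/', replaced by a digest
```

The hash is deterministic, so the same column name always maps to the same directory, but it is not reversible, so the original names still need to be stored in metadata. This sketch does not address case-insensitive file systems or reserved device names such as CON on Windows.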