Minio buckets/keys structure

Cerfoglg commented 9 years ago

Minio stores files using a key/value structure, organised into a series of partitions (buckets). It's important to define which buckets are going to be created on Minio, and the format of the keys pointing to our files.

Buckets should be named following the DNS-compliant naming rules described in the S3 documentation (http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html). For our benchmarks, we are going to use a series of high-level buckets for our different kinds of data, like "benchmarks".
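As a rough illustration of the naming rules, here is a minimal sketch of a bucket name check. It assumes the usual DNS-compliant constraints (3 to 63 characters, lowercase letters, digits, hyphens and dots, each dot-separated label starting and ending with a letter or digit, not shaped like an IP address). The function name `is_dns_compliant_bucket_name` is hypothetical, and the linked documentation remains the authoritative source.

```python
import re

# One DNS label: lowercase letters, digits and hyphens,
# starting and ending with a letter or digit.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")
# Names shaped like an IPv4 address (e.g. 192.168.0.1) are not allowed.
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_dns_compliant_bucket_name(name: str) -> bool:
    """Return True if `name` looks like a DNS-compliant bucket name (sketch)."""
    if not 3 <= len(name) <= 63:
        return False
    if _IPV4.match(name):
        return False
    # Every dot-separated label must be a valid, non-empty DNS label.
    return all(_LABEL.match(label) for label in name.split("."))
```

With this check, names such as "runs" or "models" pass, while "Benchmarks" (uppercase) or "my..bucket" (empty label) are rejected.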

While Minio and S3 file storage is not a conventional file system, it is general practice for keys to follow the same format as paths in a conventional file system, for example "folder/subfolder/foobar.txt" referring to a file "foobar.txt".

For our Minio file storage, we want keys inside the "runs" bucket to be represented with this format:

hash_value/experiment_id/trial_id/container_name/collector_name/data_name/data

Where the key indicates, in order:

A hash value computed from the rest of the key
The experiment ID
The trial ID
The name of the container where the collector that performed the data collection was running
The name of the collector itself
The name of the data being collected
The name of the file containing the data, or a folder that contains multiple files with the data (example: data = DB_NAME/table_name.csv)
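To make the format concrete, here is a minimal sketch of how such a key could be assembled and split back into its components. The `RunKey` type and `parse_run_key` function are hypothetical names used only for illustration, and the example values are made up. The one detail taken from the description is that the last component (`data`) may itself contain slashes.

```python
from typing import NamedTuple


class RunKey(NamedTuple):
    """Components of a key in the "runs" bucket, in order (hypothetical type)."""
    hash_value: str
    experiment_id: str
    trial_id: str
    container_name: str
    collector_name: str
    data_name: str
    data: str  # a file name, or a folder-like path such as DB_NAME/table_name.csv

    def to_key(self) -> str:
        # Join the seven components with "/" to obtain the object key.
        return "/".join(self)


def parse_run_key(key: str) -> RunKey:
    """Split a key into its components, keeping any "/" inside `data`."""
    # maxsplit=6 yields at most 7 parts; the remainder stays in the last one.
    parts = key.split("/", 6)
    if len(parts) != 7 or not all(parts):
        raise ValueError(f"key does not match the expected format: {key!r}")
    return RunKey(*parts)


# Example with made-up values:
example = parse_run_key("42/exp1/trial1/mysql/stats/cpu/DB_NAME/table_name.csv")
```

Here `example.data` keeps the nested path "DB_NAME/table_name.csv", and `example.to_key()` reproduces the original string.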

The hash value at the start of the string is used to speed up key lookup. By adding a prefix in the form of a hash value computed from the rest of the key (for example, a hash reduced with a modulo operation), we create more unique prefixes and thus reduce the number of characters that need to be compared when performing the lookup. More information here: https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
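A minimal sketch of how such a prefix could be computed follows. The choice of SHA-256, the modulus of 1000, and the function names `hash_prefix` and `with_hash_prefix` are all assumptions made for illustration; the description above does not fix a particular hash function or modulus.

```python
import hashlib


def hash_prefix(rest_of_key: str, modulus: int = 1000) -> str:
    """Derive a short numeric prefix from the rest of the key (sketch).

    SHA-256 and the default modulus are assumptions, not the project's choice.
    """
    digest = hashlib.sha256(rest_of_key.encode("utf-8")).digest()
    # Interpret the digest as an integer and reduce it with a modulo operation.
    return str(int.from_bytes(digest, "big") % modulus)


def with_hash_prefix(rest_of_key: str, modulus: int = 1000) -> str:
    """Prepend the hash prefix to the rest of the key."""
    return f"{hash_prefix(rest_of_key, modulus)}/{rest_of_key}"
```

Because the prefix is derived only from the rest of the key, any service that knows the experiment, trial, container, collector, and data names can recompute the full key without a lookup table.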

VincenzoFerme commented 9 years ago

@Cerfoglg Thank you for the detailed description.

As I recall, we discussed having different buckets for "runs", "models", etc. Why is it different in your description? Have you identified some issues with the solution we discussed?

Cerfoglg commented 9 years ago

I've looked into it a bit more, and buckets don't ultimately matter as much as I thought, so doing this makes organisation easier in the end, without loss in performance. The hashing is what does the trick to speed up lookup.

VincenzoFerme commented 9 years ago

OK, if it does not impact performance, I would stick to having different buckets for a better data arrangement (to differentiate the different types of data we need to access from different services).

Cerfoglg commented 9 years ago

Let us have, instead of a single benchmarks bucket, different buckets to separate our data. For starters, we will certainly have a runs bucket and a models bucket, for instance.

VincenzoFerme commented 9 years ago

@Cerfoglg can you please update the issue description according to the current state?

Cerfoglg commented 8 years ago

@VincenzoFerme Updated with the current Minio key structure.

VincenzoFerme commented 8 years ago

Refer to https://github.com/benchflow/experiments-manager/issues/8#issuecomment-240847959

benchflow / data-transformers

Minio buckets/keys structure #4