find a good file storage option

farukcankaya commented 3 years ago

This first came up when we discussed the character limit of Excel files. It makes sense to dig deeper in this issue and find the best option for storing our processed dataset.

farukcankaya commented 3 years ago

Exploring dataset formats for OpenML, Mar 23, 2020. The file formats discussed are:
- Arrow / Feather
- Parquet
- SQLite
- HDF5
- CSV

Simple comparison:

| | Parquet | HDF5 | SQLite | CSV |
|---|---|---|---|---|
| Consistency across different platforms | ? | ✅ | ✅ | ✅ (dialect) |
| Support and documentation | ✅ | ✅ | ✅ | ✅ |
| Read/write speed | ✅ | so-so | ❌ | ❌ |
| Incremental reads/writes | Yes, but not supported by current Python libs | ✅ | ✅ | Yes (but not random access) |
| Supports very large and high-dimensional datasets | ✅ | ✅ | ❌ (limited number of columns per table) | ✅ Storing tensors requires flattening. |
| Simplicity | ✅ | ❌ (basically full file system) | ✅ (it's a database) | ✅ |
| Metadata support | Only minimal | ✅ | ✅ | ❌ (requires separate metadata file) |
| Maintenance | Apache project, open and quite active | Closed group, but active community on Jira and conferences | Run by a company. Uses an email list. | ✅ |
| Available examples of usage in ML | ✅ | ✅ | ❌ | ✅ |
| Flexibility | Only tabular | Very flexible, maybe too flexible | Relational multi-table | Only tabular |
| Versioning/Diff | Only via S3 or delta lake | ❌ | ❌ | ✅ |
| Different length vectors | As blob | ✅ | ❌ ? | ✅ |

Source: https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html

farukcankaya commented 3 years ago

Playing with pandas DataFrame in CSV

Read a CSV file into a `DataFrame`

`df_csv = pd.read_csv('csv_example')`
If the CSV file was written by `DataFrame.to_csv` with its default settings, the index was saved as an extra first column without a name, and reading the file back produces a column called `Unnamed: 0`. To avoid this column, save the DataFrame without the index: `df.to_csv('csv_example', index=False)`
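A small sketch of where the `Unnamed: 0` column comes from. The toy DataFrame and the in-memory `io.StringIO` buffers are only for illustration (they stand in for the `csv_example` file); the pandas calls are the ones mentioned above.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path, to_csv returns the CSV text instead of writing a file.
# By default the index is written as an extra first column with an empty name.
with_index = df.to_csv()
# index=False leaves the index out of the output.
without_index = df.to_csv(index=False)

cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with_index)     # ['Unnamed: 0', 'a', 'b']
print(cols_without_index)  # ['a', 'b']
```

If the file already exists and cannot be rewritten, reading it with `index_col=0` should turn that first column back into the index instead.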

Playing with header

By default, the first row is used as the header: `df_csv = pd.read_csv('csv_example', header=0)`
The header can span several rows, which produces multi-level column names: `df_csv = pd.read_csv('csv_example', header=[1,2,5])`
or it can be a single row at a given index, in which case the rows above it are dropped: `df_csv = pd.read_csv('csv_example', header=1)`
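To make the header options more concrete, here is a minimal sketch on made-up CSV text (the contents of `raw_single` and `raw_multi` are assumptions for the example, not our data). It uses `header=[0, 1]` rather than `[1,2,5]` to keep the sample short.

```python
import io

import pandas as pd

# The first line is not part of the table; the real header is on line index 1.
raw_single = "report,2020\nname,score\nalice,1\nbob,2\n"
single = pd.read_csv(io.StringIO(raw_single), header=1)
print(list(single.columns))  # ['name', 'score']
print(len(single))           # 2

# Two header rows are combined into multi-level (MultiIndex) column names.
raw_multi = "left,right\nx,y\n1,2\n3,4\n"
multi = pd.read_csv(io.StringIO(raw_multi), header=[0, 1])
print(list(multi.columns))   # [('left', 'x'), ('right', 'y')]
```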

Changing column names

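`df_csv = pd.read_csv('csv_example', names=['a', 'b', 'c'])`
Note that with `names` alone, pandas assumes the file has no header row, so an existing header line is read as data. To replace an existing header, combine `names` with the header option: `df_csv = pd.read_csv('csv_example', names=['a', 'b', 'c'], header=1)`
Or we can simply leave the header out when exporting to a file: `df.to_csv('csv_example', index=False, header=False)`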
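A quick sketch of how `names` interacts with `header`, again on invented sample text. It uses `header=0` because the sample has its header on the first line.

```python
import io

import pandas as pd

raw = "x,y,z\n1,2,3\n4,5,6\n"

# names only: the file is treated as header-less, so "x,y,z" becomes a data row.
as_data = pd.read_csv(io.StringIO(raw), names=["a", "b", "c"])
print(len(as_data))  # 3

# names + header=0: the existing header line is consumed and replaced.
replaced = pd.read_csv(io.StringIO(raw), names=["a", "b", "c"], header=0)
print(len(replaced))           # 2
print(list(replaced.columns))  # ['a', 'b', 'c']

# Writing without header and index keeps only the values.
no_header_lines = replaced.to_csv(index=False, header=False).splitlines()
print(no_header_lines)  # ['1,2,3', '4,5,6']
```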

Determine delimiter

The default delimiter for CSV is a comma (`,`). However, it can be changed both when reading and when writing.

Reading: `df_csv = pd.read_csv('csv_example', sep=":")`
Writing: `df.to_csv('csv_example', index=False, sep=":")`
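A round-trip sketch for a custom delimiter, using an in-memory string instead of the `csv_example` file. It also shows what happens when the reader is not told about the delimiter.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write with ":" as the delimiter (returns the CSV text since no path is given).
text = df.to_csv(index=False, sep=":")

# Reading with the same delimiter restores the original columns.
roundtrip = pd.read_csv(io.StringIO(text), sep=":")

# Reading with the default "," leaves every line as one single column.
wrong = pd.read_csv(io.StringIO(text))

print(list(roundtrip.columns))  # ['a', 'b']
print(list(wrong.columns))      # ['a:b']
```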

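Setting the index and sorting

Use an existing column as the index: `df_csv = df_csv.set_index('column_name')`. This returns a new DataFrame and does not reorder the rows.
Set the index from a column while reading: `df_csv = pd.read_csv('csv_example', sep=":", index_col=1)`
or from multiple columns while reading: `df_csv = pd.read_csv('csv_example', sep=":", index_col=[0,2])`
Neither option sorts the data. To sort, call `df_csv.sort_index()` or `df_csv.sort_values('column_name')` afterwards.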
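One thing worth checking here: `index_col` and `set_index` only choose which column becomes the index, they do not reorder rows. The sketch below (sample data is made up) shows the difference and uses `sort_index` / `sort_values` for the actual sorting.

```python
import io

import pandas as pd

raw = "id:name:score\n3:c:0.5\n1:a:0.9\n2:b:0.7\n"

# index_col=0 uses the "id" column as the index but keeps the file order.
df_csv = pd.read_csv(io.StringIO(raw), sep=":", index_col=0)
print(list(df_csv.index))  # [3, 1, 2]

# Sorting is a separate, explicit step.
by_index = df_csv.sort_index()
print(list(by_index.index))  # [1, 2, 3]

by_score = df_csv.sort_values("score", ascending=False)
print(by_score["name"].tolist())  # ['a', 'b', 'c']
```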

Limit the data that will be loaded

Load only the first n rows: `df_csv = pd.read_csv('csv_example', sep=":", nrows=3)`
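A small sketch of `nrows` on sample text. As a related option that is not discussed above, `chunksize` reads the file piece by piece, which may help if the processed dataset does not fit in memory.

```python
import io

import pandas as pd

raw = "a:b\n1:2\n3:4\n5:6\n7:8\n9:10\n"

# Only the first three data rows are parsed.
head = pd.read_csv(io.StringIO(raw), sep=":", nrows=3)
print(len(head))  # 3

# chunksize returns an iterator of DataFrames with at most two rows each.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(raw), sep=":", chunksize=2)]
print(chunk_sizes)  # [2, 2, 1]
```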

Empty lines

By default, pandas skips empty lines when reading a CSV file. If the empty lines matter, for example to count empty pages, keep them by setting `skip_blank_lines=False`; each empty line is then read as a row of NaN values: `df_csv = pd.read_csv('csv_example', skip_blank_lines=False, sep=":")`
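A sketch of counting blank lines this way. The sample text has one empty line between two data rows; counting all-NaN rows is my assumption of how the "empty pages" count could be done, and it would also count rows that are genuinely empty in every column.

```python
import io

import pandas as pd

raw = "a:b\n1:2\n\n3:4\n"

# Default behaviour: the blank line is dropped.
skipped = pd.read_csv(io.StringIO(raw), sep=":")
print(len(skipped))  # 2

# skip_blank_lines=False keeps it as a row where every value is NaN.
kept = pd.read_csv(io.StringIO(raw), sep=":", skip_blank_lines=False)
print(len(kept))  # 3

blank_rows = int(kept.isna().all(axis=1).sum())
print(blank_rows)  # 1
```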

farukcankaya commented 3 years ago

Total number of characters that a cell can contain in Excel: 32,767. Source: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3#:~:text=218%20characters%20%2D%20This%20includes%20the,xlsx.
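To see whether a DataFrame would hit this limit, something like the following could be used. This is only a sketch: `cells_over_excel_limit` and the sample data are hypothetical, and it measures the string form of each cell, which is an approximation of what Excel would store.

```python
import pandas as pd

# Limit taken from the Excel specifications page linked above.
EXCEL_MAX_CELL_CHARS = 32767


def cells_over_excel_limit(df):
    """Return, per column, how many cells are longer than the Excel cell limit."""
    lengths = df.astype(str).apply(lambda col: col.str.len())
    return lengths.gt(EXCEL_MAX_CELL_CHARS).sum().to_dict()


df = pd.DataFrame({"id": [1, 2], "text": ["short", "x" * 40000]})
over_limit = cells_over_excel_limit(df)
print(over_limit)
```

Here the `text` column has one cell that is too long for Excel, while `id` has none.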

farukcankaya commented 3 years ago

csv
- ✅human readable
- ✅cross platform
- ⛔slower
- ⛔more disk space
- ⛔doesn't preserve types in some cases
pickle
- ✅fast saving/loading
- ✅less disk space
- ⛔non human readable
- ⛔python only
Also take a look at parquet format (to_parquet, read_parquet)
- ✅fast saving/loading
- ✅less disk space than pickle
- ✅supported by many platforms
- ⛔non human readable

source: https://stackoverflow.com/questions/48770542/what-is-the-difference-between-save-a-pandas-dataframe-to-pickle-and-to-csv

farukcankaya commented 3 years ago

Since pickle keeps the DataFrame object as it is and requires less space, we will use pickle to save the preprocessed data.
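For reference, a minimal sketch of what "keeps the DataFrame as it is" means in practice. The column names and the file name `preprocessed.pkl` are made up for the example; the point is that datetime and list-valued columns survive a pickle round trip but come back as plain text from CSV.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2020-03-23", "2020-03-24"]),
    "tokens": [["a", "b"], ["c"]],
})

# Pickle round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# CSV round trip through an in-memory string.
from_csv = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(from_pickle["when"].dtype == df["when"].dtype)  # True
print(from_pickle["tokens"].tolist())                 # [['a', 'b'], ['c']]
print(type(from_csv["tokens"].iloc[0]))               # <class 'str'>
```

One caveat to keep in mind: pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.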
