TUM-IDP-WS-20 / doc

find a good file storage option #20

Open farukcankaya opened 3 years ago

farukcankaya commented 3 years ago

This first came up when we talked about the character limit of Excel files. It makes sense to dig deeper in this issue and find the best option for storing our processed dataset.

farukcankaya commented 3 years ago
Formats compared: Parquet, HDF5, SQLite, CSV

Consistency across different platforms: ? ✅ (dialect)
Support and documentation:
Read/write speed: so-so
Incremental reads/writes: Yes, but not supported by current Python libs; Yes (but not random access)
Supports very large and high-dimensional datasets: ❌ (limited nr. columns per table); ✅ Storing tensors requires flattening.
Simplicity: ❌ (basically full file system); ✅ (it's a database)
Metadata support: Only minimal; ❌ (requires separate metadata file)
Maintenance: Apache project, open and quite active; Closed group, but active community on Jira and conferences; Run by a company. Uses an email list.
Available examples of usage in ML:
Flexibility: Only tabular; Very flexible, maybe too flexible; Relational multi-table; Only tabular
Versioning/Diff: Only via S3 or delta lake
Different length vectors: As blob ❌ ?

Source: https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html

farukcankaya commented 3 years ago

Playing with pandas DataFrame in CSV

Read a CSV file into a DataFrame

df_csv = pd.read_csv('csv_example')

If the CSV file was written by DataFrame.to_csv with the default index=True, reading it back produces an extra column named Unnamed: 0 that holds the old index. To avoid this column, save the DataFrame with index=False: df.to_csv('csv_example', index=False)
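
A minimal sketch of the index round trip, assuming a small made-up DataFrame (the column names `page` and `text` are only illustrative) and in-memory strings instead of a real `csv_example` file:

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from our dataset.
df = pd.DataFrame({"page": [1, 2], "text": ["foo", "bar"]})

# Default index=True writes the row index as an unnamed first column.
with_index = df.to_csv()
# index=False leaves the index out of the file.
without_index = df.to_csv(index=False)

cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with_index)     # ['Unnamed: 0', 'page', 'text']
print(cols_without_index)  # ['page', 'text']
```

If the file already exists with the extra column, passing `index_col=0` to `pd.read_csv` should turn that column back into the index instead.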

Playing with header
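
A small sketch of how the header can be controlled on read, assuming a made-up file without a header row (the content and the column names are illustrative):

```python
import io

import pandas as pd

# Hypothetical file content with no header line.
raw = "1,foo\n2,bar\n"

# header=None tells pandas that the first line is data, not column names.
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

# names= supplies column names when the file does not contain any.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=["page", "text"])

print(list(df_no_header.columns))  # [0, 1]
print(list(df_named.columns))      # ['page', 'text']
```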

Changing column names
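
One possible way to change column names after loading, sketched on a made-up DataFrame (all names here are illustrative):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"page": [1, 2], "text": ["foo", "bar"]})

# rename returns a new DataFrame; only the listed columns change.
renamed = df.rename(columns={"text": "content"})

# Assigning to .columns replaces all names at once (the length must match).
df_all = df.copy()
df_all.columns = ["page_number", "page_text"]

print(list(renamed.columns))  # ['page', 'content']
print(list(df_all.columns))   # ['page_number', 'page_text']
```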

Determine delimiter

The default delimiter for CSV is a comma (","). It can be changed when reading or when writing by passing the sep parameter.
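
A short sketch of writing and reading with a non-default delimiter, assuming made-up data where the text itself contains commas:

```python
import io

import pandas as pd

# Hypothetical sample data; note the comma inside the first text value.
df = pd.DataFrame({"page": [1, 2], "text": ["a, b", "c"]})

# Write with ":" as the delimiter instead of the default ",".
colon_csv = df.to_csv(index=False, sep=":")

# The same sep has to be passed when reading the data back.
df_back = pd.read_csv(io.StringIO(colon_csv), sep=":")

print(list(df_back["text"]))  # ['a, b', 'c']
```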

Sorting
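
Sorting could look like this, again on a made-up DataFrame (the `page` column is only an example):

```python
import pandas as pd

# Hypothetical sample data in unsorted order.
df = pd.DataFrame({"page": [3, 1, 2], "text": ["c", "a", "b"]})

# sort_values returns a sorted copy and keeps the original index labels.
sorted_df = df.sort_values("page")

# Descending order, with a fresh 0..n-1 index afterwards.
sorted_desc = df.sort_values("page", ascending=False).reset_index(drop=True)

print(list(sorted_df["page"]))    # [1, 2, 3]
print(list(sorted_desc["page"]))  # [3, 2, 1]
```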

Limit the data that will be loaded
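
A sketch of the `read_csv` parameters that limit what gets loaded, assuming a tiny made-up file (the three columns are illustrative):

```python
import io

import pandas as pd

# Hypothetical file content.
raw = "page,text,lang\n1,foo,en\n2,bar,de\n3,baz,en\n"

# nrows limits how many data rows are read; usecols limits the columns.
head = pd.read_csv(io.StringIO(raw), nrows=2, usecols=["page", "text"])

# chunksize returns an iterator of DataFrames instead of one big frame.
chunk_lengths = [len(chunk) for chunk in pd.read_csv(io.StringIO(raw), chunksize=2)]

print(head.shape)     # (2, 2)
print(chunk_lengths)  # [2, 1]
```

Reading in chunks is probably the most useful of these when the file does not fit into memory.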

Empty lines

By default, read_csv skips empty lines. If you need to keep track of them, for example to count empty pages, pass skip_blank_lines=False so that each empty line is read as a row of NaN values: df_csv = pd.read_csv('csv_example', skip_blank_lines=False, sep=":")
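
A minimal sketch of counting the empty lines, assuming a made-up colon-separated file with one blank line in the middle:

```python
import io

import pandas as pd

# Hypothetical file content; the third line is empty.
raw = "page:text\n1:foo\n\n2:bar\n"

# Default behaviour: the blank line is dropped.
df_default = pd.read_csv(io.StringIO(raw), sep=":")

# With skip_blank_lines=False the blank line becomes a row of NaN values.
df_keep = pd.read_csv(io.StringIO(raw), sep=":", skip_blank_lines=False)

# Rows where every column is NaN correspond to the empty lines.
n_blank = int(df_keep.isna().all(axis=1).sum())

print(len(df_default), len(df_keep), n_blank)  # 2 3 1
```

Note that a row with all values genuinely missing would be counted the same way, so this only works as a blank-line counter if such rows cannot occur in the data.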

farukcankaya commented 3 years ago

Total number of characters that a cell can contain in Excel: 32,767. Source: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3#:~:text=218%20characters%20%2D%20This%20includes%20the,xlsx.
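
To see whether a DataFrame would hit this limit, something like the following could be used. It is only a sketch: the data is made up and the constant name `EXCEL_CELL_CHAR_LIMIT` is my own.

```python
import pandas as pd

# Limit taken from the Excel specification linked above.
EXCEL_CELL_CHAR_LIMIT = 32_767

# Hypothetical sample data; the second text is longer than the limit.
df = pd.DataFrame({"page": [1, 2], "text": ["short", "x" * 40_000]})

# Length of every cell when rendered as a string.
lengths = df.astype(str).apply(lambda col: col.str.len())

# Index labels of rows with at least one cell over the limit.
too_long = lengths > EXCEL_CELL_CHAR_LIMIT
rows_over_limit = df.index[too_long.any(axis=1)].tolist()

print(rows_over_limit)  # [1]
```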

farukcankaya commented 3 years ago

source: https://stackoverflow.com/questions/48770542/what-is-the-difference-between-save-a-pandas-dataframe-to-pickle-and-to-csv

farukcankaya commented 3 years ago

Since pickle keeps the DataFrame object as it is and requires less space, we will use pickle to save the preprocessed data.
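
A small sketch of what "keeps the DataFrame as it is" means in practice, assuming made-up data and temporary files (the file names and columns are illustrative, not our real ones):

```python
import os
import tempfile

import pandas as pd

# Hypothetical preprocessed data with a list column and a datetime column.
df = pd.DataFrame({
    "page": [1, 2],
    "tokens": [["a", "b"], ["c"]],
    "seen": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "preprocessed.pkl")
    csv_path = os.path.join(tmp, "preprocessed.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the Python objects and dtypes; CSV gives back plain text.
print(type(from_pickle.loc[0, "tokens"]))  # <class 'list'>
print(type(from_csv.loc[0, "tokens"]))     # <class 'str'>
print(from_pickle["seen"].dtype == df["seen"].dtype)  # True
```

The size difference depends on the data, so it is worth checking with `os.path.getsize` on the real files. Also keep in mind that pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.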