farukcankaya opened this issue 3 years ago
| Criterion | Parquet | HDF5 | SQLite | CSV |
|---|---|---|---|---|
| Consistency across different platforms | ? | ✅ | ✅ | ✅ (dialect) |
| Support and documentation | ✅ | ✅ | ✅ | ✅ |
| Read/write speed | ✅ | so-so | ❌ | ❌ |
| Incremental reads/writes | Yes, but not supported by current Python libs | ✅ | ✅ | Yes (but not random access) |
| Supports very large and high-dimensional datasets | ✅ | ✅ | ❌ (limited number of columns per table) | ✅ (storing tensors requires flattening) |
| Simplicity | ✅ | ❌ (basically a full file system) | ✅ (it's a database) | ✅ |
| Metadata support | Only minimal | ✅ | ✅ | ❌ (requires a separate metadata file) |
| Maintenance | Apache project, open and quite active | Closed group, but active community on Jira and at conferences | Run by a company; uses an email list | ✅ |
| Available examples of usage in ML | ✅ | ✅ | ❌ | ✅ |
| Flexibility | Only tabular | Very flexible, maybe too flexible | Relational multi-table | Only tabular |
| Versioning/diff | Only via S3 or Delta Lake | ❌ | ❌ | ✅ |
| Different-length vectors | As blob | ✅ | ❌ ? | ✅ |
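To make the comparison concrete, here is a minimal sketch of writing the same DataFrame to all four formats with pandas. The filenames (`example.csv` etc.) are made up for illustration; Parquet and HDF5 rely on optional engines (pyarrow/fastparquet and PyTables), so the sketch skips them if the engine is missing:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# CSV: plain text, human-readable
df.to_csv('example.csv', index=False)

# SQLite: single-file relational database (sqlite3 is in the standard library)
con = sqlite3.connect('example.db')
df.to_sql('data', con, if_exists='replace', index=False)

# Parquet and HDF5 need optional engines; skip gracefully if not installed
try:
    df.to_parquet('example.parquet')
except ImportError:
    pass
try:
    df.to_hdf('example.h5', key='data', mode='w')
except ImportError:
    pass
```

Note that only the CSV and SQLite paths work with a bare pandas install; the other two are exactly the dependency cost the table's "Simplicity" row is pointing at.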
**DataFrame**

Reading a CSV file into a DataFrame:

```python
df_csv = pd.read_csv('csv_example')
```
If the CSV file was not originally written by a DataFrame, reading it back can produce an extra `Unnamed: 0` column (the old index). To avoid this, save the DataFrame with `index=False`:

```python
df.to_csv('csv_example', index=False)               # write without the index column
df_csv = pd.read_csv('csv_example', header=0)       # first row as header (the default)
df_csv = pd.read_csv('csv_example', header=[1, 2, 5])  # multi-row (MultiIndex) header
df_csv = pd.read_csv('csv_example', header=1)       # use the second row as the header
df_csv = pd.read_csv('csv_example', names=['a', 'b', 'c'])  # supply column names yourself
df_csv = pd.read_csv('csv_example', names=['a', 'b', 'c'], header=1)  # discard the old header row, use custom names
df.to_csv('csv_example', index=False, header=False)  # write without a header row
```
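A quick round trip (using a throwaway `demo.csv`, a hypothetical filename) shows exactly where the `Unnamed: 0` column comes from:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.to_csv('demo.csv')                  # index written as an unnamed first column
back = pd.read_csv('demo.csv')
print(back.columns.tolist())           # ['Unnamed: 0', 'a', 'b']

df.to_csv('demo.csv', index=False)     # drop the index on write
back = pd.read_csv('demo.csv')
print(back.columns.tolist())           # ['a', 'b']
```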
The default delimiter for CSV is the comma (`,`), but it can be changed for both reading and writing with the `sep` parameter:

```python
df_csv = pd.read_csv('csv_example', sep=':')
df.to_csv('csv_example', index=False, sep=':')
```
```python
df_csv.set_index('column_name')                            # returns a new DataFrame indexed by that column
df_csv = pd.read_csv('csv_example', sep=':', index_col=1)      # use the second column as the index
df_csv = pd.read_csv('csv_example', sep=':', index_col=[0, 2])  # MultiIndex from columns 0 and 2
df_csv = pd.read_csv('csv_example', sep=':', nrows=3)          # read only the first 3 rows
```
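`nrows` only limits a single read. For incrementally processing a file too large to fit in memory, `read_csv` also accepts `chunksize` (sketch using a hypothetical `big.csv` written on the fly):

```python
import pandas as pd

pd.DataFrame({'x': range(10)}).to_csv('big.csv', index=False)

total = 0
for chunk in pd.read_csv('big.csv', chunksize=4):  # yields DataFrames of at most 4 rows
    total += chunk['x'].sum()
print(total)  # 45
```

This is the "incremental reads" row of the table in practice: CSV supports it sequentially, but there is no random access to an arbitrary row range.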
By default, empty lines are skipped when reading a CSV file. If you need to keep them (for example, to count empty pages), pass `skip_blank_lines=False`:

```python
df_csv = pd.read_csv('csv_example', skip_blank_lines=False, sep=':')
```
Note: the total number of characters an Excel cell can contain is 32,767. Source: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3#:~:text=218%20characters%20%2D%20This%20includes%20the,xlsx.
Since pickle keeps the DataFrame object as it is and requires less space, we will use pickle to save the preprocessed data.
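A minimal sketch of that plan (the `processed.pkl` path is hypothetical): unlike the CSV round trip, `to_pickle` preserves the index and dtypes exactly.

```python
import pandas as pd

df = pd.DataFrame({'vals': [0.1, 0.2]}, index=['a', 'b'])

df.to_pickle('processed.pkl')            # serialize the DataFrame object as-is
restored = pd.read_pickle('processed.pkl')
assert restored.equals(df)               # index and dtypes survive the round trip
```

The usual caveat applies: pickle files are Python-specific and should only be loaded from trusted sources.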
This came up first when we talked about the character limit of Excel files. It makes sense to dig deeper into it in this issue to find the best option for storing our processed dataset.