add support for adding file path as a new column when read from csv/json etc

Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

https://getdaft.io

Apache License 2.0

2.31k stars, 160 forks

add support for adding file path as a new column when read from csv/json etc #2808

Closed djouallah closed 2 weeks ago

djouallah commented 2 months ago

A common pattern when reading from CSV, JSON, etc. is to add a column to the destination table recording which file each row came from. That way, the next time you add new CSV files, you can skip the files already processed and will not end up with duplicate values. DuckDB and Polars, for example, support this using filename = true.
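To make the pattern concrete, here is a minimal sketch in plain Python using only the standard library. It is not Daft, DuckDB, or Polars code; the helper names `read_csv_with_path` and `load_new_files` and the `filename` column name are assumptions chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_with_path(path):
    """Read one CSV file and tag every row with the file it came from."""
    with open(path, newline="") as f:
        return [{**row, "filename": str(path)} for row in csv.DictReader(f)]


def load_new_files(folder, table):
    """Append rows only from files not already recorded in `table`."""
    seen = {row["filename"] for row in table}
    for path in sorted(Path(folder).glob("*.csv")):
        if str(path) not in seen:
            table.extend(read_csv_with_path(path))
    return table


# Demo on a throwaway directory (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("id,value\n1,x\n2,y\n")
    table = load_new_files(tmp, [])
    first_load = len(table)  # rows from a.csv only

    Path(tmp, "b.csv").write_text("id,value\n3,z\n")
    table = load_new_files(tmp, table)
    second_load = len(table)  # a.csv is skipped, b.csv is appended

    table = load_new_files(tmp, table)
    third_load = len(table)  # nothing new, so no duplicates are added
```

Re-running the load is safe because the `filename` column doubles as a record of what has already been ingested.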

kevinzwang commented 2 months ago

Thanks for raising this! This is something we have been thinking about adding as well.

jaychia commented 1 month ago

@colin-ho could you pick this issue up?

jaychia commented 1 month ago

As an added bonus: if we could figure out that the dataframe is partitioned by filename (if no file splitting was performed) that could be really cool.

This could enable easy and cheap data manipulation such as: df.read_parquet("...", filename=True).groupby("filename").count()

jaychia commented 1 month ago

Example use-case for counting number of distinct rows, grouped by filename:
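A plain-Python sketch of that kind of count, assuming rows that already carry a `filename` column. This is not Daft code; the sample rows and variable names are hypothetical and only illustrate the grouping logic.

```python
from collections import defaultdict

# Hypothetical rows, as if read with a "filename" column attached.
rows = [
    {"filename": "a.csv", "id": 1},
    {"filename": "a.csv", "id": 1},  # duplicate row within the same file
    {"filename": "a.csv", "id": 2},
    {"filename": "b.csv", "id": 1},
]

# Collect the set of distinct rows (ignoring the filename column) per file.
distinct_per_file = defaultdict(set)
for row in rows:
    key = tuple(sorted((k, v) for k, v in row.items() if k != "filename"))
    distinct_per_file[row["filename"]].add(key)

# Number of distinct rows in each file.
distinct_counts = {name: len(keys) for name, keys in distinct_per_file.items()}
```

Each file's rows are handled independently here, which is why an engine that knows the data is already partitioned by filename could run this grouping cheaply.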

djouallah commented 4 weeks ago

Any update on this?

colin-ho commented 4 weeks ago

Hi @djouallah, sorry for the delay. I'm currently finalizing the PR for this and will let you know once it is ready.

colin-ho commented 3 weeks ago

Hey @djouallah, this feature should be ready in the next release!

colin-ho commented 2 weeks ago

This feature is ready in v0.3.9, closing the issue.