Closed: djouallah closed this issue 2 weeks ago
Thanks for raising this! This has been something we have been thinking about adding as well.
@colin-ho could you pick this issue up?
As an added bonus: if we could figure out that the dataframe is partitioned by filename (if no file splitting was performed) that could be really cool.
This could enable easy and cheap data manipulation. Example use case, counting the number of rows grouped by filename:
df.read_parquet("...", filename=True).groupby("filename").count()
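A minimal sketch of that workflow, assuming the filename=True argument and "filename" column name proposed in this thread (the released parameter name and the bucket path are illustrative and may differ):

```python
import daft

# Read every Parquet file under a prefix and expose the source file path
# as a regular "filename" column. The filename=True flag follows the
# proposal in this thread; the shipped parameter name may differ.
df = daft.read_parquet("s3://my-bucket/landing/*.parquet", filename=True)

# Row counts per source file, e.g. to spot files that have already been loaded.
df.groupby("filename").count().show()
```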
Any update on this?
Hi @djouallah, sorry for the delay. I'm currently finalizing the PR for this and will let you know once it is ready.
Hey @djouallah, this feature should be ready in the next release!
This feature is ready in v0.3.9, closing the issue.
A common pattern when reading from CSV, JSON, etc. is to add a column to the destination table recording the files that have already been processed, so that the next time you load new CSV files you do not end up with duplicate values. DuckDB and Polars, for example, support this using filename = true.
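For reference, a sketch of that incremental-load pattern using DuckDB's filename = true option from Python; the database file, glob pattern, and destination table name are illustrative, and the destination table is assumed to already exist with a matching schema that includes the filename column:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # illustrative database file

# Load only CSV files that have not been seen before, using the extra
# "filename" column emitted by read_csv_auto(..., filename = true).
# Keeping that column in the destination table makes it double as a load log.
con.execute("""
    INSERT INTO destination
    SELECT *
    FROM read_csv_auto('landing/*.csv', filename = true)
    WHERE filename NOT IN (SELECT DISTINCT filename FROM destination)
""")
```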