apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset #28157

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Currently, dataset writing (e.g. with pyarrow.dataset.write_dataset) uses a fixed filename template ("part{i}.ext"). This means that when you are writing to an existing dataset, you de facto overwrite previous data when using this default template.

There is some discussion in ARROW-10695 about how the user can avoid this by ensuring the file names are unique (the user can specify the basename_template to be something unique). There is also ARROW-7706 about silently doubling data (so not overwriting existing data) with the legacy parquet.write_to_dataset implementation.

It could be good to have a "mode" when writing datasets that controls the different possible behaviours. Erroring when there is pre-existing data in the target directory is maybe the safest default, because both silently appending and silently overwriting can be surprising behaviour depending on your expectations.
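
For illustration, a minimal sketch of that hazard with pyarrow.dataset.write_dataset, assuming an older release where the default behaviour silently reuses the generated file names (the path and table are made up):


import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})

# Both writes use the default basename_template, so they produce the same
# file name and the second write clobbers the first.
ds.write_dataset(table, "/tmp/mydataset", format="parquet")
ds.write_dataset(table, "/tmp/mydataset", format="parquet")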

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-12358. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: As mentioned by @ldacey in ARROW-10695 (and also in the comment on my SO answer), one of the consequences of the current default behaviour is that it will sometimes overwrite and sometimes append data, depending on what files are already present and how many parts you are writing.
It would probably be useful to be able to either fully overwrite or always append.

Taking inspiration for possible "modes" from ARROW-7706:

asfimport commented 3 years ago

Lance Dacey / @ldacey: I think that having an "overwrite" option would satisfy my need for partition_filename_cb (https://issues.apache.org/jira/browse/ARROW-12365) if we can replace all data inside the partition. This would be great for file compaction as well - we could read a dataset with a lot of tiny file fragments and then overwrite it.

Overwriting a specific file is also useful. For example, my basename_template is usually f"{task-id}-{schedule-timestamp}-{file-count}-{i}.parquet". I am able to clear a task and overwrite a file which already exists. The only flaw here is that we cannot control the {i} variable, so I guess it is not guaranteed. I could live without this.

For "append", is it possible for the counter to be per partition instead (potential race conditions if multiple tasks write to the same partition in parallel perhaps, and it seems to be a more demanding step for large datasets..)? Or could the {i} variable optionally be a uuid instead of the fragment count?

"error" makes sense. 

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: cc @westonpace  @bkietz

asfimport commented 3 years ago

Weston Pace / @westonpace: tl;dr: Do what @jorisvandenbossche said and interpret "overwrite" as "overwrite the entire partition".

https://stackoverflow.com/questions/27033823/how-to-overwrite-the-output-directory-in-spark is related (it talks about this issue and how it is handled in Spark). Even reading through all the answers, however, I cannot tell if "overwrite" replaces the entire partition or the entire dataset. It does appear to do one or the other, and not just replace some of a partition. Only replacing some of a partition does not seem like it would ever be useful.

Overwriting the entire table could always be easily achieved without pyarrow by simply removing the dataset beforehand, so I don't see much value in adding that capability. It does bring up the question of repartitioning, which would require deleting the old data as it is read, but I think that is a different topic (and related to the update topic I mention below). Deleting a partition isn't very hard for the user either. The tricky part, though, is knowing which partition to delete.

With that in mind I'd suggest the following:

Overwrite-partition: If the dataset write will write to partition X then delete all data in partition X first.

Append: Same as @jorisvandenbossche mentioned. Similar to how we behave today, but add logic to make sure we never overwrite a file that happens to have the same counter (e.g. detect the max counter value before we start writing and continue from there).

Error: Same as @jorisvandenbossche mentioned.

The overwrite-partition mode is useful for the case of "Load the entire dataset (or an entire partition), modify it, write it back out".

However, I think the use case that is still missing is:

asfimport commented 3 years ago

Lance Dacey / @ldacey: Being able to update and replace specific rows would be very powerful. For my use case, I am basically overwriting the entire partition in order to update a (sometimes tiny) subset of rows. That means I need to read both the existing data for that partition, which was saved previously, and the new data with updated or new rows. Then I need to sort and drop duplicates (I use pandas because there is no simple .drop_duplicates() for a pyarrow table, but adding a pandas step can sometimes complicate things with data types), and finally I need to overwrite the partition (I use partition_filename_cb to guarantee that the final file name for the partition is the same).

asfimport commented 3 years ago

Weston Pace / @westonpace: Looking at this with fresh eyes, the "overwrite mode" feature is fairly different from an "update" feature, so I don't think update-related topics are relevant for this ticket. Update generally (and specifically in @ldacey's case) implies reading and writing to the same set of files. Overwrite-partition mode wouldn't allow for that. Overwrite-partition mode could be useful in some limited circumstances (e.g. somehow someone regenerates an entire new set of data for one or more partitions), but I think those are rare enough, and would be handled by a general "update" feature anyway, that I don't see much benefit in creating a separate feature; the complexity would just confuse users.

So I'll walk back my earlier comment.  I'd now argue that dataset write should only allow "append" and "error" options.

Dataset update could be created as a separate Jira ticket (I'll go ahead and draft one).  Dataset update would mean scanning and rewriting a dataset (or parts thereof).

asfimport commented 3 years ago

Lance Dacey / @ldacey: What is the common workflow pattern for folks trying to imitate something similar to a view in a database?

In many of my sources I have a dataset which is append only (using UUIDs in the basename template), normally partitioned by date. If this data is downloaded frequently or is generated from multiple sources (for example, several endpoints or servers), then each partition might have many files. Most likely there are also different versions of each row (one ID will have a row for each time it was updated, for example).

I then write to a new dataset which is used for reporting and visualization. 

  1. Get the list of files which were saved to the append-only dataset during the most recent schedule
  2. Create a dataset from the list of paths which were just saved and use .get_fragments() and ds._get_partition_keys(fragment.partition_expression) to generate a filter expression (this allows me to query for all of the data in each relevant partition which was recently modified - so if only a single row was modified in the 2021-08-05 partition, then I still need to read all of the other data in that partition in order to finalize it)
  3. Create a dataframe, sort the data and drop duplicates on a primary key, convert back to a table (it would be nice to be able to do this purely in a pyarrow table so I could leave out pandas!) - a sketch of this step follows after this list
  4. Use pq.write_to_dataset() with partition_filename_cb=lambda x: str(x[-1]) + ".parquet" to write to a final dataset. This allows me to overwrite the relevant partitions because the filenames are the same. I can be certain that I only have the latest version of each row.

     

This is my approach to come close to what I would achieve with a view in a database. It works fine, but the storage is essentially doubled since I am maintaining two datasets (append-only and final). Our visualization tool connects directly to these parquet files, so there is some benefit in having fewer files (one per partition instead of potentially hundreds) as well.
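
A sketch of the sort-and-deduplicate step from item 3 above, using the pandas round-trip described there; the column names ("id" as the primary key, "updated_at" as the version column) are illustrative, not taken from the actual pipeline:


import pyarrow as pa

def latest_rows(table: pa.Table) -> pa.Table:
    # Round-trip through pandas: keep only the most recent version of each id.
    df = table.to_pandas()
    df = df.sort_values("updated_at").drop_duplicates(subset="id", keep="last")
    return pa.Table.from_pandas(df, preserve_index=False)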

asfimport commented 3 years ago

Weston Pace / @westonpace: Do you clear your append only dataset after step 4? In other words, is it just a temporary staging area (which stays rather small) or are you wanting to keep the duplicate rows in your base dataset?

So, to check my understanding, I think what you are describing is a materialized view with incremental refresh. Does this sound right?

In other words, the results of a query (in your case the query is sort of a "group by date" where you take the latest row instead of aggregating anything) are saved off (saved off meaning you don't have to recompute the query each time) as a view and you want to update the view when new data arrives but should only have to read the new data when computing the update.

Some thoughts on your current approach...

asfimport commented 3 years ago

Lance Dacey / @ldacey:

I do not clear my append dataset, but I need to add tasks to consolidate the small files someday. If I download a source every hour, I will have a minimum of 24 files in a single daily partition and some of them might be small.

But yes, I am basically describing a materialized view. I cannot rely on an incremental refresh in many cases because I partition data based on the created_at date and not the updated_at date.

Here is an example where the data was all updated today, but there were some rows originally created days or even months ago.


import pyarrow as pa

table = pa.table(
    {
        "date_id": [20210114, 20210811, 20210812, 20210813],  # based on the created_at timestamp
        "created_at": ["2021-01-14 16:45:18", "2021-08-11 15:10:00", "2021-08-12 11:19:26", "2021-08-13 23:01:47"],
        "updated_at": ["2021-08-13 00:04:12", "2021-08-13 02:16:23", "2021-08-13 09:55:44", "2021-08-13 22:36:01"],
        "category": ["cow", "sheep", "dog", "cat"],
        "value": [0, 99, 17, 238],
    }
)

Partitioning this by date_id would save the following files in my "append" dataset. Note that this has one row which is from January, so I cannot do an incremental refresh from the minimum date because it would be too much data in a real world scenario.


written_paths = [
    "dev/test/date_id=20210812/test-20210813114024-2.parquet",
    "dev/test/date_id=20210813/test-20210813114024-3.parquet",
    "dev/test/date_id=20210811/test-20210813114024-1.parquet",
    "dev/test/date_id=20210114/test-20210813114024-0.parquet",
]

During my next task, I create a new dataset from the written_paths above (so a dataset of only the new/changed data). Using .get_fragments() and partition expressions, I ultimately generate a filter expression:


import pyarrow.dataset as ds

fragments = ds.dataset(written_paths, filesystem=fs).get_fragments()
for frag in fragments:
    partitions = ds._get_partition_keys(frag.partition_expression)
# ... other stuff that builds filter_expression from the partition keys

# The resulting filter expression looks like:
# <pyarrow.dataset.Expression is_in(date_id, {value_set=int32:[
#   20210114,
#   20210811,
#   20210812,
#   20210813
# ], skip_nulls=true})>

Finally, I use that filter to query my "append" dataset, which has all of the historical data, so I read all of the data in each affected partition:


df = ds.dataset(source, filesystem=fs).to_table(filter=filter_expression).to_pandas()

Then I convert the table to pandas, sort and drop duplicates, convert back to a table, and save to my "final" dataset with partition_filename_cb to overwrite whatever was there. This means that if even a single row was updated within a partition, I will read all of the data in that partition and recompute the final version of each row. This also requires me to use the "use_legacy_dataset" flag to support overwriting the existing partitions.

I found a custom implementation of drop_duplicates (https://github.com/TomScheffers/pyarrow_ops/blob/main/pyarrow_ops/ops.py) using pyarrow Tables, but I am still just using pandas for now. I keep a close eye on the pyarrow.compute() docs and have been slowly replacing stuff I do with pandas directly in the pyarrow tables, which is great.
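
For what it's worth, some individual steps can already move off pandas; for example, a pure-pyarrow sort (assuming an "updated_at" column, reusing the table from the example above) might look like this:


import pyarrow.compute as pc

# Sort the table by updated_at descending using compute kernels only;
# deduplication itself still needs pandas or a custom helper here.
sorted_table = table.take(
    pc.sort_indices(table, sort_keys=[("updated_at", "descending")])
)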

Your mention of a temporary staging area made me realize that I could replace my messy staging append dataset (many small files) with something temporary that I delete each schedule, and then read from it to create a consolidated historical append-only dataset similar to what I am doing in the example above (one file per partition instead of potentially hundreds).

asfimport commented 3 years ago

Weston Pace / @westonpace: I've added some customization here in https://github.com/apache/arrow/pull/10955 via "existing_data_behavior". This will provide the options...

kError - Raise an error if there are any files or directories in `base_dir` (the new default)
kOverwriteOrIgnore - Existing files will be ignored unless the filename is one of those chosen by the dataset writer, in which case they will be overwritten (the old default)
kDeleteMatchingPartitions - This is similar to the dynamic partition overwrite mode in parquet. The first time a directory is written to, it will delete any existing data.

This was based partially on discussion in ARROW-7706. I think kDeleteMatchingPartitions might simplify your step 4 but I'm not entirely sure. Feedback welcome!
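
On the Python side these options surface as string values of the existing_data_behavior argument used later in this thread; a usage sketch with an illustrative table and path:


import pyarrow as pa
import pyarrow.dataset as ds

#   kError                    -> existing_data_behavior="error"
#   kOverwriteOrIgnore        -> existing_data_behavior="overwrite_or_ignore"
#   kDeleteMatchingPartitions -> existing_data_behavior="delete_matching"
table = pa.table({"date_id": [20210813], "value": [1]})
ds.write_dataset(table, "/tmp/final",
                 partitioning=["date_id"], partitioning_flavor="hive",
                 format="parquet",
                 existing_data_behavior="delete_matching")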

asfimport commented 3 years ago

Lance Dacey / @ldacey: kDeleteMatchingPartitions - So this only deletes the individual partitions and not the entire dataset correct? So if I save a dataset made up of hundreds of partitions but only 4 of them are written to, then only those 4 partitions will have their existing files cleared? If so, then yes that should work for me.

asfimport commented 3 years ago

Weston Pace / @westonpace: Yes, that is correct. It will only delete a partition that will be updated / modified as part of the write_dataset operation.

asfimport commented 2 years ago

Lance Dacey / @ldacey: I was not able to install 6.0.1 until the latest version of turbodbc supported it. Finally have it up and running and I see that the existing_data_behavior argument has been added.

Is this the proper way to use the "delete_matching" feature? When I tried to use it directly, there was a FileNotFoundError (because the base_dir did not exist at all).

EDIT - using the try/except below does not really work. I need to save the dataset with "overwrite_or_ignore" first, then save the dataset again with "delete_matching"; a sketch of that two-step approach follows the code below.


try:
    ds.write_dataset(
        data=table,
        existing_data_behavior="error",
    )
except pa.lib.ArrowInvalid:
    ds.write_dataset(
        data=table,
        ...,
        existing_data_behavior="delete_matching",
    )
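
A sketch of that two-step workaround, with base_dir, table, fs and the partitioning arguments as placeholders:


import pyarrow.dataset as ds

base_dir = "dev/final"  # placeholder path; table and fs come from the surrounding pipeline

# First pass creates base_dir and the partition directories.
ds.write_dataset(data=table, base_dir=base_dir, filesystem=fs,
                 partitioning=["date_id"], partitioning_flavor="hive",
                 format="parquet",
                 existing_data_behavior="overwrite_or_ignore")

# Second pass can now clear and rewrite the matching partitions.
ds.write_dataset(data=table, base_dir=base_dir, filesystem=fs,
                 partitioning=["date_id"], partitioning_flavor="hive",
                 format="parquet",
                 existing_data_behavior="delete_matching")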

I created a dataset using my old method (use_legacy_dataset = True with a partition_filename_cb to overwrite partitions) and the output matched the new "delete_matching" dataset. I believe I can completely retire the use_legacy_dataset code now. Really amazing, thank you.

asfimport commented 2 years ago

Lance Dacey / @ldacey: Any thoughts on "delete_matching" creating the partition if it does not exist already?

asfimport commented 2 years ago

Weston Pace / @westonpace: If "delete_matching" is not creating the base directory or the partition directories in the same way as the other methods then I would consider that a bug. Let's leave this open to fix that.

asfimport commented 2 years ago

Lance Dacey / @ldacey: @westonpace Just wanted to check if this issue with "delete_matching" not creating the partition directory is still on the radar. I am currently using "overwrite_or_ignore", and then writing the same table again with "delete_matching" which is a bit redundant.

asfimport commented 2 years ago

Weston Pace / @westonpace: Thanks for checking in. I did some testing on this today. I might not be understanding what you are after. I just tested with the following:


import shutil

import pyarrow as pa
import pyarrow.dataset as ds

# Make sure the /tmp/newdataset directory does not exist                                                                                                                                                           
shutil.rmtree('/tmp/newdataset', ignore_errors=True)

tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
ds.write_dataset(tab,
                 '/tmp/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior='delete_matching',
                 format='parquet')

I used the 6.0.1 release and did not run into any issues. Am I misunderstanding the use case? Or is it possible you are using a certain filesystem? Or maybe you are on a particular OS?

asfimport commented 2 years ago

Lance Dacey / @ldacey: Ah, so it must be related to the filesystem. I am using adlfs / fsspec to save datasets on Azure Blob:


import pyarrow as pa
import pyarrow.dataset as ds

# fs is an adlfs.spec.AzureBlobFileSystem instance created elsewhere
print(type(fs))
tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior='delete_matching',
                 format='parquet',
                 filesystem=fs)

Output:


<class 'adlfs.spec.AzureBlobFileSystem'>

[2022-01-14 12:45:44,076] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,090] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,093] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,109] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,121] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,124] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_47/3075266795.py in <module>
      4 print(type(fs))
      5 tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
----> 6 ds.write_dataset(data=tab,
      7                  base_dir='/dev/newdataset',
      8                  partitioning_flavor='hive',

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, file_visitor, existing_data_behavior)
    876         scanner = data
    877 
--> 878     _filesystemdataset_write(
    879         scanner, base_dir, basename_template, filesystem, partitioning,
    880         file_options, max_partitions, file_visitor, existing_data_behavior

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_delete_dir_contents()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in delete_dir_contents(self, path)
    357             raise ValueError(
    358                 "delete_dir_contents called on path '", path, "'")
--> 359         self._delete_dir_contents(path)
    360 
    361     def delete_root_dir_contents(self):

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in _delete_dir_contents(self, path)
    347 
    348     def _delete_dir_contents(self, path):
--> 349         for subpath in self.fs.listdir(path, detail=False):
    350             if self.fs.isdir(subpath):
    351                 self.fs.rm(subpath, recursive=True)

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/spec.py in listdir(self, path, detail, **kwargs)
   1221     def listdir(self, path, detail=True, **kwargs):
   1222         """Alias of `AbstractFileSystem.ls`."""
-> 1223         return self.ls(path, detail=detail, **kwargs)
   1224 
   1225     def cp(self, path1, path2, **kwargs):

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in ls(self, path, detail, invalidate_cache, delimiter, return_glob, **kwargs)
    753     ):
    754 
--> 755         files = sync(
    756             self.loop,
    757             self._ls,

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     69         raise FSTimeoutError from return_result
     70     elif isinstance(return_result, BaseException):
---> 71         raise return_result
     72     else:
     73         return return_result

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in _ls(self, path, invalidate_cache, delimiter, return_glob, **kwargs)
    875                     if not finalblobs:
    876                         if not await self._exists(target_path):
--> 877                             raise FileNotFoundError
    878                         return []
    879                     cache[target_path] = finalblobs

FileNotFoundError: 

Do you think I should raise this as an issue on the adlfs project instead?

asfimport commented 2 years ago

Weston Pace / @westonpace: Ah, I think I see. We call something like...


fs.CreateDir(partition_dir);
if (delete_matching) {
  fs.DeleteDirContents(partition_dir);
}

My guess is that ADLFS doesn't handle empty directories very well (I think we have to create an empty file or something when working with S3) so the fs.CreateDir operation is basically a no-op. Then, when we try to do DeleteDirContents it cannot find the directory.

This is a bit of a tricky one. I wonder if we can come up with some kind of workaround.
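
A rough reproduction of that failure at the filesystem layer, using the fsspec wrapper visible in the traceback above (paths are illustrative; fs is the adlfs filesystem from the earlier example):


from pyarrow.fs import PyFileSystem, FSSpecHandler

# On object stores that treat directories as implicit, create_dir is
# effectively a no-op, so the directory "does not exist" when
# delete_dir_contents tries to list it.
pa_fs = PyFileSystem(FSSpecHandler(fs))
pa_fs.create_dir("dev/newdataset/part=0")
pa_fs.delete_dir_contents("dev/newdataset/part=0")  # may raise FileNotFoundError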

asfimport commented 2 years ago

Weston Pace / @westonpace: This (https://github.com/apache/arrow/compare/master...westonpace:feature/ARROW-12358--only-call-delete-contents-if-needed) would be one possible fix but not easily regressible (I suppose I could mock out a filesystem to make sure we don't call delete contents on an empty directory but that seems like a lot of complexity)

This (https://github.com/apache/arrow/compare/master...westonpace:feature/ARROW-12358--pass-delete-contents-if-dir-not-there) would be another possible fix and it's easily regressible but does lead to different filesystems acting differently (although we already have some of this)

@jorisvandenbossche @lidavidm opinions?

asfimport commented 2 years ago

David Li / @lidavidm: Hmm, what about calling DeleteDirContents and swallowing a not found error? Or is the error too difficult to distinguish from other errors?

asfimport commented 2 years ago

Weston Pace / @westonpace: The "not found" error is thrown from python and then I'd be catching it in C++. I'm not sure how well that would work. I don't think we have a specific NotFoundError in C++ so I'd need to examine the message content which is a little icky.

asfimport commented 2 years ago

David Li / @lidavidm: I think we use KeyError for such things (or else we could use StatusDetail to propagate a more detailed error code) but yes, that would be a little icky if we only had the message content.

I would probably prefer the second fix (delete and ignore FileNotFound) or perhaps just specifying that DeleteDirContents ignores not-found errors.

asfimport commented 2 years ago

Weston Pace / @westonpace:

or perhaps just specifying that DeleteDirContents ignores not-found errors.

That would mean adding the behavior to the local filesystem and the native S3 filesystem, which wouldn't be too much trouble. I'll wait and let Joris weigh in.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: We could perhaps add an option to ignore not-found errors.

asfimport commented 2 years ago

Lance Dacey / @ldacey: Is this slated for a fix in 7.0.0? I am writing a dataset using "overwrite_or_ignore" and then writing it again with "delete_matching", since the initial save with "delete_matching" fails (FileNotFoundError).

asfimport commented 2 years ago

Weston Pace / @westonpace: No, this was not fixed as part of 7.0.0 (that build is pretty much finalized). I think we have some consensus here on how to fix it (optionally ignoring not-found errors in DeleteDirContents). I will assign it to myself and expect I will find some time for it in 8.0.0.

asfimport commented 2 years ago

Weston Pace / @westonpace: If anyone wants to grab it in the meantime I don't expect I will be getting to it immediately.

asfimport commented 2 years ago

Lance Dacey / @ldacey: Is this issue sufficient to track this? In the meantime, is there a more efficient way to create the partitions than using "overwrite_or_ignore" and then "delete_matching" if the first attempt failed?

asfimport commented 2 years ago

Lance Dacey / @ldacey: Is this on the radar to be fixed for the next release?

asfimport commented 2 years ago

Weston Pace / @westonpace: Yes, I will tackle this on Friday.

asfimport commented 2 years ago

Weston Pace / @westonpace: I created ARROW-16159 (and a PR) to address the "allow DeleteDirContents to succeed if the dir is not found" issue. Once that is merged in we can test this again.

asfimport commented 2 years ago

Weston Pace / @westonpace: @ldacey now that ARROW-16159 has merged this is probably ready to test again. Are you able to test with the nightly builds? Or do you want to wait for the release?
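
For reference, a sketch of how this could look on the Python filesystem API, assuming the option is exposed as a missing_dir_ok flag on delete_dir_contents:


from pyarrow import fs

local = fs.LocalFileSystem()
# With missing_dir_ok=True the call would succeed even if the directory is
# absent, which is the behaviour the "delete_matching" writer path needs.
local.delete_dir_contents("/tmp/does-not-exist", missing_dir_ok=True)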

asfimport commented 2 years ago

Lance Dacey / @ldacey: Nice, thanks. I can try to test with a nightly build this weekend.

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.