Write multiple data frames in parallel

renkun-ken commented 5 years ago

My use case is that I have multiple data frames (each 1-3GB) in which there are several integer columns and 50+ double columns, I need to write them to /dev/shm/ as fast as possible. To make it fast, I need to use compress=0, but in this case the writing is single-threaded.

I'm wondering if it is possible to write multiple data frames into multiple paths in multi-threading way? Do you think it would make the writing much faster than writing sequentially? If it does, it may look like

fst::threads_fst(0)
fst::write_fst(list(df1, df2, df3), c("/dev/shm/file1.fst", "/dev/shm/file2.fst", "/dev/shm/file3.fst"), compress = 0)

MarcusKlik commented 5 years ago

Hi @renkun-ken,

those are very interesting questions!

To start with your first remark, contrary to what you might think, the uncompressed writes are also multithreaded, but the actual work being done is confined to a single memcpy statement (see this code). The reason is that a memcpy ensures that the data is in cache before writing it out to file, speeding up the write times (see also this comment on stackoverflow).

(it might be though that the system monitoring shows very limited core activity because the memcpy is much faster than the single threaded writes to file).

On the second part of your question: I definitely think that writing multiple files simultaneously can be beneficial to the speed for most setups. I have found that the write speed can be lower than the maximum SSD speed because of a bottleneck in getting the data from cache to disk. I believe that this bottleneck could be reduced by using multiple cores to write and, as you say, that can be done by using multiple output files.

With your use case, you're basically fusing 3 independent writes into a single function call, which can definitely help speed up things.

We could also separate the logic of the arguments and allow for multiple datasets and a (possibly different number of) multiple paths. We would then assume that the user wants to fuse the 3 datasets into a single larger dataset. And that single larger dataset can be written to a single file or be spread out over multiple files. That way, you could also have speed gains if you specify only a single dataset but multiple output files, do you think that would be useful?

(are your datasets part of a large dataset?)

renkun-ken commented 5 years ago

Thanks for your detailed reply! My use case is that the datasets are independent (different number of rows and columns). It would be nice if it is made more powerful in your comment.

I'll take a closer look at how it may be implemented and probably implement it for my work in my fork. It would be nice if you like this feature and accept the PR it if it works well for me.

MarcusKlik commented 5 years ago

Yes, for independent datasets we could use the arguments as you propose, the one-to-one relation between input and output makes it clear to the user what's going on. Alternatively, an argument map_in_out = TRUE could be used to specify the one-to-one relation between inputs and outputs.

But the implementation would not be straightforward :-). To get multiple threads writing, we need to split the OpenMP thread pool in one group used for writing (perhaps 2 threads to start with) and one group for processing such as compression / decompression, memcpy etc. (the remaining threads). So we basically need a way to schedule the threads to different chunks in the available datasets (and these changes need to be made in the underlying fstlib library).

For the use case with a single larger (but split) dataset I think the best solution would be to use a higher level of (file system) abstraction:

instead of a file, a single dataset maps to a folder on disk
that folder could contain a single or multiple fst files and an index (which should be of no concern to the user)
a group of such folders constitutes a datastore of multiple datasets

With such a setup, we can easily write multiple files simultaneously for performance boosts, also for single-object datasets. The same holds for reading, using mutiple file handles for reading might speed up things in the same way as with the writing.

And it would be possible to add columns by just adding fst files to the folder linked to the dataset without messing with existing data (and overwriting (the metadata of) existing fst files). Also adding rows could be done by just adding more fst files to the folder.

The folders could also be used to cache temporary files needed to do offline sorting of large datasets (for example by using a merge sort algorithm).

To make such an abstraction possible, we would need the concept of an offline dataset and datastore:


library(fst)

# link to the store
store <- fst_store("path_to_folder")

# retrieve links to offline tables
ft1 <- fst_table(store, "my_table1")
ft2 <- fst_table(store, "my_table2")

# do things with the data
ft1 %>%
  select(columnA, columnB) %>%
  left_join(2, by = "columnA")

or something like that.

fstpackage / fst

Write multiple data frames in parallel #227