This PR reduces the time needed to save the results to a VCF or CSV (e.g. Regenie format) file.
For the CSV format, instead of relying on pandas.DataFrame.to_csv, we can use numpy.ravel to flatten the dataframe and then build the output string in one pass with Python string formatting.
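The idea can be sketched as follows. This is a minimal illustration, not the code in the PR: the helper name dataframe_to_delimited and the column names in the usage note are hypothetical, and it assumes the values format acceptably with a plain "{}" placeholder.

```python
import numpy as np
import pandas as pd


def dataframe_to_delimited(df: pd.DataFrame, sep: str = " ") -> str:
    """Render a DataFrame as delimited text with a single format call.

    Hypothetical helper for illustration only.
    """
    n_rows, n_cols = df.shape
    # Template for one row, e.g. "{} {} {}\n".
    row_template = sep.join(["{}"] * n_cols) + "\n"
    # Object dtype keeps mixed column types; ravel flattens row by row (C order).
    flat = np.ravel(df.to_numpy(dtype=object))
    header = sep.join(map(str, df.columns)) + "\n"
    # Repeat the row template once per row and fill every value at once.
    return header + (row_template * n_rows).format(*flat)
```

For example, a frame with columns CHROM, ID and BETA yields a header line followed by one space-separated line per row; the resulting string can then be written with gzip.open(path, "wt", compresslevel=1).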
As a side note, in case the proposed change is not of interest, I would suggest changing the default compression level used by pandas.DataFrame.to_csv. For gzip it defaults to the highest compression level (9), which can save a few MB but increases the saving time a lot. Although the user can already set it manually through the to_csvargs parameter of the io_to_formats._to_format method (e.g. to_csvargs = {'compression': {'method': 'gzip', 'compresslevel': 1, 'mtime': 1}}), I would suggest using a lower level by default (even 1) and only changing it if the user specifies a different level.
Note that my implementation hard-codes the level to 1, with no way to change it. For a fair comparison, the performance test also ran your implementation at level 1.
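A lower default that the user can still override could look roughly like this. It is only a sketch: the function name write_csv_with_fast_default and the DEFAULT_COMPRESSION constant are hypothetical, and how to_csvargs is actually merged inside io_to_formats._to_format is an assumption. The compression dictionary itself is the form that pandas.DataFrame.to_csv accepts.

```python
import pandas as pd

# Hypothetical default: fastest gzip level, fixed mtime for reproducible files.
DEFAULT_COMPRESSION = {"method": "gzip", "compresslevel": 1, "mtime": 1}


def write_csv_with_fast_default(df: pd.DataFrame, path: str, to_csvargs=None) -> None:
    """Write df with gzip level 1 unless the caller passes its own settings."""
    kwargs = {"sep": "\t", "index": False, "compression": DEFAULT_COMPRESSION}
    # Anything the user supplies (including "compression") overrides the defaults.
    kwargs.update(to_csvargs or {})
    df.to_csv(path, **kwargs)
```

A caller who prefers smaller files can still pass to_csvargs={'compression': {'method': 'gzip', 'compresslevel': 9}}.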
As for the VCF format, the idea is much the same. Your current implementation already writes the file using Python string formatting. However, formatting and writing one dataframe row at a time is very slow (about 100 minutes in my tests). Formatting the whole dataframe in a single call is much faster (a few seconds).
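The difference between the two approaches can be illustrated like this. Both functions are hypothetical stand-ins rather than the project's code, and the sketch assumes the eight fixed VCF columns are already present in the dataframe as formatted values. They produce identical text; only the number of format calls differs.

```python
import numpy as np
import pandas as pd

VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
ROW_TEMPLATE = "\t".join(["{}"] * len(VCF_COLUMNS)) + "\n"


def vcf_body_row_by_row(df: pd.DataFrame) -> str:
    # Slow pattern: one format call per dataframe row.
    rows = df[VCF_COLUMNS].itertuples(index=False, name=None)
    return "".join(ROW_TEMPLATE.format(*row) for row in rows)


def vcf_body_single_pass(df: pd.DataFrame) -> str:
    # Fast pattern: flatten all values and format them in one call.
    flat = np.ravel(df[VCF_COLUMNS].to_numpy(dtype=object))
    return (ROW_TEMPLATE * len(df)).format(*flat)
```

The VCF header lines would be written separately before this body.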