deeptools / pyBigWig

A python extension for quick access to bigWig and bigBed files
MIT License

Write bigwig to BytesIO object #113

Closed: joachimwolff closed this issue 3 years ago

joachimwolff commented 3 years ago

Hi Devon,

I have to write many bigwig files to disk, and I thought it would be better to write a single tar.gz file instead of many individual bigwigs. My idea is to write the bigwig data to a BytesIO object and add that to the tar.gz. I have already implemented a similar solution successfully for text files.

import io
import tarfile
import time

import pyBigWig

with tarfile.open(args.outFileName, "w:gz") as tar:
    for i, file_content in enumerate(thread_data):
        for j, file_list in enumerate(file_content):

            # Test: writing a regular bigwig straight to disk works.
            # bw = pyBigWig.open(file_name_list[i][j], 'w')
            # bw.addHeader(file_list[0])
            # bw.addEntries(file_list[1], file_list[2], ends=file_list[3], values=file_list[4])
            # bw.close()

            tar_info = tarfile.TarInfo(name=file_name_list[i][j])
            tar_info.mtime = time.time()
            file = io.BytesIO()
            bw = pyBigWig.open(file, 'w')  # <-- this line raises the error
            bw.addHeader(file_list[0])
            bw.addEntries(file_list[1], file_list[2], ends=file_list[3], values=file_list[4])
            bw.close()
            tar_info.size = file.getbuffer().nbytes
            file.seek(0)  # rewind so addfile() reads from the start
            tar.addfile(tarinfo=tar_info, fileobj=file)

I now get an error saying that pyBigWig can't open the BytesIO object for writing:

Traceback (most recent call last):
  File "/home/wolffj/miniconda3/envs/hicexplorer_py38/bin/chicExportData", line 7, in <module>
    main()
  File "/home/wolffj/miniconda3/envs/hicexplorer_py38/lib/python3.8/site-packages/hicexplorer/chicExportData.py", line 442, in main
    bw = pyBigWig.open(file, 'w')
RuntimeError: Received an error during file opening!

Is it in general not possible to use BytesIO for this, or am I doing something obviously wrong? Any help is appreciated.

Best,

Joachim

ghuls commented 3 years ago

pyBigWig probably needs to be able to seek in the stream, e.g. to update the header. A gzip stream is not seekable while being written. It might work with a plain tar file, as that should be seekable.

You could also try a zip file with compression=ZIP_STORED: https://docs.python.org/3/library/zipfile.html. This stores the bigWig files uncompressed (they have compression of their own, so it makes no sense to compress them again). The advantage of a zip file is that it has a file/dir index, so listing its files is faster than with a tar, where you have to read through the whole archive to find all file names.
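For illustration, a minimal sketch of the ZIP_STORED idea; the file and archive names here are placeholders, and the bigWigs are assumed to already exist on disk:

import zipfile

bigwig_paths = ["sample1.bw", "sample2.bw"]  # hypothetical inputs

# ZIP_STORED skips recompression; bigWig data is already compressed internally.
with zipfile.ZipFile("output.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    for path in bigwig_paths:
        zf.write(path)

# The central directory makes listing cheap, unlike scanning a whole tar.
with zipfile.ZipFile("output.zip") as zf:
    print(zf.namelist())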

joachimwolff commented 3 years ago

Thanks for your reply.

pyBigWig probably needs to be able to seek in the stream, e.g. to update the header. A gzip stream is not seekable while being written. It might work with a plain tar file, as that should be seekable.

I am writing to a BytesIO object; the compression and the tar itself only matter in the line tar.addfile(tarinfo=tar_info, fileobj=file), and the error happens earlier than that. The problem seems to be that pyBigWig expects a real file, which is exactly what I am trying to avoid. I don't want to write thousands of bigwig files to disk, merge them into a tar, and then delete the bigwigs from disk again.

dpryan79 commented 3 years ago

The file name passed to open() is passed directly to C in:

fopen(fname, mode);

So it has to be an actual string name of a file, rather than relying on some pythonic abstraction. Perhaps you can use mmap to come up with a solution (maybe there's a memory-mapped path mechanism), but I doubt it would be a simple process.
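Not part of pyBigWig's API, but one workaround in this direction is a RAM-backed path: a sketch assuming a Linux tmpfs mount such as /dev/shm, where fopen() still gets a real file name while the bytes stay in memory. The header and entry values are hypothetical.

import os
import tempfile

import pyBigWig

# /dev/shm is a RAM-backed tmpfs on most Linux systems; fall back to the
# default temp directory if it is absent.
shm_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None

# Create a real path for pyBigWig; keep the file after closing the handle.
with tempfile.NamedTemporaryFile(suffix=".bw", dir=shm_dir, delete=False) as tmp:
    path = tmp.name

bw = pyBigWig.open(path, "w")
bw.addHeader([("chr1", 1000)])
bw.addEntries(["chr1"], [0], ends=[100], values=[1.0])
bw.close()

with open(path, "rb") as fh:
    data = fh.read()  # in-memory bytes, ready for tarfile/zipfile
os.remove(path)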

joachimwolff commented 3 years ago

I see, thanks @dpryan79.

To document my solution: I create a temporary directory, write the bigwig files into it, and add the directory to a tar archive. Afterward, I delete the temporary directory.
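A minimal sketch of that workaround; build_bigwig(), the header/entry values, and the archive name are placeholders for illustration:

import os
import tarfile
import tempfile

import pyBigWig

def build_bigwig(path, header, chroms, starts, ends, values):
    # pyBigWig needs a real path, so write each bigWig inside the temp dir.
    bw = pyBigWig.open(path, "w")
    bw.addHeader(header)
    bw.addEntries(chroms, starts, ends=ends, values=values)
    bw.close()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.bw")
    build_bigwig(path, [("chr1", 1000)], ["chr1"], [0], [100], [1.0])
    with tarfile.open("bigwigs.tar.gz", "w:gz") as tar:
        # arcname keeps the temporary directory out of the archive paths.
        tar.add(path, arcname=os.path.basename(path))
# Leaving the with-block deletes the temporary directory and the bigwigs in it.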