adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.
MIT License

Tar archiving of huge data #323

Open · 4ybaka opened this issue 6 years ago

4ybaka commented 6 years ago

I want to use sharpcompress to create huge tar archives (TBs) without compression, but I can't for the following reasons:

  1. There is no async interface for TarWriter. Is this by design?
  2. There is no interface to write a buffer. Yes, I can create another stream class to carry the data, but there is overhead in converting a buffer (that I already have) into a stream just to unwrap it back into a buffer inside sharpcompress (see the sketch after this list).
  3. Actually, the TarHeader class would be enough for me, but for some reason it is internal. Why?
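For point 2, a minimal sketch of the round trip I mean (WriteBuffer and the entry name are illustrative, not sharpcompress API):

```csharp
using System;
using System.IO;
using SharpCompress.Writers.Tar;

static void WriteBuffer(TarWriter writer, string entryName, byte[] buffer, int count)
{
    // The byte[] the caller already owns must be wrapped in a MemoryStream
    // only so TarWriter.Write can copy it back out into its own buffer.
    using var wrapper = new MemoryStream(buffer, 0, count, writable: false);
    writer.Write(entryName, wrapper, DateTime.UtcNow);
}
```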

Is there any chance that some of these issues will be resolved?

turbolocust commented 6 years ago

You could capture the task and thus solve your first problem. See my solution here: https://github.com/turbolocust/SimpleZIP/blob/master/SimpleZIP_UI/Application/Util/WriterUtils.cs

It can also be done without a child or nested task, but the immediate cancellation of the whole operation isn't as reliable then.
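In rough strokes (a sketch, not the linked WriterUtils code), the idea is to push the blocking write onto a task the caller can await and cancel:

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using SharpCompress.Writers;

static Task WriteAsync(IWriter writer, string entryName, Stream source,
                       CancellationToken token)
{
    return Task.Run(() =>
    {
        // Honor cancellation before the blocking call starts; once
        // writer.Write is running, it can't be interrupted from here.
        token.ThrowIfCancellationRequested();
        writer.Write(entryName, source, null);
    }, token);
}
```

Cancelling mid-entry still depends on the underlying stream, which is the reliability caveat above.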

adamhathcock commented 6 years ago

1. Making everything async is kind of just a matter of search/replace. I'm not 100% sure there's a benefit, but I love async/await, so hey.

2. I'm not 100% sure what you're asking for here.

3. Not sure how this helps.

I honestly don't see what's blocking the creation of large tar files. Maybe a code sample would help.

4ybaka commented 6 years ago

One of the issues with archiving huge amounts of data is resuming the process (after an instance reboot, I/O failure, etc.). At the moment I have the following issues with sharpcompress:

  1. On Dispose, TarWriter will "close" the archive with a double call to PadTo512(0, true);
  2. Providing a sync stream to TarWriter means I have to block on all async operations in the stream (see the adapter sketch after this list).
  3. In my API I have a buffer pool of 4 MB arrays, so transferring 1 GB needs 256 buffers. In practice only about 10-30 are in use at once (they are reused when no longer needed). But sharpcompress will additionally allocate ~13K buffers of 80 KB each.
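Point 2 in concrete form, a sketch of the adapter I'm forced to write (the async sink delegate is a stand-in for whatever storage is underneath):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Write-only Stream over an async sink. TarWriter only calls the sync
// Stream API, so every write has to block on the async upload.
sealed class BlockingSinkStream : Stream
{
    private readonly Func<byte[], int, int, Task> _uploadAsync;
    private long _length;

    public BlockingSinkStream(Func<byte[], int, int, Task> uploadAsync)
        => _uploadAsync = uploadAsync;

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Sync-over-async: blocks a thread-pool thread on every write.
        _uploadAsync(buffer, offset, count).GetAwaiter().GetResult();
        _length += count;
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _length;
    public override long Position { get => _length; set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```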

If the TarHeader class were available outside the library, it would be pretty easy to implement resume logic: if the written data length is more than the header size, just skip the header and part of the content; otherwise, serialize the header and skip part of its content.
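A sketch of that resume logic, assuming TarHeader were public and exposed a Write(Stream) serializer (both are assumptions; ResumeEntry is an illustrative helper):

```csharp
using System.IO;
using SharpCompress.Common.Tar.Headers; // where TarHeader lives today (internal)

static class TarResume
{
    private const int HeaderSize = 512; // a tar header occupies one 512-byte block

    public static void ResumeEntry(Stream archive, TarHeader header,
                                   Stream content, long bytesAlreadyWritten)
    {
        if (bytesAlreadyWritten >= HeaderSize)
        {
            // Header is already on disk: skip it plus the content written so far.
            content.Position = bytesAlreadyWritten - HeaderSize;
        }
        else
        {
            // Failed mid-header: re-serialize it and emit only the missing tail.
            var headerBytes = new MemoryStream(HeaderSize);
            header.Write(headerBytes); // assumed serialization API
            headerBytes.Position = bytesAlreadyWritten;
            headerBytes.CopyTo(archive);
        }
        content.CopyTo(archive); // then continue with the remaining content
    }
}
```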

adamhathcock commented 6 years ago

Now using ArrayPool for Skip/Transfer: https://github.com/adamhathcock/sharpcompress/pull/326

This should help with 3.
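For reference, the general shape of the pattern (a sketch, not the PR's exact code): rent a shared 80 KB buffer for the copy loop and return it afterwards instead of allocating per transfer.

```csharp
using System.Buffers;
using System.IO;

static void Transfer(Stream source, Stream destination)
{
    // Rent from the shared pool instead of newing an 80 KB array each time.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(81920);
    try
    {
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
        }
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
```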

adamhathcock commented 6 years ago

Fix for 1: https://github.com/adamhathcock/sharpcompress/pull/327

I felt like I did it for a reason, though.

adamhathcock commented 6 years ago

I would like to make it async all the way, but that's a bigger PR.

4ybaka commented 6 years ago

@adamhathcock do you have any thoughts regarding the PR?

4ybaka commented 6 years ago

@adamhathcock when do you plan to create a new release? I want to use the new version with the new writer options.