animetosho / Nyuu

Flexible usenet binary posting tool
215 stars 30 forks

Streaming 7-Zip creation #95

Closed sntran closed 6 months ago

sntran commented 2 years ago

Hi again,

I see that this feature is planned, but am not sure where it is on your roadmap. I actually have a need for this to stream a folder with many files from remote source into a zip file before uploading.

I could create the archive before handing it off to nyuu, but since the files are in remotes, it would require double the disk space.

I can also try to contribute to this feature, but I would like to hear about your vision on how it should be implemented.

animetosho commented 2 years ago

Development on new features is largely stalled at the moment, so you're welcome to have a go if you want. There are some stubs to get you started - essentially you need to create a stream to pass to the uploader (as for regular files), and update the bits above to pass the size along (and probably other things I can't remember).

Otherwise, if you can find a streaming archive client, and can get the exact file size upfront, the procjson feature may be good enough.

sntran commented 2 years ago

Thanks for the first steps! One thing that occurs to me: by adding the input files into an archive stream, we essentially lose the size, as the final size of the archive is unknown, and the uploader won't accept that. That is how the procjson feature works, isn't it? How would we get around that?

animetosho commented 2 years ago

How would we get around that?

You can't. yEnc requires knowing the total size upfront, so working around this isn't possible.

This means you'll need to know the size of the resulting archive upfront, which in turn means you'll almost certainly have to disable compression, as a compressed size can't be predicted beforehand.
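To make that concrete (a sketch of my own, not Nyuu's code - the filename and sizes below are invented): every yEnc part starts with a =ybegin header whose size field is the decoded byte count of the whole file, so it has to be known before the first article goes out:

```javascript
// Build the "=ybegin" line of a multipart yEnc post. The size= field is
// the decoded size of the *entire* file, which is why the total must be
// known before posting starts. Field layout follows the yEnc 1.x draft.
function ybeginHeader({ part, total, line, size, name }) {
  return `=ybegin part=${part} total=${total} line=${line} size=${size} name=${name}`;
}

console.log(ybeginHeader({ part: 1, total: 3, line: 128, size: 1048576, name: 'example.7z' }));
// → =ybegin part=1 total=3 line=128 size=1048576 name=example.7z
```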

sntran commented 2 years ago

zip_size = num_of_files * (30 + 16 + 46) + 2 * total_length_of_filenames + total_size_of_files + 22, for a copy-only archive.

Sounds simple enough :) There are of course differences between archivers, but good thing is that we control which archiver to use, so hopefully it would be a straightforward process.
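For illustration, the formula above sketched in Javascript (my own sketch; real archivers may emit extra fields such as zip64 records or extended timestamps, so treat it as a lower bound rather than a guarantee):

```javascript
// Store-only ZIP size: 30-byte local file header + 16-byte data
// descriptor + 46-byte central directory header per file, the filename
// stored twice (local + central), plus the 22-byte end-of-central-
// directory record.
function storedZipSize(files) {
  // files: array of { name: string, size: number }
  let total = 22; // end-of-central-directory record
  for (const { name, size } of files) {
    const nameLen = Buffer.byteLength(name, 'utf8');
    total += 30 + 16 + 46 + 2 * nameLen + size;
  }
  return total;
}

console.log(storedZipSize([{ name: 'a.txt', size: 100 }, { name: 'b.txt', size: 200 }]));
// → 526
```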

Just to be upfront, I hate archiving, but for such a use case I need a container for all the files instead of posting many 5KB files. Therefore, store-only (level-0) compression is a great choice, both for zipping and for nzbget to handle faster later on.

I'll take a look to see whether it makes sense to add archiving to my tool (which streams remote files to nyuu) or directly to nyuu. Single-responsibility and such :)

sntran commented 2 years ago

Further manual testing shows that yazl follows that formula without compression, while p7zip returns a slightly bigger size with the -m0=Copy flag.

Are you open to adding yazl as another dependency? Or would you prefer handling the archiving directly?

animetosho commented 2 years ago

From a quick glance, that looks like a nice library. Make sure to use the size it reports rather than trying to compute it yourself - even if you control the archiver, the size can change with a different version, so it's best to use their value.

For inclusion in Nyuu, I generally follow a principle of minimal dependencies (to make installation easy), but I don't mind it as an optional dependency. It would be nice if there were a way to use node-yencode's built-in CRC instead of their buffer-crc32 (computing CRC in Javascript doesn't exactly scream performant), but that's just a nice-to-have.

The 7z format has some benefits over ZIP for this use case, mostly a standardised encoding for filenames (which ZIP lacks) and the ability to compress metadata (mostly useful when there are a lot of files).

So overall, sounds good to me.

sntran commented 2 years ago

Make sure to use the size it reports rather than trying to compute it yourself.

Not sure I can do that. The input comes from a stream, and yazl doesn't report the archive size immediately, but we do need that size upfront - at least that's how I understand it working with procjson.

The 7z format has some benefits over ZIP for this use case, mostly standardised encoding of filenames (where ZIP doesn't) and the ability to compress metadata (mostly useful if there's a lot of files).

I would also prefer 7z, but 7za only accepts a single input stream via stdin, which is not enough.

animetosho commented 2 years ago

I haven't tried using the library, but the readme suggests the size should be given upfront, as long as compression is disabled:

If finalSize is -1, it means the final size is too hard to guess before processing the input file data. This will happen if and only if the compress option is true on any call to addFile(), addReadStream(), or addBuffer(), or if addReadStream() is called and the optional size option is not given. In other words, clients should know whether they're going to get a -1 or a real value by looking at how they are using this library.

The call to finalSizeCallback might be delayed if yazl is still waiting for fs.Stats for an addFile() entry. If addFile() was never called, finalSizeCallback will be called during the call to end(). It is not required to start piping data from outputStream before finalSizeCallback is called. finalSizeCallback will be called only once, and only if this is the first call to end().

If it doesn't work that way, perhaps file a bug report or ask the author about it.

sntran commented 2 years ago

Yeah, that finalSizeCallback is only called when you call end(), which needs to be done after adding the inputs.

For what we want to do, I believe we need to know the size before handling any inputs.

I have filed an issue on yazl requesting the functionality.

sntran commented 6 months ago

Hi there,

Sorry for the delay. After a bit of research, I don't think this feature is easy to add, so I'll close this issue for now. Thanks.