brendan-duncan / archive

Dart library to encode and decode various archive and compression formats, such as Zip, Tar, GZip, ZLib, and BZip2.

Push RAM usage limit beyond 2GB in browser environments #291

Closed: androidseb closed this 8 months ago

androidseb commented 11 months ago

What are these changes?

Refactors around the FileHandle class so it can be extended, and removes the path field from classes that don't need it.

The main idea is to allow more control over the InputFileStream and OutputFileStream classes, so that a custom file handler can be passed in.

Along with this abstraction, it introduces the RamFileData class, to be used with InputFileStream and OutputFileStream, which allows using RAM as the stream input/output.

Why?

Why do this, i.e. use ZipDecoder.decodeBuffer(...) with a RAM-backed FileHandle passed to an InputFileStream, instead of simply using ZipDecoder.decodeBytes(...)?

This is the only way I could find to encode and decode large files on the web: in most browsers, the largest Uint8List that can be allocated is limited to 2GB. However, I found that allocating 3GB worth of 1MB arrays is still possible. So if we want to decode a 3GB zip file, we can store it in RAM split across multiple buffers, and read a larger-than-2GB zip file that way.
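
To illustrate the idea, here is a minimal sketch (not the PR's actual code) of holding a large payload as many small typed arrays instead of one contiguous buffer; the 1MB chunk size is an arbitrary choice:

// Illustrative sketch only, not code from this PR: hold a payload larger
// than 2GB as a list of 1MB Uint8List chunks, since a single Uint8List of
// that size cannot be allocated in most browsers.
import 'dart:typed_data';

const int chunkSize = 1024 * 1024; // 1MB per chunk (arbitrary choice)

List<Uint8List> allocateChunks(int totalBytes) {
  final List<Uint8List> chunks = <Uint8List>[];
  for (int offset = 0; offset < totalBytes; offset += chunkSize) {
    final int size =
        offset + chunkSize <= totalBytes ? chunkSize : totalBytes - offset;
    chunks.add(Uint8List(size));
  }
  return chunks;
}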

brendan-duncan commented 11 months ago

Thanks for this work. I'll need some time to look over it, work is hammering me again. To help me out with something I'm not fully understanding from a very brief peek, is this doing something that InputStream (memory based streaming) isn't doing?

androidseb commented 11 months ago

I'll need some time to look over it

No worries, I'll need some time to field-test it extensively to ensure it's fully working; this PR is still in a draft state. I'll post an update here when I deem the code has reached a stable state. I'll also add code samples for Flutter Web to showcase how this can be leveraged.

is this doing something that InputStream (memory based streaming) isn't doing?

Yes indeed. InputStream stores a single List<int> in memory, and Flutter Web's Dart runtime limits any list size to the max value of a 32-bit signed int (about 2 billion; I believe this is due to an underlying JavaScript limitation). This means InputStream's max capacity is limited to about 2GB on the web today. The RamFileData class I introduced aims to achieve exactly the same thing, but using a List<List<int>> instead, and doing the footwork to provide the requested data based on the provided start/end read/write parameters. Each individual chunk read/write is still limited to 2GB, but as long as you use a reasonable buffer size this is completely fine. This pushes the limit for encoding/decoding zip files on the web from 2GB to basically whatever RAM the computer/browser has.
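
To make that footwork concrete, here is a minimal sketch (not the actual RamFileData implementation) of reading a start/end byte range out of a list of fixed-size chunks:

// Minimal sketch, not the actual RamFileData implementation: copy the byte
// range [start, end) out of a list of fixed-size chunks into `target`.
import 'dart:math' as math;
import 'dart:typed_data';

void readRange(List<Uint8List> chunks, int chunkSize, Uint8List target,
    int start, int end) {
  int written = 0;
  int pos = start;
  while (pos < end) {
    final Uint8List chunk = chunks[pos ~/ chunkSize];
    final int offsetInChunk = pos % chunkSize;
    final int count = math.min(end - pos, chunk.length - offsetInChunk);
    // Copy `count` bytes from this chunk into the target buffer.
    target.setRange(written, written + count, chunk, offsetInChunk);
    written += count;
    pos += count;
  }
}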

I have been able to successfully decode a 2.4GB zip file on the web with this method, and I'm currently testing an implementation to confirm it works for encoding a zip file as well. I'm wrestling with the JavaScript function window.showSaveFilePicker, because that's the only way I see to write a file of over 2GB from the web app in multiple chunks. I'm not 100% sure it can be done yet; this is still theoretical...

brendan-duncan commented 11 months ago

That sounds great!

showSaveFilePicker is an experimental API, among other issues.

I have another GitHub project where I force a file download from generated data in JavaScript, if that's what you're trying to do. In https://github.com/brendan-duncan/webgpu_recorder/blob/main/webgpu_recorder.js#L242C14-L242C14, the _downloadFile method creates a link element, creates a data blob, then activates the link, which causes the data blob to download. The data blob takes a list of data. Whether it will work for your case, I don't know.
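
For reference, a rough dart:html equivalent of that JavaScript approach could look like the sketch below; it is an illustration only, not code from either project:

// Rough dart:html sketch of the link + Blob download approach (not code from
// either project): wrap the data in a Blob, point an anchor element at an
// object URL for it, and click the anchor to trigger the download.
import 'dart:html' as html;

void downloadBytes(List<int> data, String fileName) {
  final html.Blob blob = html.Blob(<dynamic>[data]);
  final String url = html.Url.createObjectUrlFromBlob(blob);
  html.AnchorElement(href: url)
    ..download = fileName
    ..click();
  html.Url.revokeObjectUrl(url);
}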

androidseb commented 11 months ago

showSaveFilePicker is an experimental API

Yes indeed, and only a handful of browsers support it: https://developer.mozilla.org/en-US/docs/Web/API/Window/showSaveFilePicker#browser_compatibility

among other issues

Are there other issues you see with it? I'm trying to weigh the downsides of this approach, even though it seems like my only technical option for zipping files larger than 2GB at the moment...

I have another github project where I force a file download from generated data in javascript

Thanks, I checked the code. This is my current way of exporting data as well (before using this PR's code), and it is limited to 2GB too, because of this line:

link.href = URL.createObjectURL(new Blob([data], {type: 'text/html'}));

To create the link's URL, you need to pass the entire payload in the data variable, and if that payload is larger than 2GB, it won't fit in a single List<int>...

The goal I have in mind is to let users choose the showSaveFilePicker option if their browser supports it, knowing that past a certain volume of data this will be the only way to make the export work... Even though the feature is experimental, offering this option is better than nothing when nothing else works. If the browser implementation changes, it will probably be to provide a better way to achieve this, so when that happens I'll update my implementation too; but the JavaScript array size limit is not going away any time soon, I think...

androidseb commented 11 months ago

So, I can confirm that this works. I tested it on the latest version of my beta web app, deployed here: https://www.mapmarker.app/webapp_beta/, and I was able to export a 2.2GB (2,202,755,488 bytes) zip file this way. Even though it's painfully slow, it works.

For reference, here is the code that allows me to save the RamFileData to the disk:

// Imports assumed for this snippet (they were not shown in the original comment);
// RamFileData comes from this PR's changes to the archive package.
import 'dart:async';
import 'dart:html' as html;
import 'dart:js' as js;
import 'dart:math' as math;
import 'dart:typed_data';

/// Calls [methodName] on [calledObj], expects a JS Promise back, and bridges
/// it to a Dart Future that completes with the Promise's result or error.
Future<T> _callJsMethod<T>(js.JsObject calledObj, String methodName, [List<dynamic>? params]) async {
  final Completer<T> completer = Completer<T>();
  final js.JsObject promiseObj = calledObj.callMethod(methodName, <dynamic>[
    if (params != null) ...params,
  ]) as js.JsObject;
  final js.JsObject thenObj = promiseObj.callMethod('then', <dynamic>[
    (
      T promiseResult,
    ) {
      completer.complete(promiseResult);
    }
  ]) as js.JsObject;
  thenObj.callMethod('catch', <dynamic>[
    (
      js.JsObject errorResult,
    ) {
      completer.completeError(
        '_callJsMethod encountered an error calling the method "$methodName": $errorResult',
      );
    }
  ]);
  return completer.future;
}

/// Writes [ramFileData] to disk in 1MB chunks via the File System Access API.
Future<void> _saveRamFileDataAsFile(String fileName, RamFileData ramFileData) async {
  // https://developer.mozilla.org/en-US/docs/Web/API/FileSystemFileHandle
  final js.JsObject fileSystemFileHandle = await _callJsMethod(
    js.context,
    // https://developer.mozilla.org/en-US/docs/Web/API/Window/showSaveFilePicker
    'showSaveFilePicker',
    <dynamic>[
      js.JsObject.jsify(<String, Object>{
        'suggestedName': fileName,
        'writable': true,
      }),
    ],
  );
  // https://developer.mozilla.org/en-US/docs/Web/API/FileSystemWritableFileStream
  final js.JsObject fileSystemWritableFileStream = await _callJsMethod<js.JsObject>(
    fileSystemFileHandle,
    'createWritable',
  );
  const int defaultBufferSize = 1024 * 1024;
  Uint8List? buffer;
  for (int i = 0; i < ramFileData.length; i += defaultBufferSize) {
    final int bufferSize = math.min(defaultBufferSize, ramFileData.length - i);
    if (buffer == null || bufferSize != buffer.length) {
      buffer = Uint8List(bufferSize);
    }
    ramFileData.readIntoSync(buffer, i, i + bufferSize);
    // https://developer.mozilla.org/en-US/docs/Web/API/FileSystemWritableFileStream/write
    await _callJsMethod<void>(
      fileSystemWritableFileStream,
      'write',
      <dynamic>[
        html.Blob(<dynamic>[buffer])
      ],
    );
  }
  await _callJsMethod<void>(
    fileSystemWritableFileStream,
    'close',
  );
}

Out of scope for this PR, and not applicable in today's context given the zip decoder functions are synchronous and don't support async future calls, but dreaming about even better future options: I looked at the Origin Private File System (not experimental, if I understand correctly?), and that could allow large zip file decoding on the web without having to fit everything in RAM.

Getting back to the real world, I will fine-tune and test this implementation, and I'll write up some sample test code that can be easily reused to leverage the RamFileData structure with the archive library.

androidseb commented 11 months ago

I have added a code sample in the example folder, I hope it's the best place to add this.

As far as I can tell, the implemented changes are fully working and covered by tests so I'll mark this PR as "ready" now.

Note: I can't figure out how to fix the failing CI check - it says it's failing on this command:

dart analyze --fatal-infos

but when I run it on my machine it works just fine. I also don't understand the error message:

error - lib/src/io/file_buffer.dart:44:11 - The method 'closeSync' isn't defined for the type 'AbstractFileHandle'. Try correcting the name to the name of an existing method, or defining a method named 'closeSync'. - undefined_method

I'll probably need some assistance with this.

brendan-duncan commented 11 months ago

I haven't forgotten about this, I just haven't had a chance to do any hobby work lately, I'll get to it when I can.

androidseb commented 9 months ago

Hey @brendan-duncan, happy new year :-) Have you had time to look at this yet?

brendan-duncan commented 9 months ago

Oy, thanks for the reminder, I completely forgot about this. And how is it the new year already? It was just October last week!

One of the things I noticed about this PR: you mention wanting to use this in browser environments, but all of the changes are on the IO side of things. archive_io is for dart:io-centric code, which is not supported in browser environments. The abstract classes and implementations that are meant to be used in browser environments and don't rely on dart:io should be moved out into the broader library.

androidseb commented 8 months ago

@brendan-duncan I have done a light refactor to move the newly introduced feature classes outside of IO, but following that logic, a few more classes would need to be refactored and moved outside of IO as well, for example input_file_stream.dart and output_file_stream.dart. I had tried to keep things light and merge-friendly, but those would be breaking changes, and we'd need to increment the major version by one. Is that OK?

brendan-duncan commented 8 months ago

Hmm, that is getting complicated. I really wish I had more time to focus on this; I'm sorry it's dragging out. Family and work have been leaving me with no time to do any personal work for a while. I'm not ready to push a major version: there are a lot more changes I would make if there were to be a major version, and I don't know if I'll be able to get to them. I'll need to figure this out.

androidseb commented 8 months ago

@brendan-duncan I've been thinking about a simpler way to avoid dragging this out for too long while keeping things in the appropriate folders:

What do you think? Should I give that a try?

brendan-duncan commented 8 months ago

Thinking about this some more, I think leaving it as it is will be the better compromise for now. It will give me more incentive to do the big 4.0 refactor. Just address the nit about package vs relative paths for imports, and I can pull this in.