WICG / storage-foundation-api-explainer

Explainer showcasing a new web storage API, NativeIO
Apache License 2.0

How does this proposal relate to other filesystem web APIs? #4

Open othermaciej opened 3 years ago

othermaciej commented 3 years ago

How does this API relate to File System Access API, File API and File and Directory Entries API? Those three technologies seem to relate to and integrate with each other to various extents, but this Explainer does not appear to be integrated with them at all.

It would be regrettable if the web platform ended up with multiple disjoint ways of accessing the filesystem, where FileHandle and FileSystemHandle are totally unrelated objects.

guest271314 commented 3 years ago

Those three technologies seem to relate to and integrate with each other to various extents, but this Explainer does not appear to be integrated with them at all.

File System Access API (Native File System) provides a means to read and write files on the local filesystem, from any origin. That specification also describes a method to write only to a sandboxed "origin" filesystem https://wicg.github.io/file-system-access/#sandboxed-filesystem. I have used the method locally for testing, though sparingly for experiments compared to using the File System Access API in conjunction with inotify-tools, outside the scope of the specification yet within its capabilities, to execute arbitrary shell scripts and run native applications https://github.com/guest271314/requestNativeScripts.
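
A minimal sketch of writing into that sandboxed origin filesystem, assuming the navigator.storage.getDirectory() entry point and an arbitrary file name:

const root = await navigator.storage.getDirectory();            // OPFS root, no permission prompt
const handle = await root.getFileHandle('notes.txt', { create: true });
const writable = await handle.createWritable();
await writable.write('persisted in the origin-private file system');
await writable.close();                                         // contents committed on close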

File API is used in HTML <input type="file">, among other APIs, including FormData; Drag and Drop; Clipboard API; et al.

File and Directory Entries API is suited to iterating directories and multiple files uploaded via <input type="file"> with and without the webkitdirectory, allowdirs, and multiple attributes set https://stackoverflow.com/q/39664662, not to writing files, though technically a FileList could be programmatically created and populated with arbitrary File objects using DataTransfer https://stackoverflow.com/questions/47119426/how-to-set-file-objects-and-length-property-at-filelist-object-where-the-files-a.

The FileSystem API also still exists https://stackoverflow.com/questions/37502091/how-to-use-webkitrequestfilesystem-at-file-protocol and is in use in code in GitHub repositories.

The plain language of this explainer isolates access to an "origin", "sandboxed" in some form of "storage". That model resembles the FileSystem API.

Note: While user agents will typically implement this by persisting the contents of this origin private file system to disk, it is not intended that the contents are easily user accessible.

The technologies relate to the extent that File and Blob objects are exposed to the user for data storage in memory and on disk, not necessarily to the original intent of every API that happens to use Blob, File, or FileList in its algorithms.

An API's use of "file", or formally File or Blob, does not mean that API is the same in intent or purpose as the File API, File System Access API, or File and Directory Entries API.

This explainer begins with the premise that some users expect a specification to be developed and implemented that isolates their data in a "sandbox", "not intended that the contents are easily user accessible", which is a reasonable use case and the clear intent of this explainer.

Ultimately the File or Blob, or whatever other data storage technique is employed, whether "sandboxed" or not, is written to the user's memory and/or hard disk https://stackoverflow.com/a/56419176.

Note, though not specified, it is also possible to read and write files stored in the "sandboxed" origin (the browser configuration folder) at the command line https://stackoverflow.com/questions/36098129/how-to-write-in-file-user-directory-using-javascript/36098618#36098618.

File System Access API begins with the premise that the users themselves will grant permission to read and write directly to their own filesystem, un-"sandboxed". That is the distinction.
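
A short sketch of that distinction using the File System Access API surface (the file read is whatever the user picks):

// Un-"sandboxed": the user explicitly grants access through a picker.
const [userHandle] = await window.showOpenFilePicker();
// "Sandboxed": the origin private file system, no prompt, not user-visible.
const opfsRoot = await navigator.storage.getDirectory();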

guest271314 commented 3 years ago

Note: read and write currently take a SharedArrayBuffer as a parameter. This is done to highlight the fact that it might be possible to observe changes to the buffer as the browser processes it. The implications of this, and the possibility of using simpler ArrayBuffers, are being discussed.

is interesting.

Testing at 32-bit showed there is a limit in Chromium to how much a SharedArrayBuffer from WebAssembly.Memory.grow() can actually grow https://bugs.chromium.org/p/v8/issues/detail?id=7881.
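
Hypothetical usage following the quoted note; the nativeIO name and exact signatures here are assumptions drawn from this explainer's examples, not a standardized API:

const file = await nativeIO.open('test_file');   // opens/creates a file in the origin's pool
const buffer = new SharedArrayBuffer(1024);
await file.write(new Uint8Array(buffer), 0);     // write at offset 0
// Per the quoted note, changes to the shared buffer may be observable mid-read.
const bytesRead = await file.read(new Uint8Array(buffer), 0);
await file.close();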

File System Access API does not currently support reading a file while the file is being written, as a stream https://bugs.chromium.org/p/chromium/issues/detail?id=1084880, without reading the entire file and checking its size https://github.com/guest271314/captureSystemAudio#stream-file-being-written-at-local-filesystem-to-mediasource-capture-as-mediastream-record-with-mediarecorder-in-real-time, where ideally we should be able to do

        start.onclick = async e => {
          class AudioWorkletProcessor {}
          class AudioWorkletNativeFileStream extends AudioWorkletProcessor {
            constructor(options) {
              super(options);
              this.byteSize = 512 * 344 * 60 * 50;
              this.memory = new Uint8Array(this.byteSize);
              Object.assign(this, options.processorOptions);
              this.port.onmessage = this.appendBuffers.bind(this);
            }
         // ..
         try {
            start.disabled = true;
            controller = new AbortController();
            signal = controller.signal;
            const { body: readable } = await fetch(
              'http://localhost:8000?start=true',
              {
                cache: 'no-store',
                signal,
              }
            );
            aw.port.postMessage(readable, [readable]);
          } catch (e) {
            console.warn(e);
          } finally {

          }
        };
        stop.onclick = e => {
          controller.abort();
          start.disabled = false;
        };

<?php 
  if (isset($_GET["start"])) {
    header("Access-Control-Allow-Origin: *");
    header("Content-Type: application/octet-stream");
    echo passthru("parec -v --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor");
    exit();
  }

on a File being written to, substituting something like FileSystemFileHandle.readable for fetch(). The use of SharedArrayBuffer is also a distinction between this explainer and the APIs listed in the OP, for users expecting a "sandbox" storage specification and implementation.

These are just observations.

I found this repository this time while experimenting with means of communication between localhost and any origin, presently with QuicTransport, to determine if there is any simpler way to run shell scripts and native applications from JavaScript in the browser and get the results as a file, other than the ways I have already achieved, namely using File System Access API and inotify-tools. The origin-isolation mandate appears to exclude this API from that capability set, without investing time into isolating where in the user data directory Chromium stores the data.

othermaciej commented 3 years ago

This explainer begins with the premise that some users expect a specifcation to be developed and implemented to isolate their data, in a "sandbox", "not intended that the contents are easily user accessible" which is a reasonable use case, and clear intent of this explainer. ... File System Access API begins with the premise that the user themselves will provide permission to read and write directly to their own filesystem, un-"sandboxed". That is the distinction.

I believe this claim is incorrect. File System Access API offers access to an origin private filesystem in addition to providing ways for the user to grant access to portions of the native filesystem. Does this API provide access to the same virtual per-origin filesystem or a different one? The explainer is silent on this, but from the Chrome implementation bug it seems like it uses the same underlying storage.

Even if the claim were correct, it would still be wrong to have different and totally incompatible APIs for real files and sandboxed virtual files. That imposes needless complexity on developers using the web platform.

guest271314 commented 3 years ago

@othermaciej FWIW I tend to agree with your general analysis. This could be incorporated into File System Access API, even the interesting experimental usage of SharedArrayBuffer. Presently I have no use case to test this API. I can just write files directly to the filesystem, without concerning myself with "origin" or "isolation", or searching through ~/.config/chromium to determine where the data is being stored. Chromium has already shipped the API. That does not mean other browsers need to follow just because Chromium did so.

I would much prefer a single File System Access API in Firefox than for Firefox to take time trying to decide what NativeIO is and comparing it with File System Access API, particularly given that Mozilla recently announced changes to their operational structure. File System Access API in general provides coverage for both sandboxed and un-sandboxed file system reads and writes.

othermaciej commented 3 years ago

I agree that it would be better to fold in any new capabilities here to File System Access API. If there's a need for this proposal to be worked on separately before it can be merged, then monkey patching File System Access API would be a better temporary measure.

guest271314 commented 3 years ago

These experiments https://github.com/fivedots/nativeio-porting-tutorial, https://github.com/fivedots/emfs, when expanded for given use cases, can be useful towards developing the capability to establish persistent watching of a file or directory for events (read, write, modify) https://bugs.chromium.org/p/chromium/issues/detail?id=1019297 in JavaScript, and to read a FileSystemFileHandle as the file is being written to by a non-web application, without the need to read the entire file multiple times, similar to or the same as I am able to stream data to the browser using fetch() and PHP passthru() (I am currently working on a QuicTransport() version). What is possible to achieve in proofs-of-concept at the above repositories, relevant to complete control over filesystem read, write, stream, events and notifications, et al., should be possible to implement not only within the scope of the private origin filesystem, but also at the native filesystem via File System Access API.

Monkey patches can lead to unintended consequences that can remain in spite of an initial temporary intent; for example, there is still code in the wild that uses URL.createObjectURL(mediaStream) and otherwise works as intended https://github.com/kazuki/video-codec.js/pull/12. File was temporarily transferable, then no longer transferable https://github.com/w3c/FileAPI/issues/32. APIs should be conceived of as having the capacity to expand beyond initial use cases, without necessitating the creation of an entirely different specification that is then merged into the open-mindedly conceived specification well-equipped to absorb additional sections of content into the overall document. I have been banned from several organizations that I attempted to contribute to or join, so take my input at face value; I just compose and test code in the field to the point it breaks.

guest271314 commented 3 years ago

@othermaciej

I agree that it would be better to fold in any new capabilities here to File System Access API.

What do the principal parties think about that idea?

fivedots commented 3 years ago

I'd like to thank both of you for the feedback and discussion, it's been very helpful!

We are currently talking with the owners of the File System Access API to explore the possibilities. I'll update this issue soon with the conclusions that come out of that.

othermaciej commented 3 years ago

Thanks for the consideration, @fivedots. I'm pretty confident this strategy can work. If there are tricky design problems with it, I'd be happy to help brainstorm solutions. The web will be much better off if we have just one API that offers virtual sandboxed filesystem access.

fivedots commented 3 years ago

Hello, I wanted to give you an update. We are currently talking with the Chrome storage team to explore the possibilities. Our main fear is that by merging with the File System Access API, we will compromise the goals of either the Origin Private File System or NativeIO. In particular, there is a risk that the high-level concepts used in File System Access API might bind us to slower performance. We are considering adding a special set of functions to the Origin Private File System, but the risk there is that, by breaking symmetry with File System Access API, we will end up with a less coherent interface and more cognitive load on developers.

That being said, we understand why simplifying the platform by merging (if we can do it without strongly impacting use cases) would lead to a better result. We are working on benchmarks that should shed some light on the compromises that would have to be made; after we have that data we can figure out which trade-off is the better one. My proposal is that we temporarily pause this discussion; I will ping it again when we have more data.

Also thanks for the offer @othermaciej, it would be great to collaborate. Maybe we can discuss it after the benchmarks are done!

fivedots commented 3 years ago

Hello, I want to give you another short update:

We have made progress on creating a couple of benchmarks that should help inform our decision, the next steps are to make them easily reproducible and publish the results. We are still in quite close contact with the storage team, and the question of how we relate to Filesystem Access API is still at the top of our list.

We've had to pause our efforts on this to shift our focus to meeting a couple of important deadlines that are coming up. Still, I wanted to assure you we remember this issue and that we want to properly resolve it as soon as we get some spare cycles back.

I also wanted to point out that we've asked for an official position from WebKit here.

I'll keep you posted as things develop!

othermaciej commented 3 years ago

I'm not sure why benchmark results are relevant. If an implementation of this new API is faster than using the existing Filesystem Access API, that would not be a reason to create a new, wholly separate file API. Instead, it would suggest that (a) the implementation of Filesystem Access API needs to be optimized; (b) if needed, additional API surface should be added to Filesystem Access API to enable better efficiency; or (c) both. I do not see how any benchmark result would show that it is a good idea to create a completely disjoint notion of a file handle with its own new API, one that cannot even be converted back and forth to the existing kind.

guest271314 commented 3 years ago

What is missing from File System Access API are:

- Capability to read and write only a portion of a file, without reading the entire file
- Capability to write a file without the creation of temporary files

Of interest re "benchmarks": I do not know if Storage Foundation API has been approached by the Chrome "security team" to internally scan files written by the user with proprietary "Google Safe Browsing" https://safebrowsing.google.com/ - which is not disclosed in the File System Access API specification whatsoever https://wicg.github.io/file-system-access/ - and which would certainly explain why File System Access API is slower than Storage Foundation API: the file is written twice.

The two list items above are alone criteria enough for an API that performs those two procedures without undue and undisclosed restrictions, as those are serious impairments that will continue to be a problem for users in the field, who have to keep filing bugs to, perhaps, eventually, get the truth behind why File System Access API behaves the way it does now https://bugs.chromium.org/p/chromium/issues/detail?id=1168715#c17.

RReverser commented 3 years ago

Capability to read and write only a portion of a file, without reading the entire file

You can certainly do both of those operations already. You can read slices from the file, as well as write to a given position.
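
A hedged sketch of both operations with the File System Access API (offsets are illustrative; keepExistingData prevents the writable from starting out empty):

// Partial read: slice() narrows the window; bytes are only fetched by arrayBuffer().
const file = await fileHandle.getFile();
const chunk = await file.slice(1024, 2048).arrayBuffer();

// Positioned write: target a byte offset within the existing contents.
const writable = await fileHandle.createWritable({ keepExistingData: true });
await writable.write({ type: 'write', position: 1024, data: chunk });
await writable.close();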

guest271314 commented 3 years ago

You can certainly do both of those operations already. You can read slices from the file, as well as write to a given position.

Not without reading the entire file into memory. If you refute that fact, kindly post a minimal, complete, verifiable example of that procedure here.

And evidently proprietary "Google Safe Browsing" algorithms are being baked into the process https://wicg.github.io/file-system-access/#malware-scans-and-safe-browsing-checks, which means in the worst-case scenario Google is reading and analyzing every byte written and read, and in the best-case scenario "virus" and "malware" "protections" will still fail when the "virus" or "malware" is not already known to the algorithms https://security.stackexchange.com/a/202306.

guest271314 commented 3 years ago

Capability to read and write only a portion of a file, without reading the entire file

You can certainly do both of those operations already. You can read slices from the file, as well as write to a given position.

If that were true and correct, I would not have to read the entire file here https://github.com/guest271314/captureSystemAudio/blob/master/native_messaging/file_stream/app/captureSystemAudio.js#L159.

kaizhu256 commented 3 years ago

You can certainly do both of those operations already. You can read slices from the file, as well as write to a given position.

Is there a pathway to integrate slicing and appending with wasm-sqlite? I would prefer it if that's possible, but would settle for Storage Foundation API if it's not.

guest271314 commented 3 years ago

@kaizhu256 I do not know what wasm-sqlite does. When I tried to read part of a file using File System Access API (Native File System), that was simply not possible. In brief, see https://bugs.chromium.org/p/chromium/issues/detail?id=1084880. Reiterated in https://bugs.chromium.org/p/chromium/issues/detail?id=1168715, which reveals that the Google "security team" is internally pushing their algorithms into the file writing and reading process https://bugs.chromium.org/p/chromium/issues/detail?id=1168715#c17

The problem here is indeed that our security team really wants us to perform safe browsing analysis of written files, before the files are available with their normal file name/extension (it is also for this reason that downloads are written to a temporary file, and only renamed once safe browsing checks pass).

If anything, that should be user opt-in, not compulsory. What happens when I turn off proprietary "Google Safe Browsing" in Chrome settings? The user does not know. Temporary files are still written when "Google Safe Browsing" is disabled, and scanning is the only reason given for writing temporary files, so the "Google Safe Browsing" algorithms are apparently still not turned off even though I manually turned that setting off. Or, if "Google Safe Browsing" really is turned off for File System Access API when I turn it off for the entire browser, then temporary files are being written for no reason, because I wanted no part of "Google Safe Browsing" and signalled that by turning off the setting.

ddumont commented 3 years ago

I would really like some of the concepts you talk about here to make it into the actual File System Access API. The shortcomings you mention above are a complete failure of that API and need to be solved, not side-stepped.

We are still in quite close contact with the storage team, and the question of how we relate to Filesystem Access API is still at the top of our list.

@fivedots I am very curious to know through what channels you are in close contact with the storage team. I would very much like to be able to speak with them about some of the issues linked here (one is a bug I opened).

I appreciate the effort involved in this project. I really do hope it pushes the Chrome storage team to make their API better, because it's not very useful right now.

jimmywarting commented 3 years ago

Capability to read and write only a portion of a file is possible, even truncating/setLength:

// Reading part of the file
await (await fileHandle.getFile()).slice(0, 100).arrayBuffer()

// Writing part of the file (keepExistingData preserves the untouched bytes)
var writable = await fileHandle.createWritable({ keepExistingData: true })
await writable.seek(offset)   // move the cursor to the target offset first
await writable.write(data)    // or: await writable.write({ type: 'write', position: offset, data })
await writable.close()

But the performance penalty of Chrome's atomic copy-modify-replace, done in the name of being more secure, needs to be addressed, by bringing back/in some "inPlace" option; that is a no-brainer for me. I think they should at least not scan the private sandboxed storage.

guest271314 commented 3 years ago

Capability to read and write only a portion of a file is possible.

Not really

await fileHandle.getFile()

reads the entire file before slice() is executed.

guest271314 commented 3 years ago

I think they should at least not scan the private sandboxed storage.

When Google Safe Browsing is turned off manually by the user, file scanning should be turned off. That is not the case. File scanning should be opt-in. Currently there is no way that I am aware of to opt out.

jimmywarting commented 3 years ago

From what I understood from Chrome's security team, they will only copy(read)-modify-replace when writing data.

fileHandle.getFile() doesn't read the entire file into memory or anything like that; it just creates a blob reference pointing to somewhere on the disk. Blobs can act "somewhat" like a fileHandle too: when you slice, you create a new blob with the same underlying reference, just with another offset and size property.

guest271314 commented 3 years ago

fileHandle.getFile() don't read the entire file into memory or anything like that

How do you substantiate that claim with tests? As far as I can tell there is no way (currently) to read only part of a File or Blob using any existing API.

jimmywarting commented 3 years ago

How do substantiate that claim with tests?

Create an <input type="file" multiple>, select a few GiB of files, and concatenate them all into one large blob or file:

blob = new Blob([...input.files]) // now you have a 20 GiB+ blob and it took just a few ms

Observe the memory used: still very small. Now slice this large blob into thirds and take the middle piece; observe the memory used, still very small.

Everything you keep in memory is only a pointer to some other place, with a size + offset combo of where it should read the data from... it can be of multiple sources.
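
A sketch of that experiment (assumes an <input type="file" multiple> element named input; performance.memory is non-standard, Chrome-only, and used here only for a rough observation):

input.onchange = () => {
  const blob = new Blob([...input.files]);           // concatenation completes in milliseconds
  const third = Math.floor(blob.size / 3);
  const middle = blob.slice(third, 2 * third);       // middle piece, still no I/O
  console.log(blob.size, middle.size);               // multi-GiB sizes
  console.log(performance.memory?.usedJSHeapSize);   // heap stays small: blobs are references
};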

I have observed Chrome's blob System Design and read their presentation of the blob internal structure.

I helped fetch-blob with this issue to create a new kind of blob structure that follows the spec more closely by operating on blob parts rather than a byte source (one single large ArrayBuffer), so blobs can be read lazily. The spec says: For each element in parts: "If element is a Blob, append the bytes it represents to bytes." This sentence is not quite clear, but it does not mean copying over all the data. This made it possible for me to wrap a file path with an offset + size into what Chrome describes as a blobDataItem in their system design, without holding anything in memory. Now Node.js is experimenting with adding a Blob class into core that can also be backed by pointers to some place on the disk.

I do not know who mkruisselbrink is, but he states that Chrome is only stat-ing the file to get its size and last-modified date when calling handle.getFile() https://github.com/WICG/file-system-access/issues/101#issuecomment-612103114

jimmywarting commented 3 years ago

If you take any file from a file input, you will not have a snapshot of that file; if you modify the file, the last-modified date will change, and reading the blob afterwards will throw an out-of-sync error of some sort.

guest271314 commented 3 years ago

<input type="file" multiple> is a special case that actually does get the file from disk. That is not always the case. And, again, there is no way to get around the fact that getFile() is being called first, which reads the entire file (wherever those BlobParts might be); see "Where is Blob binary data stored?".

jimmywarting commented 3 years ago

Okay, I can see your skepticism, so I conducted a test https://jsfiddle.net/c2epLz0n/1

Failing to read the File instance after it has been changed is an indication, to me at least, that the underlying data has been modified, so reading the blob should fail. To me this sounds like a blob only holds a pointer reference to some place on the disk, with a size + offset (and lastModified) of where to read the data from. If the data were held in memory, or were some kind of snapshot, then it would have been fine to read the blob, but it wasn't.

Slicing the File, or constructing a new blob from that part after it was modified, also fails to read the data.
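
A minimal sketch of that invalidation behavior (not the linked fiddle verbatim; the exact exception name may vary by browser):

const file = await fileHandle.getFile();  // records size + lastModified, reads no bytes
// ...the underlying file is then modified on disk by another program...
try {
  await file.text();                      // the lazy read happens here, and fails
} catch (err) {
  console.log(err.name);                  // e.g. "NotReadableError": snapshot is out of sync
}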

So my claim still holds: handle.getFile() doesn't read the entire file, it only stats some information about the file. I could not see any memory spike being raised.


Try reading a very large file that would take up to a few seconds to read in its entirety with C or some other native code, then measure how long it takes to get a File instance from handle.getFile(): it will take roughly the same time it takes to get some metadata information about a file.
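
A quick way to run that check (console.time is coarse, but enough to separate metadata-level cost from size-level cost):

console.time('getFile');
const bigFile = await fileHandle.getFile();  // expected: milliseconds, even for a multi-GiB file
console.timeEnd('getFile');

console.time('arrayBuffer');
await bigFile.arrayBuffer();                 // expected: time scales with the file size
console.timeEnd('arrayBuffer');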

mkruisselbrink commented 3 years ago

This seems somewhat off-topic for this issue, but handle.getFile() indeed is not supposed to read the entire file; it merely stats the file to figure out the last-modified timestamp. I believe there are some edge cases with particular Chrome OS file system backends where calling getFile() will end up reading the entire "file", because that is the only way to determine, for example, the size of the file, but those edge cases should be pretty rare and shouldn't generally affect things.

guest271314 commented 3 years ago

Try reading a very large file that would take up to a few sec to read the entire file with c or some other native code. then measure how fast it takes to get a File instances from handle.getFile() it will roughly take the same time it takes to get some metadata information about a file

I did, here https://github.com/guest271314/captureSystemAudio/blob/master/native_messaging/file_stream/app/captureSystemAudio.js#L153 through to https://github.com/guest271314/captureSystemAudio/blob/master/native_messaging/file_stream/app/captureSystemAudio.js#L203, in pertinent part

          async function* fileStream() {
            while (true) {
              let fileHandle, fileBit, buffer;
                // if exception not thrown slice file from readOffset, handle exceptions
                // https://bugs.chromium.org/p/chromium/issues/detail?id=1084880
                // TODO: stream file being written at local filesystem
                // without reading entire file at each iteration before slice
                fileHandle = await dir.getFileHandle('output', {
                  create: false,
                });
                fileBit = await fileHandle.getFile();
                if (fileBit) {
                  const slice = fileBit.slice(readOffset);
                  if (slice.size === 0 && done) {
                    break;
                  }

Per your analysis

                fileHandle = await dir.getFileHandle('output', {
                  create: false,
                });
                fileBit = await fileHandle.getFile();

does not get and read the entire file (wherever the FileBits might be: actual disk via <input type="file">, TypedArrays, etc.); the file is only read at

                  const slice = fileBit.slice(readOffset);

Is that an accurate description of what you observe?

Can what chrome://blob-internals prints for the construction of Blobs and Files be ignored?

How exactly can a file be sliced into arbitrary parts without reading the entire file?

The reading and writing do not maintain the same rate (time to complete each read/write) throughout the example 1-hour process.

This seems somewhat off-topic for this issue, but handle.getFile() indeed is not supposed to read the entire file

From the perspective here, reading and writing to the filesystem, both the time and resources required and what the algorithm of the implementation actually does, are the topic; otherwise Storage Foundation API would not be distinguishable from other APIs. To some extent that is what is being suggested. There are several stakeholders and various use cases; for some reason users, specification authors, and implementers have found this issue.

I proposed as a Fugu feature a means to create a WebAssembly.Memory instance (one that can grow()) that can be written to from JavaScript and read from (and written to) in a native language https://bugs.chromium.org/p/chromium/issues/detail?id=1115640, where the user can write to the same memory using a TransformStream, where there is no concept of temporary data stored somewhere, and there is one chance to read and write before the stream is used or closed.
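
For reference, the growable shared memory itself is standard WebAssembly JS API; the proposed part is the native-side access, which is not shown here:

const memory = new WebAssembly.Memory({ initial: 1, maximum: 128, shared: true });
let view = new Uint8Array(memory.buffer);  // memory.buffer is a SharedArrayBuffer
memory.grow(1);                            // adds one 64 KiB page
view = new Uint8Array(memory.buffer);      // re-acquire the buffer to see the new pages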

I then began experimenting with WebTransport, which is suited to achieving that goal. I created a very basic file writer, or download shell script, to avoid creating temporary files, which without question are being created; for the reading part, I need more evidence to conclude that getFile() does not get the entire file before slice() is called.

#!/bin/bash
echo "$1" | cat > "$2"
echo 'ok'

async function download(filename = 'download') {
  const url = `quic-transport://localhost:4433/download?filename=${filename}`;
  try {
    const transport = new WebTransport(url);
    console.log(transport);
    transport.onstatechange = async (e) => {
      console.log(e);
    };
    await transport.ready;
    transport.closed
      .then((reason) => {
        console.log('Connection closed normally.', { reason });
      })
      .catch((e) => {
        console.error(e.message);
        console.trace();
      });
    const sender = await transport.createUnidirectionalStream();
    let n = 0;
    const rs = new ReadableStream({
      async pull(c) {
        if (n === 1499) {
          c.close();
        }
        if (n === -1) {
          if (rs.locked && sender.writable.locked) {
            // c.close();
            sender.abortWriting({ errorCode: 0, reason: 'Download aborted' });
          }
        }
        if (rs.locked && sender.writable.locked) {
          console.log({ n });
          c.enqueue(n + '\n');
          await new Promise((r) =>
            setTimeout((_) => {
              ++n;
              r();
            }, 20)
          );
        } else {
          console.log('should be unlocked', n, sender.writable, rs.locked);
        }
      },
      cancel(reason) {
        console.log(`ReadableStream cancellation reason: ${reason.message}`);
        transport.close({ errorCode: 0, reason: 'Aborted downloading' });
      },
    });

    const input = await rs
      .pipeThrough(new TextEncoderStream())
      .pipeTo(sender.writable)
      .catch(async (e) => {
        console.log('writingAborted:', await sender.writingAborted);
        console.log(
          'incomingUnidirectionalStreams',
          await transport.incomingUnidirectionalStreams.cancel(e.message)
        );
        throw e;
      });
    console.log(input);
    const reader = transport.incomingUnidirectionalStreams.getReader();
    console.log({ transport, sender, reader });
    const result = await reader.read();
    if (result.done) {
      console.log(result);
      return;
    }
    let stream = result.value;
    console.log({ stream });
    const { readable } = stream;
    console.log(
      await readable.pipeThrough(new TextDecoderStream()).getReader().read()
    );
    console.log({ reader, transport });
    await reader.cancel();
    transport.close({ closeInfo: { reason: 'Done downloading' } });
    console.log(await reader.closed, await stream.readingAborted);
  } catch (e) {
    console.warn(e.message);
    throw e;
  }
}

download('download').then(console.log).catch(console.warn);

jimmywarting commented 3 years ago

...That was a long comment

Just a tip

L153-L203 is a better way to link to a piece of code (click on the first line, hold shift, and click on the last line; you can then right-click on the selection to get a permalink with the selected lines) 😉

It's also nice if folks add syntax coloring to their snippets.

file is only read at

const slice = fileBit.slice(readOffset);

Is that an accurate description of what you observe?

Nah, when you slice a file you merely create a new blob that has the same reference to the original file with another size + offset (it's like a new blob that is instructed to read from the original blob from a to b). If it had to read the file when you slice a blob, then that would be a huge main-thread blocker, and slice() would not be synchronous; it would have to be async in that case.

The data is only read when you call blob.text(), blob.arrayBuffer(), blob.stream(), or if you use FileReader or something else that will read the blob, like fetch...

How exactly can a file be sliced into arbitrary parts without reading the entire file?

Like I have said 2 or 3 times already, internally it will clone the blob and only change the size and offset properties. It is as if you are still using the same blob; the sliced blob version will instruct the internal stream reader to read the original blob parts from point a to b instead.


If you wish to learn more deeply, with source code, about how blobs work in the browser: an example of this can be found in fetch-blob, particularly in this PR: https://github.com/node-fetch/fetch-blob/pull/44/files#diff-e727e4bdf3657fd1d798edcd6b099d6e092f8573cba266154583a746bba0f346R134-R166 or this revision https://github.com/node-fetch/fetch-blob/blob/aa801ae2f9d0f1af6e47e1613e976353bac41b10/index.js#L134-L170

(I used the term span instead of offset, since that was used before the change was made to operate on pre-existing blob parts instead.)

The new blob will have the same underlying blob parts; the only thing that changes is the new size and offset. This is roughly how the browser does it too (not many developers know how blobs work under the hood; if they did, they would have a better understanding of them and use them in a "better" way).

The so-called BlobDataItem equivalent in Chrome's design, something that can be a Blob but not hold any data, is this file: https://github.com/node-fetch/fetch-blob/blob/master/from.js ... It wraps a file path into a blob backed by the file system. It's one of the internal blob types that we wrap into a unified blob interface; you can see that it has no read behavior when you slice it... the only thing that changes is the offset.
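
A toy illustration of that size + offset scheme (not Chromium's or fetch-blob's actual code): slicing only does arithmetic over the same source, and bytes are touched only on read.

class LazyPart {
  constructor(source, offset, size) {
    this.source = source;               // anything exposing readAt(offset, size)
    this.offset = offset;
    this.size = size;
  }
  slice(start = 0, end = this.size) {   // synchronous: no I/O, just a new window
    return new LazyPart(this.source, this.offset + start, end - start);
  }
  async arrayBuffer() {                 // I/O happens only here, at read time
    return this.source.readAt(this.offset, this.size);
  }
}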

guest271314 commented 3 years ago

Nah, when you slice a file you just merely create a new blob that have the same references to the original file

That is not what occurs in the example I posted; it is not the same file, because I am writing to the file in parallel using native code

#!/bin/bash
captureSystemAudio() {
  parec -v --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor | opusenc --raw-rate 44100 - - \
    | ffmpeg -y -i - -c:a copy $HOME/localscripts/output.webm
}
captureSystemAudio

thus the file handle throws an exception https://bugs.chromium.org/p/chromium/issues/detail?id=1084880#c21

DOMException: A requested file or directory could not be found at the time an operation was processed.

is handled 2755 times.

Perhaps I am not being clear about what I am requesting. The complementary issue is writing a file without creating temporary files. When I get permission for a file handle, I should be able to stream only the new contents, and the handle should not throw because the underlying file changed. Not the existing stream() method, which reads the entire file (at that time) to completion, but rather a read until I close the stream; for that, a FileHandle readable attribute could be utilized, and slice() would not be necessary at all. The writable issue is clear enough: scanning (the only rationale given for the creation of temporary files) should be opt-in, not compulsory. Similarly, if the application (implementation) is capable of detecting changes to the file referenced by a file handle, then it should not throw an exception when the file changes, and should stream the data until I close the stream. I suggest that you try the code I posted and verify for yourself whether the rate of reading remains the same throughout the procedure.

guest271314 commented 3 years ago

Another way to put it: how expensive are FileHandle, File, and Blob creation (zero-copy creation, references, memory, time expense)?

RReverser commented 3 years ago

Can we please hide the off-topic comments and move them to a separate thread? There are people who are subscribed to this thread to get updates on a specific question.

fivedots commented 3 years ago

I've hidden some comments in order to keep the issue on topic, i.e. Storage Foundation's relationship to other storage APIs and the state of our conversation with the Chrome Storage Team re: similarities with the Origin Private File System. If you'd like to continue the hidden detailed discussion, please open another issue.

rstz commented 3 years ago

(CC: @annevk , since we got feedback from him regarding this issue)

Hello all,

We've been exploring a few ways to unify Storage Foundation with the File System Access API, by extending the surface of the Origin Private File System. The options we've considered so far can be found here. It would be great to have input on what you consider the best way forward, so please let us know what you think!

annevk commented 3 years ago

Thanks @rstz for following up! The stream-based approach looks promising and as @domenic notes an improvement over the status quo of FileSystemFileHandle. I've asked others at Mozilla to take a look as well.

fivedots commented 3 years ago

After looking at the feedback on the options to merge Storage Foundation API and File System Access API (mentioned here), we've written a more concrete proposal. It describes a few extensions that could be made to the Origin Private File System in order to support our use cases.

Feedback is very welcome!

jimmywarting commented 3 years ago

Creating a File through getFile() is possible when a lock is in place. The returned File behaves as it currently does in OPFS i.e. it is invalidated if file contents change after it was created. In our particular case this means that Files created while there is an active handle will be invalidated when a flush is executed (either explicitly through flush() or implicitly by the OS). It also means that these Files could be used to observe flushed changes done through the new API, even if a lock is still being held.

It would have been nice if it didn't invalidate the File, so that you could create a File instance + an object URL and play a video while at the same time downloading something, for streaming compatibility (so that no modification to the file invalidates the File):

var x = await y.createFile('video.mp4')
await x.truncate(videoFileSize)
var file = await x.getFile()
video.src = URL.createObjectURL(file) // Start watching

// then download, write and stream the video at the same time
x.writer(buffer, offset)

Kind of how it works with VLC when you don't have the whole content yet: you can watch the video while it's still downloading. I faced a similar issue before... I didn't want to use Media Source Extensions (since it's so complicated to support many formats and do seeking and other stuff), and ended up using a service worker + evt.respondWith(new Response(new ReadableStream(...))) instead.

mkruisselbrink commented 3 years ago

Unfortunately a File (or Blob) object fundamentally represents a fixed set of bytes, and a lot is built on top of that assumption. I.e., slicing a Blob, or creating another Blob containing a Blob, merely adds another reference to the same underlying data but doesn't copy any data (also, postMessaging a Blob to a different origin merely shares a reference to the same data). As such, any API based around Blob can't represent a file after it is modified. Being able to stream from a file while it is being written to is a useful feature though, and one that might be achieved by the "readable" exposed by the API proposed in this thread (although only if the readable and writable do not share a cursor? If they share the same cursor, presumably it wouldn't work to stream-read from the file while also writing to it).

jimmywarting commented 3 years ago

I know that they are immutable and represent a fixed set of bytes, and that slicing and constructing a new blob from that blob would be tough (I built a spec-compliant fetch-blob package, after all, so I kind of know how they work internally).

My point was just to bring up the useful feature, if it were possible to do something like it, to try and hatch some new ideas/features.

I guess it could be possible with Chrome's old sandboxed filesystem, where you can get a filesystem: URL. I'm not sure if it works, but I think you are able to watch the movie while modifying/appending data to the file at the same time:

video.src = fileHandle.toURL() // "filesystem:https://example.com/temporary/video.mp4"

Then you are not using an immutable blob.


The browser will realize it has a Content-Length and an Accept-Ranges header, so it can abort and just partially download a chunk of the beginning, and when the browser needs more it will just make a new partial request (at which point you have already written the data beforehand). Knowing these headers also makes it seekable.


What if you could create an object URL from a fileHandle or cursor, and not from a blob? URL.createObjectURL(fileHandle)

asutherland commented 3 years ago

In discussions relating to ServiceWorkers and, I think, Storage at TPAC, there has generally been a desire to specify/implement the existing HTMLMediaElement.srcObject more broadly, like on <img> as discussed in the WHATWG HTML spec, and to avoid introducing any more uses of URL.createObjectURL, since it's a very easy way to accidentally leak a lot of memory, and because there have been interop and privacy/tracking problems in terms of where that URL is valid outside a given document.

Note that the discussion for ServiceWorkers was primarily dealing with letting Response objects be directly fed to DOM objects in Window contexts without needing to involve a ServiceWorker (or Blobs/Files).

jimmywarting commented 3 years ago

☝️ that's neat

jespertheend commented 3 years ago

Fwiw I think the streams-based approach is an excellent way to do it. Having this proposal be an extension on top of OPFS sounds ideal to me, and I'm glad this is being explored. This way OPFS and FSA can benefit from the same performance additions that Storage Foundation would add, without there being any confusion over which APIs to use.

othermaciej commented 3 years ago

After looking at the feedback on the options to merge Storage Foundation API and File System Access API (mentioned here), we've written a more concrete proposal. It describes a few extensions that could be made to the Origin Private File System in order to support our use cases.

Any new updates on this proposal? Does it seem likely to move forward?

Also, sorry for not giving feedback on this proposal earlier. I've circulated the proposal among some of my Apple colleagues, and we like the stream-based API; it seems like that's what was selected anyway.

We will have more comments on specific details once there's a PR (or delta draft) to review.

tomayac commented 3 years ago

There is now a concrete proposal to add a createAccessHandle() method to the FileSystemFileHandle object, which happens in the context of merging this API with the Origin Private File System.
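
For context, that method later shipped as createSyncAccessHandle() (synchronous, restricted to dedicated workers); a sketch assuming that final shape, with an illustrative file name:

// Inside a dedicated worker:
const root = await navigator.storage.getDirectory();
const fileHandle = await root.getFileHandle('data.bin', { create: true });
const access = await fileHandle.createSyncAccessHandle();
const buffer = new Uint8Array(1024);
access.write(buffer, { at: 0 });                   // positioned write, no temporary file
const bytesRead = access.read(buffer, { at: 0 });  // positioned read, no full-file load
access.flush();
access.close();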