beeware / toga

A Python native, OS native GUI toolkit.
https://toga.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
4.38k stars, 684 forks

Cross-platform, high level file I/O module #1171

Open t-arn opened 3 years ago

t-arn commented 3 years ago

Is your feature request related to a problem? Please describe. Starting with Android 11, standard file I/O is no longer supported on external storage. All file access on external storage must be done through the SAF (Storage Access Framework): https://developer.android.com/about/versions/11/privacy/storage

Describe the solution you'd like I would like to have an abstract, cross-platform, high-level module for handling file I/O. Files should not be referenced with a file path but rather with a URI. On Android, these would be content URIs (content://); on other platforms, we could use file URIs (file://) or whatever fits the platform.
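As a rough illustration of this URI-based addressing, the standard library's urllib.parse can already split both URI forms into scheme, authority, and percent-decoded path (the URIs below are made-up examples, not real resources):

```python
from urllib.parse import urlparse, unquote

# Hypothetical examples of the two URI forms described above.
android_uri = "content://com.android.externalstorage.documents/document/primary%3ADownload"
desktop_uri = "file:///home/user/docs/report.txt"

parsed = urlparse(android_uri)
print(parsed.scheme)          # content
print(unquote(parsed.path))   # /document/primary:Download
print(urlparse(desktop_uri).scheme)  # file
```

A cross-platform module could dispatch to a platform backend based on the scheme alone, keeping the rest of the API identical everywhere.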

The module should provide roughly the following functions:

mhsmith commented 2 years ago

For previous discussion, see #1158.

freakboy3742 commented 2 years ago

Some additional design notes:

It may be worth investigating existing libraries like intake that exist to abstract data access from raw filesystem operations.

coolcoder613eb commented 1 year ago

What about having an API for the appdata folder path, Documents, etc.?

freakboy3742 commented 1 year ago

@coolcoder613eb We have one of those - however, that API is necessary, but insufficient for the general problem. The issue is that on Android, some file locations aren't accessible using standard Unix file I/O APIs; and "cloud"-based locations require a different approach again. This ticket is covering the idea of providing an abstraction over those alternate file sources.

t-arn commented 1 year ago

I have been working on this issue and, as a proof-of-concept, I created a cross-platform uri_io package: https://bitbucket.org/TomArn/tatogalib/src/master/src/com/t_arn/pymod/uri_io/

The online documentation can be found here: https://www.tanapro.ch/products/taTogaLib/docs/html/uri_io/index.html

freakboy3742 commented 1 year ago

Thanks for the update; however, I'm not sure I see how what you've got here meets the needs of a high level Python file I/O module (at least, from Toga's perspective).

A local file in Python is accessed using:

path = Path("/path/to/file.txt")
with path.open('r') as f:
    content = f.read()

To my understanding, what we need here is something that satisfies the API:

path = URIResource("https://example.com/path/to/file.txt")
with path.open('r') as f:
    content = f.read()

As an implementation note, this direct approach is likely to have poor performance; synchronously loading a web resource is going to be a blocking activity, so adopting an approach to async file I/O (like aiofiles) will likely be necessary in practice.
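A minimal sketch of that async shape, assuming the blocking read is simply offloaded to a worker thread (read_uri is a hypothetical name, not an existing Toga or aiofiles API; a local file stands in for a remote resource):

```python
import asyncio
import tempfile
from pathlib import Path

# Hypothetical async wrapper: the blocking read runs in a worker
# thread, so the event loop stays responsive while it completes.
async def read_uri(path: Path) -> str:
    return await asyncio.to_thread(path.read_text)

async def main() -> None:
    path = Path(tempfile.mkdtemp()) / "file.txt"
    path.write_text("hello")
    print(await read_uri(path))  # hello

asyncio.run(main())
```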

On an implementation note, your code has an... eccentric package structure. Why have you adopted a com.t_arn.pymod Python namespace? That approach to namespacing won't work with Python. Python package names have to be unique, unless you're using namespace packages, which you're not. You've staked out the "com" namespace... and no other package will be able to use the "com" namespace. If you were planning to be compatible with "com.example.pymod"... you won't be.

t-arn commented 1 year ago

Hello @freakboy3742, thanks for your comments about the package namespace. I will change it.

And yes, the reading/writing should be async - I was just too lazy so far.

Regarding your code expectation: Except for the naming I think we are not that far apart. This code works:

path = UriFile(self.app, self.ti_source.value)
f = path.open_raw_inputstream()
content = f.read()
f.close()

But this code does not:

path = UriFile(self.app, self.ti_source.value)
with path.open_raw_inputstream() as f:
    content = f.read()

I get this message: 'UriInputStream' object does not support the context manager protocol

I have no idea how to add the context manager protocol. Is this difficult to implement? Is there some example somewhere?

freakboy3742 commented 1 year ago

The context manager protocol is fairly straightforward to implement - you define two methods (__enter__ and __exit__) on the class you want to act as a context manager. __enter__ is executed when you enter the with statement, and returns the "context" (e.g., the open file); __exit__ performs whatever cleanup needs to occur (e.g., closing any open file handles).
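The two methods can be sketched like this (UriInputStream here is an illustrative stand-in backed by an in-memory buffer, not the real class):

```python
# Minimal sketch of the context manager protocol for a stream wrapper.
class UriInputStream:
    def __init__(self, data: bytes):
        self._data = data
        self.closed = False

    def read(self) -> bytes:
        return self._data

    def close(self) -> None:
        self.closed = True

    def __enter__(self):
        # Executed on entering the with block; returns the "context".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Executed on leaving the with block, even after an exception.
        self.close()
        return False  # do not suppress exceptions

with UriInputStream(b"hello") as f:
    content = f.read()
print(content)  # b'hello'
```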

t-arn commented 1 year ago

Sounds easy enough. In __exit__, I will close the stream, right? What should I do in __exit__ when there was an exception? Just propagate it by returning False?

freakboy3742 commented 1 year ago

On file-like resources __exit__ usually closes those handles, so closing the stream would seem appropriate. As for exception handling, propagating sounds like the best approach. The option to do something else is there when an exception is "normal" behavior for some reason (e.g., catching StopIteration errors for a resource that is an iterable). If something has genuinely gone wrong, propagating that error (after doing whatever cleanup is possible) sounds like the right approach.
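To make the __exit__ semantics concrete: cleanup runs whether or not an exception occurred, and returning False (or None) lets the exception propagate, while returning True would suppress it. A small illustrative sketch:

```python
class Resource:
    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.closed = True   # cleanup always runs, exception or not
        return False         # False/None propagates the exception

r = Resource()
try:
    with r:
        raise ValueError("boom")
except ValueError:
    print("propagated")
print(r.closed)  # True
```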

t-arn commented 1 year ago

I changed the package namespace and I added the context manager protocol to UriInputStream and UriOutputStream. https://bitbucket.org/TomArn/tatogalib/src/master/src/tatogalib/uri_io/

The online documentation can be found here: https://www.tanapro.ch/products/taTogaLib/docs/html/uri_io/index.html

@freakboy3742 Now, the following code works, which is pretty close to what you expected:

uristring = "file://C:/Program%20Files/test.pdf"
# for android: "content://com.android.externalstorage.documents/document/primary%3ADownload%2FTest-dir%2Ftest.pdf"
path = UriFile(uristring)
with path.open_raw_inputstream() as f:
    content = f.read()

As for the async implementation: For long file operations, it would probably make sense if the user could show a progress dialog while the operation runs. How could this be implemented best? By passing a callable which expects the current progress (in %) as a param and then calling this callable from the loop which reads / writes blocks?

freakboy3742 commented 1 year ago

Any reason for using open_raw_inputstream() rather than just open()? Are there likely to be other types of "open" (and if there are - can they be accommodated with arguments to open()?)

As for progress indicators - I'm not sure I see why additional handling is required. If you're dealing with a small file, f.read() will return quickly; if you're dealing with large files, it's generally a bad idea to read the whole file in one chunk anyway, so f.read(1024) (or similar) is more likely - which also gives you the point at which you can update a progress indicator.
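The chunked-read idea above can be sketched as a plain loop with a progress callback; this assumes the total size is known up front (e.g., from the content provider or an HTTP header), and all names here are illustrative:

```python
import io

# Sketch: read in fixed-size chunks and report percentage progress
# after each chunk, assuming total_size is known in advance.
def read_with_progress(f, total_size, on_progress, chunk_size=1024):
    chunks = []
    read_so_far = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
        read_so_far += len(chunk)
        on_progress(100 * read_so_far // total_size)
    return b"".join(chunks)

data = b"x" * 4096
updates = []
content = read_with_progress(io.BytesIO(data), len(data), updates.append)
print(updates)  # [25, 50, 75, 100]
```

The same loop is also the natural place to await an async read per chunk, which covers both the progress dialog and the non-blocking requirement at once.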

To that end - if f.read() is a blocking network access method, it should probably be an async call.

t-arn commented 1 year ago

Are there likely to be other types of "open"?

Yes, I intend to add the following methods:

UriFile.open_text_inputstream(encoding)
UriFile.open_text_outputstream(overwrite_mode, encoding, newline)

The newline argument is missing in open_text_inputstream because for me, only the "universal" newline mode makes sense.

And yes, it would be possible to emulate the standard open() method. But I don't like open() because it is too broad, and therefore there are dependencies between the arguments mode and encoding / newline that you need to explain in the docs. I don't think this is good API design.

freakboy3742 commented 1 year ago

Well, sure - Python's open() API is absolutely a hangover of UNIX APIs that has leaked into Python API design. However, the counterpoint is that open() is the API used by literally every file-like API in Python. It's the API people expect when they're told something is a file-like object. Consistency with existing examples is a major source of implicit documentation.

At the very least, open_text_inputstream is both verbose and inaccurate. You're not returning an "input stream" - that's just as bad as Python leaking UNIX terminology into its API, because you're leaking a Java API in that naming.

Also: it doesn't matter if it's reading or writing - if it's text, it has encoding. If it doesn't have encoding, it's bytes. At which point, you need to add an open_bytes_input and open_bytes_output...

Or, you can have a single open() method that accepts r, w, rb and wb as arguments, with an encoding argument that controls how text modes are interpreted. :-)
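A single mode-dispatching open() of the kind described could look roughly like this; the UriFile class and its in-memory backing store are illustrative stand-ins, not the real implementation:

```python
import io

# Sketch: one open() accepting r, w, rb, wb, with encoding/newline
# only affecting the text modes.
class UriFile:
    def __init__(self, raw: bytes = b""):
        self._buffer = io.BytesIO(raw)  # stand-in for the raw platform stream

    def open(self, mode="r", encoding="utf-8", newline=None):
        if mode in ("rb", "wb"):
            return self._buffer
        if mode in ("r", "w"):
            return io.TextIOWrapper(self._buffer, encoding=encoding, newline=newline)
        raise ValueError(f"unsupported mode: {mode!r}")

f = UriFile(b"hello\n").open("r")
print(f.read(), end="")  # hello
```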

t-arn commented 1 year ago

@freakboy3742 I now replaced the UriFile.open_xxx methods with a single open() method. And I added support for text encoding and decoding. For reading, I use a TextIOWrapper and a BufferedReader which works great. But for some reason, using TextIOWrapper and a BufferedWriter for writing did not work. The problem was that TextIOWrapper.write(content) expects content to be a str and at the raw outputstream, it was still a str and not bytes as the outputstream expected. Do you know why this is?

What is still missing is the async operation mode. readall() for example should be async but will it help to define that method as async when on the platform level, it is just 1 function call (readAllBytes)? It will block anyway, won't it? To become non-blocking, I would need to replace readAllBytes with some read loop where I add an asyncio.sleep now and then, right?

And, read(4096) does not need to be async, but read(-1) which is the same as readall() should be async. How should this be handled?

freakboy3742 commented 1 year ago

The problem was that TextIOWrapper.write(content) expects content to be a str and at the raw outputstream, it was still a str and not bytes as the outputstream expected. Do you know why this is?

TextIOWrapper does everything in text; if you want bytes, you need to use a BytesIO, or perform encoding on any output content.
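In other words, the str-to-bytes encoding happens inside TextIOWrapper itself: write str to the wrapper, and encoded bytes arrive at the underlying raw stream. A minimal demonstration with an in-memory buffer:

```python
import io

# TextIOWrapper performs the encoding; the raw stream only ever
# sees bytes.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8")
text.write("grüezi")   # str goes into the wrapper...
text.flush()
print(raw.getvalue())  # b'gr\xc3\xbcezi' ...encoded bytes come out
```

If str objects were reaching the raw output stream, the write was going directly to the raw stream rather than through the wrapper.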

What is still missing is the async operation mode. readall() for example should be async but will it help to define that method as async when on the platform level, it is just 1 function call (readAllBytes)? It will block anyway, won't it? To become non-blocking, I would need to replace readAllBytes with some read loop where I add an asyncio.sleep now and then, right?

And, read(4096) does not need to be async, but read(-1) which is the same as readall() should be async. How should this be handled?

I don't see an inherent problem with read(-1)/readall() being non-async; it's just not a call you'd actually want to use in practice if there's any chance the network connection is slow or the file is large. This is analogous to time.sleep() working, but being the wrong way to implement a pause. We can't prevent people from doing the wrong thing - we can just provide tools and documentation to do the right thing.

In terms of API, aiofile is as good an example as any of the API to be following. As for how it will be implemented - that's very much down to the APIs you're wrapping. aiofile uses a range of approaches, but the one that is most obviously useful is a thread-based implementation; starting an async read starts a background thread that does the actual read, and sends notifications to event objects that are being awaited by the public APIs when the read completes.
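The thread-based approach described above can be sketched with the event loop's executor: the blocking platform call runs in a worker thread, and awaiting the resulting future resumes the coroutine when the thread finishes (blocking_read and async_read are illustrative names):

```python
import asyncio

def blocking_read():
    # Stand-in for a blocking platform call such as readAllBytes().
    return b"payload"

async def async_read():
    # Run the blocking call in the loop's default thread pool; the
    # coroutine suspends until the worker thread completes.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_read)

print(asyncio.run(async_read()))  # b'payload'
```

This avoids the asyncio.sleep-in-a-loop workaround entirely: the event loop stays free while the thread blocks.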

mhsmith commented 1 year ago

There's an existing library which is quite similar to this: universal-pathlib. It's based on fsspec, which is primarily developed by Anaconda, so they should be happy to work with us if we need any changes. But it has a plugin mechanism for adding backends, so we might not need any changes to the core library at all.

t-arn commented 1 year ago

Looks interesting, but I did not find an API for reading / writing files there

mhsmith commented 1 year ago

It looks like it implements the whole pathlib API, including open, read_bytes and read_text. See here for examples.
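Since universal-pathlib's UPath mirrors the standard pathlib interface, code written against Path should work unchanged against a UPath pointing at a remote backend. The shape of that API, shown here with a plain local Path for illustration:

```python
import tempfile
from pathlib import Path

# The same calls would work on e.g. UPath("s3://bucket/key") if the
# whole pathlib API is implemented, as described above.
p = Path(tempfile.mkdtemp()) / "note.txt"
p.write_text("hello")
print(p.read_text())   # hello
with p.open("r") as f:
    print(f.read())    # hello
```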

freakboy3742 commented 1 year ago

@mhsmith Ooh - I knew about fsspec, but didn't realise there was a Pathlib wrapper that used it as well.

I guess the downside is that it doesn't have an async mode... but fsspec does (albeit only implemented for HTTPS), and it should be a lot easier to add async features to an existing project than to start from scratch.

ItsCubeTime commented 1 year ago

As an implementation note, this direct approach is likely to have poor performance; synchronously loading a web resource is going to be a blocking activity, so adopting an approach to async file I/O (like aiofiles) will likely be necessary in practice.

Personally I find asyncio rather problematic to use, as async functions must always be run in an async event loop. Since asyncio only allows one asyncio event loop to run at a time, this can cause incompatibility issues when attempting to mix various APIs/Python modules that want to create their own asyncio event loops.

Instead, I would suggest simply instructing the user to run the function asynchronously using a method of their own choice whenever they need to. This of course also assumes that the user is performance-aware, but to be honest, even if you do use asyncio, there's nothing stopping a user from simply prefixing every function call with await anyway. I rarely see people actually use asyncio.gather(). Not using asyncio also allows easier mixing of synchronous and asynchronous code (since you won't have to worry about handling coroutines and creating async contexts).

ItsCubeTime commented 1 year ago

In an ideal world, I also think implementing pathlib and the file operations in the os module would be a good idea, to allow use of third-party libraries built around pathlib/os. Partial support would be better than no support (I wouldn't say there's an absolute need to support all of the pathlib/os API from day one if you were to decide to go down this route).

Also, universal-pathlib looks nice, so does fsspec :)

freakboy3742 commented 1 year ago

Personally I find asyncio rather problematic to use, as async functions must always be run in an async event loop. Since asyncio only allows one asyncio event loop to run at a time, this can cause incompatibility issues when attempting to mix various APIs/Python modules that want to create their own asyncio event loops.

It sounds to me like you've got a misunderstanding of how asyncio works.

All Toga code can already run in an async context. This is fundamentally necessary to prevent beachballing. As soon as the app is running, you have access to the running event loop, and you can add co-routines to that event loop. Even if you use non-async callbacks in Toga, they're still being invoked in an async context.

The "multiple event loop" problem is something that Toga hits because there's a conflict between the OS-level GUI event loop and the CPython asyncio event loop. I can't think of a single library in the CPython ecosystem that truly has its own main loop - because the main loop is fundamental infrastructure of the async system, not something that an end-library implements. If you've got an app that is using the async keyword... you're pretty much guaranteed it's using Python's asyncio event loop. There are some exceptions to this... but they're not "end user libraries with their own event loops", they're entire alternative asyncio implementations.
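The "add co-routines to the running event loop" point can be sketched with plain asyncio; inside a running Toga app the loop is already running, so the only step needed is scheduling a task on it (background_job and main are illustrative names, and asyncio.run here merely simulates the app's already-running loop):

```python
import asyncio

async def background_job():
    # Stand-in for e.g. an async file read scheduled by app code.
    await asyncio.sleep(0)
    return "done"

async def main():
    # With a loop already running, new coroutines are scheduled on it
    # rather than by starting a second loop.
    task = asyncio.get_running_loop().create_task(background_job())
    return await task

print(asyncio.run(main()))  # done
```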

Instead, I would suggest simply instructing the user to run the function asynchronously using a method of their own choice whenever they need to.

This fundamentally doesn't work, as "the method of their choice" won't be integrated with the GUI event loop (at least, not unless they've done a lot of extra work... and I can almost guarantee they haven't).