fititnt closed this issue 2 years ago
Thanks for opening an interesting discussion.
First of all, Orange does not limit the data types you use in widgets, but of course, if you wish to connect two widgets, they need to have compatible data types. So if you have a bunch of widgets that you'd like to work with large specialized files, feel free to use a data type that is essentially a pointer to a file. Then, after processing, convert it into a small (in-memory) Orange data Table for interoperability with other widgets.
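A minimal sketch of such a "pointer to a file" data type, using only the standard library. The name FileRef and its methods are hypothetical, not part of Orange; the point is only that the signal passed between widgets holds a path, and the contents are read on demand.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileRef:
    """Hypothetical signal type: refers to a file on disk, holds no contents."""

    path: Path

    def size_bytes(self) -> int:
        # Ask the file system; nothing is loaded into memory.
        return self.path.stat().st_size

    def head(self, n_lines: int = 5) -> list[str]:
        # Read only the first n_lines, e.g. for a preview in a widget.
        with self.path.open("r", encoding="utf-8", errors="replace") as handle:
            return [line.rstrip("\n") for _, line in zip(range(n_lines), handle)]
```

A widget that finally needs tabular data would be the one place where the file is parsed and converted to a Table; everything upstream only passes the small FileRef object around.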
Memory is handled as in Python: whenever no one has a reference to an object, the garbage collector cleans the space. But yes, after the outputs are sent these objects are referenced by the canvas, because new connections that would need the data could appear after the fact. We do understand that this can be wasteful and have been discussing lazy outputs (see #5720). If you are working with your own data types, feel free to implement them as lazy, as recipes, which are then only evaluated on input. This won't be possible with the current unmodified Table.
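One way to picture an output "as a recipe" is a small wrapper that stores how to build the value and only builds it when a receiver asks. This is a sketch under assumptions: LazyValue, get and release are hypothetical names, not an Orange API, and this is not how #5720 proposes to do it.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Hypothetical recipe wrapper: keeps the recipe, builds the value on demand."""

    def __init__(self, recipe: Callable[[], T]) -> None:
        self._recipe = recipe
        self._value: Optional[T] = None
        self._evaluated = False

    def get(self) -> T:
        # Evaluate the recipe at most once, on first use by a receiver.
        if not self._evaluated:
            self._value = self._recipe()
            self._evaluated = True
        return self._value

    def release(self) -> None:
        # Drop the cached value; the recipe can rebuild it later if needed.
        self._value = None
        self._evaluated = False
```

With this shape, an output that nobody connects to costs only the size of the wrapper, at the price of recomputation after release.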
The Table class usually takes most of the memory. It is composed of 3 numpy arrays (see https://orange3.readthedocs.io/projects/orange-data-mining-library/en/latest/reference/data.table.html). All functions working on these tables try to preserve memory space by reusing them as views whenever possible.
We are also working on a prototype that uses Dask on-disk tables, but that is not functional enough to be useful yet. See the https://github.com/biolab/orange3/tree/dask branch.
Orange classes do not have any special memory-measuring functions.
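Since there is no built-in helper, a rough estimate can be put together by hand. The sketch below is hypothetical (estimate_table_nbytes is not an Orange function) and assumes the table exposes its arrays as the attributes X, Y and metas, each with an nbytes attribute as numpy arrays have. The stand-in object at the end only mimics that shape so the snippet runs without Orange installed.

```python
import sys
from types import SimpleNamespace


def estimate_table_nbytes(table) -> int:
    """Hypothetical helper: sum the buffer sizes of a table's main arrays."""
    total = 0
    for name in ("X", "Y", "metas"):
        array = getattr(table, name, None)
        if array is None:
            continue
        # Array-like objects report their buffer size as nbytes;
        # fall back to sys.getsizeof for anything else.
        total += getattr(array, "nbytes", sys.getsizeof(array))
    return total


# Stand-in with the same attribute names; memoryview also exposes nbytes.
stand_in = SimpleNamespace(
    X=memoryview(bytes(80)),
    Y=memoryview(bytes(8)),
    metas=memoryview(bytes(0)),
)
estimated = estimate_table_nbytes(stand_in)
```

Treat the number as an approximation: tables that share data through views would be counted more than once, and for object-typed arrays nbytes covers only the references, not the strings or other objects they point to.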
Feel free to discuss further! And sorry that I missed your message originally.
Great! I guess I get the general idea.
And yes, I think it makes sense to try Dask instead of changing the way the current Table works (as I'm not sure whether trying lazy loading would add more bugs).
After some debugging, I noticed that in long-running programs (which can be the case with Qt if heavy operations do not run in a separate thread), even after deleting all references to a very large object, the Python garbage collector may still not release 100% of that object's memory. It might just be that the garbage collector keeps some memory that it may reuse shortly afterwards (not necessarily a memory leak, but I have not tested whether repeating this many times could eventually become one).
I have not tested this, but intuitively, if heavy work runs on a dedicated worker thread (not on the GUI thread), it not only keeps the interface responsive (as in the title of the tutorial https://orange3.readthedocs.io/projects/orange-development/en/latest/tutorial-responsive-gui.html) but also, as soon as the worker thread is no longer needed, any unused objects will be freed from memory.
Without the strategy of using worker threads (and likely when not reusing the Table, which already handles memory well even when used many times), I discovered that a widget developer could be using more memory than they think. But again, this is not about Orange; it is more about how Python works with Qt and the GUI.
Humm... I liked the approach!!!
Not sure right now if I will need it, but for something that is already a custom type, this would actually allow having a widget with many, many outputs (which could, for example, be different enough to create separate objects in memory that do not reuse each other's data).
Thanks!
> After some debugging, I noticed that in long-running programs (which can be the case with Qt if heavy operations do not run in a separate thread), even after deleting all references to a very large object, the Python garbage collector may still not release 100% of that object's memory. It might just be that the garbage collector keeps some memory that it may reuse shortly afterwards (not necessarily a memory leak, but I have not tested whether repeating this many times could eventually become one).
Python memory management has some interesting details. Python allocates space for small and large objects differently. I forgot the details and they might be platform (and version) dependent, but here goes... If you allocate large objects, they are taken directly from the OS, and when released, the space is returned to the OS. If you allocate small objects, Python does not release their space to the OS after garbage collection, but does reuse that space for other small objects.
Therefore, even correctly written long-running processes can seem to be growing in memory, and there is nothing you can do about it, except perhaps an occasional rewrite targeting fewer intermediate results... I had used Python for years before I noticed a problem, so it does not show up that often. See #2968 for details.
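To tell "objects were not freed" apart from "the process kept the space for reuse", the standard tracemalloc module can help. It reports how much memory Python's allocators currently have handed out to live objects, not what the process holds from the OS. The function name below is made up for illustration, and the sizes will differ between platforms and versions.

```python
import gc
import tracemalloc


def traced_before_and_after_release(n_items: int = 200_000):
    """Return (bytes held while alive, bytes still traced after deletion)."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()

    # Many small objects: the case where space tends to stay with the process.
    data = [str(i) * 4 for i in range(n_items)]
    held, _ = tracemalloc.get_traced_memory()

    del data
    gc.collect()
    released, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return held - baseline, released - baseline
```

If the second number drops close to zero while the process size shown by the OS stays high, the objects were freed and the remaining footprint is retained allocator space rather than a leak of Python objects.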
Worker threads will not reduce anything, but separate worker processes would. They introduce other inefficiencies though.
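A sketch of the separate-process idea using only the standard library: the large intermediate data lives in a child interpreter, and only a small summary comes back. When the child exits, the OS reclaims all of its memory. The names CHILD_CODE and summarize_in_child are illustrative; in practice concurrent.futures.ProcessPoolExecutor or multiprocessing would be the more usual tools, with the start-up and serialization overhead mentioned above.

```python
import json
import subprocess
import sys

# Code run by the child interpreter; the big list never exists in the parent.
CHILD_CODE = """
import json, sys
n = int(sys.argv[1])
big = list(range(n))  # large intermediate result, lives only in the child
print(json.dumps({"count": len(big), "total": sum(big)}))
"""


def summarize_in_child(n: int) -> dict:
    """Run the heavy step in a separate Python process and return a summary."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(n)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child fails
    )
    return json.loads(completed.stdout)
```

The parent pays for process start-up and for passing the result back as text, which is the kind of inefficiency this approach trades for a clean memory footprint.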
> this would actually allow having a widget with many, many outputs
Having widgets with many outputs is confusing and we try to avoid that. But yes, some widgets have outputs that are only rarely used and having them available all the time is indeed wasteful.
What's your use case?
I'm developing an add-on for Orange, for now mostly to add features for data preparation (e.g. before the data becomes a Data or DataFrame). The ideal scenario would be to have a subset of boring, low-level file operations (such as KNIME or pentaho-kettle have) to clean data before it becomes some sort of tabular format good enough to be imported in the usual way with Orange.
However, my challenge is to understand how Orange deals with memory before releasing the add-on for general use. For the sake of this issue, I am setting aside the interface "freezing" during long downloads, which I still need to deal with (this article is on my to-do list: https://orange3.readthedocs.io/projects/orange-development/en/latest/tutorial-responsive-gui.html). However, as long as the user has disk space, the "importer" allows the user to add gigabyte-sized files on disk. And even for data that would fit in memory on a 1:1 basis, using only pandas without properly optimized data types can easily use far too much memory.
What's your proposed solution?
With all this context said, I think two questions could solve it:
1. Is there something like import logging; log = logging.getLogger(__name__); log.exception(get_memory_size_of(self.data)), in which get_memory_size_of is something I can use to inspect the Orange3 internals? If this alone is not sufficient, maybe there is something you already use which would list the sizes of all data objects and which widgets created them, or something of the sort?
2. Is self.Outputs.data_frame.send(self.data_frame) smart enough to discard memory that no widget wants?
Are there any alternative solutions?
Orange3 is actually quite fantastic at preventing errors in specific widgets from blowing up the entire interface, but this doesn't work for memory-related issues. So I think it is better to make the widgets in this add-on that prepare data for use with Orange aware of memory, to protect Orange. But for now, consider that most of what the data preparation steps are doing is... a visual frontend for what would be possible with a one-time operation in Python (not just pandas).
Also, maybe a point for another topic, but since FileRAW and FileRAWCollection just hold codes that represent physical files on disk, this strategy could be used for lazy loading in other widgets. I started with extension development only around 2 weeks ago, so for now I'm mostly dealing with Qt and the basics, but I think it would be feasible to export FileRAW to some Dask object or something you have.