biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis
https://orangedatamining.com

Basic information about how data objects in Orange3 are handled in memory / tips for profiling add-on memory performance #6100

Closed fititnt closed 2 years ago

fititnt commented 2 years ago

What's your use case?

I'm developing an add-on for Orange, for now mostly to add data preparation features (e.g. steps that run before the data becomes a Data or DataFrame). The ideal scenario would be to have a set of boring, low-level file operations (such as KNIME or pentaho-kettle have) for cleaning data before it becomes some sort of tabular format good enough to be imported into Orange in the usual way.

Boring internals, not really needed for this issue

My strategy for allowing raw data preparation before conversion to Orange uses two types, FileRAW and FileRAWCollection, which mostly hold only identifiers that explain how to find the real files (or a directory of real files) on disk. In other words, I already have a way to pass information between widgets, but it follows the philosophy of "In Linux and UNIX, everything is a file" in a literal sense. So even if we eventually add features such as using pandas to convert one FileRAW into another FileRAW, the operation will release its memory as soon as it finishes. An advantage of this approach is that the optimizations are mostly the generic ones for handling memory with Python (or pandas).
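A minimal sketch of what such "pointer to a file" types might look like, using only the standard library. The field names (path, directory, pattern) and the methods are assumptions made for illustration; the add-on's real FileRAW and FileRAWCollection may be defined quite differently.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class FileRAW:
    """Hypothetical handle: stores where the data lives, not the data itself."""
    path: Path

    def size_bytes(self) -> int:
        # Ask the file system; nothing is loaded into memory.
        return self.path.stat().st_size


@dataclass(frozen=True)
class FileRAWCollection:
    """Hypothetical handle for a directory of raw files."""
    directory: Path
    pattern: str = "*"

    def files(self) -> Iterator[FileRAW]:
        # Yield handles lazily, in a stable (sorted) order.
        for candidate in sorted(self.directory.glob(self.pattern)):
            if candidate.is_file():
                yield FileRAW(candidate)


# Usage: two small files on disk, referenced only by path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_bytes(b"x,y\n1,2\n")
    (Path(tmp) / "b.csv").write_bytes(b"x,y\n3,4\n5,6\n")
    collection = FileRAWCollection(Path(tmp), "*.csv")
    names = [f.path.name for f in collection.files()]
    sizes = [f.size_bytes() for f in collection.files()]
```

Passing such handles between widgets costs only a few bytes per signal, whatever the size of the files they point to.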

For now, the add-on can wrap the low-level pandas.read_table, pandas.read_csv, pandas.read_excel, pandas.read_feather, pandas.read_fwf, pandas.read_html, pandas.read_json, pandas.json_normalize, pandas.read_orc, pandas.read_parquet, pandas.read_sas, pandas.read_spss, pandas.read_stata and pandas.read_xml to produce a DataFrame, and I discovered a function in your code that converts data frames to the Orange Table format.

Note: "reinventing the wheel" of what Orange3 already does is explicitly outside my plans. For example, if users reuse someone else's workflow but their data is much bigger, one "smart default" might be to take a slice of 25% or 10% of the data and warn the user to optimize the types before passing the data to Orange. The user could then learn from later steps what could be optimized in earlier ones, until the data fits in memory.

However, my challenge is to understand how Orange deals with memory before I release the add-on for general use. I still need to deal with the interface "freezing" during long downloads (this article is on my to-do list: https://orange3.readthedocs.io/projects/orange-development/en/latest/tutorial-responsive-gui.html), but that is outside the scope of this issue. As long as the user has disk space, the "importer" lets them add gigabyte-size files on disk. And even for data that would fit in memory on a 1:1 basis, using pandas without properly optimized data types can easily use far too much memory.
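To illustrate the point about data types, here is a small sketch that shrinks a DataFrame by downcasting integer columns and turning low-cardinality text columns into categoricals. The helper names shrink_dtypes and frame_bytes and the 0.5 threshold are assumptions made for this example; only standard pandas calls are used.

```python
import pandas as pd
from pandas.api import types as pdt


def frame_bytes(df: pd.DataFrame) -> int:
    # deep=True also counts the Python strings behind object columns.
    return int(df.memory_usage(deep=True).sum())


def shrink_dtypes(df: pd.DataFrame, max_category_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with cheaper dtypes where that is safe."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pdt.is_integer_dtype(col):
            # Pick the smallest integer type that still holds every value.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pdt.is_object_dtype(col) or pdt.is_string_dtype(col):
            # Categoricals only pay off when values repeat often.
            if col.nunique(dropna=False) <= max_category_ratio * len(col):
                out[name] = col.astype("category")
    return out


# Usage: 1000 rows with a small integer range and two repeated city names.
df = pd.DataFrame({"id": range(1000), "city": ["Ljubljana", "Maribor"] * 500})
small = shrink_dtypes(df)
bytes_before = frame_bytes(df)
bytes_after = frame_bytes(small)
```

Float columns are deliberately left alone here, since downcasting them can lose precision.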

What's your proposed solution?

With all this context said, I think two questions could solve it

  1. Is there something like import logging; log = logging.getLogger(__name__); log.exception(get_memory_size_of(self.data)), where get_memory_size_of is something I can use to inspect the Orange3 internals? If this alone is not sufficient, maybe there is something you already use that lists all data objects, their sizes and which widgets created them?
  2. Is there a general summary of how Orange manages memory? I assume it reuses a Data object from a previous widget as much as possible. I know how computers work at a low level and I'm comfortable with Python, but not with GUIs or Qt, and I'm aware that long-running scripts (in Node.js, for example) can leak memory.
    1. I think my main question here is: what happens if I generate different objects (such as Data and DataFrame) as outputs of a widget, but the DataFrame output of my widget is never connected to another widget? Will Orange free the memory of outputs that nothing else uses? Is self.Outputs.data_frame.send(self.data_frame) smart enough to discard data that no widget wants?
      1. This question matters because, depending on the answer, I will avoid creating too many outputs for all the potential widgets that could use them. For me it would be easier to work around this (even if it takes tens of hours) than to wait for something to be implemented and tested in Orange3.

Are there any alternative solutions?

Orange3 is actually quite good at preventing errors in specific widgets from blowing up the entire interface, but this does not work for memory-related issues. So I think it is better to make the widgets in this add-on, which prepare data for use with Orange, memory-aware in order to protect Orange. For now, though, most of what the data preparation steps do is provide a visual frontend for what could be done as a one-time operation in Python (not just pandas).

Also, maybe a point for another topic: since FileRAW and FileRAWCollection only hold codes that represent physical files on disk, this strategy could be used for lazy loading in other widgets. I started extension development only about 2 weeks ago, so for now I'm mostly dealing with Qt and the basics, but I think it would be feasible to export a FileRAW to some Dask object or something you already have.

markotoplak commented 2 years ago

Thanks for opening an interesting discussion.

First of all, Orange does not limit the data types you use in widgets, but of course, if you wish to connect two widgets, they need to have compatible data types. So if you have a bunch of widgets which you'd like to work with large specialized files, feel free to use a data type that is essentially a pointer to a file. And then, after processing, convert it into a small (in-memory) Orange data Table for interoperability with other widgets.

Memory is handled as in Python: whenever no one has a reference to an object, the garbage collector cleans the space. But yes, after the outputs are sent these objects are referenced by the canvas, because new connections that would need the data could appear after the fact. We do understand that this can be wasteful and have been discussing lazy outputs (see #5720). If you are working with your own data types, feel free to implement them as lazy, as recipes, which are then only evaluated on input. This won't be possible with the current unmodified Table.
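The "recipe" idea for custom data types could be sketched roughly like this. LazyValue is a hypothetical name, not an Orange class, and the sketch says nothing about how lazy outputs are or will be implemented in Orange itself; it only shows an object that stores how to build the data and evaluates it when a consumer asks for it.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Hypothetical lazy output: holds a recipe, computes on first access."""

    def __init__(self, recipe: Callable[[], T]) -> None:
        self._recipe = recipe
        self._value: Optional[T] = None
        self._evaluated = False

    def get(self) -> T:
        # Run the recipe once; later calls reuse the cached result.
        if not self._evaluated:
            self._value = self._recipe()
            self._evaluated = True
        return self._value

    def release(self) -> None:
        # Drop the cached result; the recipe can rebuild it if needed.
        self._value = None
        self._evaluated = False


# Usage: the expensive load only happens when a consumer asks for the data.
calls = []


def expensive_load():
    calls.append("load")
    return list(range(5))


lazy = LazyValue(expensive_load)
calls_before = len(calls)   # nothing has been computed yet
first = lazy.get()
second = lazy.get()         # served from the cache
calls_after = len(calls)
```

A widget could send such an object on an output that is rarely connected: while nobody calls get(), only the recipe is kept alive, not the data.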

The Table class usually takes most of the memory. It is composed of 3 numpy arrays (see https://orange3.readthedocs.io/projects/orange-data-mining-library/en/latest/reference/data.table.html). All functions working on these tables try to preserve memory space by reusing them as views whenever possible.
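The difference between a view and a copy can be checked directly in NumPy. This sketch uses plain dense arrays and does not show which operations Orange itself performs as views; it only illustrates the underlying NumPy behaviour that makes such reuse possible.

```python
import numpy as np

# A small dense array standing in for a table's feature matrix.
X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Basic slicing returns a view: no new data buffer is allocated.
row_slice = X[1:3]

# Indexing with a list of row numbers ("fancy indexing") returns a copy.
row_pick = X[[0, 2]]

slice_shares = np.shares_memory(X, row_slice)
pick_shares = np.shares_memory(X, row_pick)

# Writing through the view changes the original array as well.
row_slice[0, 0] = -1.0
changed_in_original = X[1, 0]
```

Note that nbytes on a view reports the size of the viewed region even though no extra memory was allocated for it, so summing nbytes over views can overcount.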

We are also working on a prototype that uses Dask on-disk tables, but that is not functional enough to be useful yet. See the https://github.com/biolab/orange3/tree/dask branch.

Orange classes do not have any special memory-measuring functions.
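Since there is no built-in helper, a rough estimate can be put together by hand. The sketch below assumes the three arrays are exposed as the attributes X, Y and metas, and that each is either a NumPy array (with nbytes) or a SciPy CSR/CSC sparse matrix (with data, indices and indptr). table_nbytes is a hypothetical helper, and a stand-in object is used here so that the example runs without Orange installed.

```python
from types import SimpleNamespace


def array_nbytes(arr) -> int:
    """Bytes held by one array-like; 0 if it is missing."""
    if arr is None:
        return 0
    if hasattr(arr, "nbytes"):
        return int(arr.nbytes)
    # Sparse CSR/CSC matrices keep their payload in three separate arrays.
    parts = (getattr(arr, name, None) for name in ("data", "indices", "indptr"))
    return sum(int(p.nbytes) for p in parts if hasattr(p, "nbytes"))


def table_nbytes(table) -> dict:
    """Hypothetical helper: per-array and total size of a table-like object."""
    sizes = {name: array_nbytes(getattr(table, name, None))
             for name in ("X", "Y", "metas")}
    sizes["total"] = sum(sizes.values())
    return sizes


# Stand-in for a table: memoryview objects also expose .nbytes.
fake_table = SimpleNamespace(
    X=memoryview(bytearray(800)),
    Y=memoryview(bytearray(80)),
    metas=None,
)
sizes = table_nbytes(fake_table)
```

Two caveats: for object-dtype arrays (such as text columns), nbytes counts only the pointers and not the Python strings behind them, and arrays that are views of each other get counted more than once.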

Feel free to discuss further! And sorry that I missed your message originally.

fititnt commented 2 years ago

Great! I guess I get the general idea.

And yes, I think it makes sense to try Dask instead of changing the way the current Table works (I'm not sure, but adding lazy loading there could introduce more bugs).

What I discovered in the meantime

After some debugging, I noticed that in long-running programs (which can be the case with Qt if heavy operations do not run on a separate thread), even after all references to a very large object are deleted, the Python garbage collector may still not release 100% of that object's memory. It might just be that the garbage collector keeps some memory that it may reuse right afterwards (not necessarily a memory leak, but I have not tested whether repeating this many times could eventually become one).

What I'm going to do to make sure

Use worker thread

I have not tested this, but my intuition is that if heavy work runs on a dedicated worker thread (not on the GUI thread), not only does the interface stay responsive (as in the title of the tutorial https://orange3.readthedocs.io/projects/orange-development/en/latest/tutorial-responsive-gui.html), but as soon as the worker thread is no longer needed, any unused objects are also freed from memory.

Without worker threads (and probably when not reusing the Table, which already handles memory well even when used many times), I discovered that a widget developer could be using more memory than they think. But again, this is not about Orange; it is about how Python works with Qt and the GUI.

The idea of lazy outputs

Humm... I liked the approach!!!

Not sure right now if I will need it, but for an already custom type this would actually allow having a widget with many, many outputs (which could, for example, be different enough to create separate objects in memory that do not reuse each other's data).


Thanks!

markotoplak commented 2 years ago

After some debugging, I noticed that in long-running programs (which can be the case with Qt if heavy operations do not run on a separate thread), even after all references to a very large object are deleted, the Python garbage collector may still not release 100% of that object's memory. It might just be that the garbage collector keeps some memory that it may reuse right afterwards (not necessarily a memory leak, but I have not tested whether repeating this many times could eventually become one).

Python memory management has some interesting details. Python allocates space for small and large objects differently. I forgot the details, and they might be platform (and version) dependent, but here goes... If you allocate large objects, they are taken directly from the OS, and when released, the space is returned to the OS. If you allocate small objects, Python does not release their space to the OS after garbage collection, but does reuse that space for other small objects.

Therefore, even correctly written long-running processes can seem to grow in memory, and there is nothing you can do about it, except perhaps an occasional rewrite that produces fewer intermediate results... I had used Python for years before I noticed the problem, so it does not show up that often. See #2968 for details.
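One way to tell "objects that are still alive" apart from "memory the process has not handed back to the OS" is the standard tracemalloc module, which tracks what Python itself has allocated. The helper measure_peak and the toy workload below are assumptions made for illustration, not something Orange provides.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Run func and report (result, bytes still allocated, peak bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def build_and_sum(n):
    # The list is a large intermediate that dies when the function returns.
    values = [float(i) for i in range(n)]
    return sum(values)


total, current, peak = measure_peak(build_and_sum, 100_000)
```

If current is small after the call while the process size reported by the operating system stays high, the objects were freed and the remaining footprint is most likely allocator behaviour of the kind described above rather than a leak in the widget code.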

Worker threads will not reduce anything, but separate worker processes would. They introduce other inefficiencies though.
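A minimal sketch of the worker-process idea, using only the standard library: the heavy intermediate data lives in a child interpreter, and only a small summary comes back to the parent. The child script and the function summarize_in_child are assumptions made for this example; the multiprocessing and concurrent.futures modules offer other ways to do the same thing.

```python
import json
import subprocess
import sys

# Code run by the child interpreter; the big list exists only there.
CHILD_CODE = """
import json, sys
n = int(sys.argv[1])
values = list(range(n))  # large intermediate result
print(json.dumps({"count": len(values), "total": sum(values)}))
"""


def summarize_in_child(n: int) -> dict:
    """Do the heavy work in a separate process and return a small result."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(n)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    return json.loads(completed.stdout)


summary = summarize_in_child(100_000)
```

When the child exits, all of its memory goes back to the operating system. The costs are the inefficiencies mentioned above: starting an interpreter takes time, and inputs and results have to be serialized between the processes.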

this would actually allow having a widget with many, many outputs

Having widgets with many outputs is confusing and we try to avoid that. But yes, some widgets have outputs that are only rarely used, and having them available all the time is indeed wasteful.