PyCQA / redbaron

Bottom-up approach to refactoring in python
http://redbaron.pycqa.org/

Inserting a large block is very slow #149

Open sdh4 opened 6 years ago

sdh4 commented 6 years ago

Inserting a large block into a ProxyList can be very slow because of the _synchronise() method called after each insertion.

I propose an insert_multiple() method that performs multiple insertions at once: insert_multiple.patch.txt

Use of this patch can easily give an order of magnitude speedup for use cases such as loop unrolling that involve a large number of insert operations.
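The effect of the patch can be sketched with a toy stand-in (these are not RedBaron's actual classes, just a minimal model of the pattern): resynchronising after every insert rebuilds the whole derived list each time, so n inserts cost O(n^2), while deferring to a single sync at the end costs O(n).

```python
class ToyProxyList:
    """Minimal stand-in for RedBaron's ProxyList. Here _synchronise()
    just rebuilds a derived list, much as the real one rebuilds the
    node list from the inner data."""

    def __init__(self):
        self.data = []        # inner list of [value, separator] pairs
        self.node_list = []   # derived list, rebuilt on each sync
        self.sync_count = 0   # track how often we resynchronise

    def _synchronise(self):
        self.sync_count += 1
        # full O(n) rebuild of the derived list
        self.node_list = [value for value, _ in self.data]

    def insert(self, index, value):
        self.data.insert(index, [value, None])
        self._synchronise()   # one full rebuild per insert

    def insert_multiple(self, index, values):
        # Same shape as the proposed patch: do all the raw inserts
        # first, then resynchronise exactly once.
        for value in values:
            self.data.insert(index, [value, None])
            index += 1
        self._synchronise()

plist = ToyProxyList()
plist.insert_multiple(0, ["a", "b", "c"])
print(plist.node_list)   # ['a', 'b', 'c']
print(plist.sync_count)  # 1 (vs. 3 with repeated insert())
```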

rojaster commented 6 years ago

Wassap @sdh4, does it take input as a string to convert to nodes?

PS: please post code in fenced code blocks.

--- base_nodes.py.orig  2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py   2017-11-14 14:16:19.626943779 -0600
@@ -1375,6 +1375,17 @@
         self.data.insert(index, [value, None])
         self._synchronise()

+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self,index,values):
+        """ Insert multiple values from an iterable into the ProxyList 
+        at the specified index, without resynchronizing after each one. """
+        for value in values:
+            value = self._convert_input_to_node_object(value, parent=self.node_list, on_attribute=self.on_attribute)
+            self.data.insert(index, [value, None])
+            index+=1
+            pass
+        self._synchronise()
+    
     def append(self, value):
         self.insert(len(self), value)

What about conversion if we want to do multiple inserts from one RedBaron tree to another?

sdh4 commented 6 years ago

Here's an improvement that can handle a broader range of sources (such as FSTs) because it uses _convert_input_to_node_object_list().

I'm not sure how best to do multiple inserts at different positions without intermediate synchronization. Perhaps a data structure pairing an array of indices with node lists could be used.

Leaving the tree unsynchronised seems like it could be a bad idea.
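The index-plus-node-lists idea above could be sketched as follows (a hypothetical helper, not part of RedBaron; `insert_batches` and its argument shape are assumptions). Applying the batches in descending index order keeps every index valid relative to the original list, and a real ProxyList would call _synchronise() once at the end, so the tree is never left out of sync:

```python
def insert_batches(data, batches):
    """Insert several batches of values into `data` in place.

    batches: list of (index, values) pairs, where each index is
    relative to the ORIGINAL list. Processing in descending index
    order means earlier batches never shift later insertion points.
    """
    for index, values in sorted(batches, key=lambda b: b[0], reverse=True):
        data[index:index] = values
    # a real ProxyList would call self._synchronise() here, exactly once

items = ["a", "d"]
insert_batches(items, [(1, ["b", "c"]), (2, ["e"])])
print(items)  # ['a', 'b', 'c', 'd', 'e']
```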

Separately, this seems to trigger an indentation bug, which I will file separately.

--- base_nodes.py.orig  2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py   2017-11-15 09:42:22.360961636 -0600
@@ -1375,6 +1375,17 @@
         self.data.insert(index, [value, None])
         self._synchronise()

+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self,index,values):
+        """ Insert multiple values from an iterable into the ProxyList 
+        at the specified index, without resynchronizing after each one. """
+        values = self._convert_input_to_node_object_list(values, parent=self.node_list, on_attribute=self.on_attribute)
+        for value in values:
+            self.data.insert(index, [value, None])
+            index+=1
+            pass
+        self._synchronise()
+    
     def append(self, value):
         self.insert(len(self), value)

duncf commented 6 years ago

I haven't looked at the LineProxyList algorithm, but I took a pretty close look at the CommaProxyList algorithm a few months ago.

CommaProxyList was super slow for big files while computing the inter-element indentation (node.indentation is shockingly slow). I suspect that could be what's happening here. #146 has my changes. A similar change to LineProxyList might fix this and #150.

sdh4 commented 6 years ago

The speed problem here seems to come from the _synchronise() call completely rebuilding the inner list, and then the node list from the inner list, after every insert. Done iteratively, the process ends up O(n^2), which blows up quickly for large n.
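A back-of-envelope check of that claim: if each _synchronise() rebuilds the whole inner list, n single inserts touch 1 + 2 + ... + n = n(n+1)/2 elements in total, while a single sync at the end touches n. (A counting sketch, not a real benchmark.)

```python
def cost_per_insert_sync(n):
    # rebuild after every insert: the k-th rebuild touches k elements
    return sum(range(1, n + 1))

def cost_single_sync(n):
    # one rebuild at the end touches all n elements once
    return n

n = 1000
print(cost_per_insert_sync(n))  # 500500 elements touched
print(cost_single_sync(n))      # 1000 elements touched
```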

The patch above addresses the problem and does not seem to generate any other problems (#150 is a separate and independent problem). Shall I make a pull request?