davidhalter / jedi

Awesome autocompletion, static analysis and refactoring library for python
http://jedi.readthedocs.io
Other
5.81k stars 508 forks source link

Refactoring with the new parser #667

Closed davidhalter closed 4 years ago

davidhalter commented 8 years ago

Note: #103 Discussed a lot of things about introducing refactoring. Now that we're a step further, let's discuss the API. This writeup is the result of someone wanting to contribute towards refactoring.

Refactoring

It's been a while, but since I've been ask how to tackle the issue, let's start simple:

  1. Use the latest branch (currently linter)
  2. Small of initial refactorings.
  3. Specify the API is unfortunately the first thing we need to do. Very happy for feedback there.
  4. Testing

The first and easiest refactorings that I'd like to see are:

One of the next steps could be (or basically anything else):

I'm suggesting the following API (note that Script is given a pointer (line / column index) as well as the path of the file::

def rename(script: jedi.Script, new_name: str) -> Refactoring:

def extract(script: jedi.Script, new_name: str, end: Tuple[int, int]): -> Refactoring
    """
    The starting position is the line/column input of jedi.Script and the
    `end` position the last place, that you want to exclude in your
    exctract logic.
    """

# Doesn't need anything additional, since it just deletes:
def inline(script: jedi.Script -> Refactoring):

Of course the bigger work is defining the return class::

class Refactoring():
    def __init__(self, whatever_we_need_its_private):
        pass

    def get_changed_files(self):

    def get_renames(self) -> List[Tuple[str, str]]:
        """
        Files can be renamed in a refactoring.
        """

    def diff(self) -> str:

class RefactoringError(Exception):
    def __init__(self, message: str):

Please critize it! I'm very happy for all the feedback, especially if you don't like it. Annotations will of course be removed (and changed to docstrings), because we have to maintain Python 2/3 compatibility.

Testing

Since I have previously started a bit of work on refactoring, I think it would be very beneficial to use that test suite. There's a small list of tests already in test/refactor/* (renaming, extract/inline variable). I'm very happy to give pointers about how to use them. Currently they are disabled. Just enable the switch again in test/test_integration.py.

Issues

Currently there's only one known issue (will add to this, if there's more): The parser does error recovery. For now just ignore this, but when we're pushing this to the users, we need to take care of it.

Hints

If you are implementing all of this, you have to use the parser tree's jedi.parser.tree logic. If you think that doing something requires a lot of code, you're probably wrong (or the code should just be part of the tree).

Generally a refactoring has 3 steps:

The module jedi/refactor.py has a very similar class structure as the one proposed above, but it doesn't work. You can throw away basically anything. It can give you an idea how the structure would look like, but the code is crap. THROW IT AWAY.

Please let me know if you need any help to start your work.

dbrgn commented 8 years ago

What format does the diff use?

davidhalter commented 8 years ago

That is a good question. What would you suggest? Should we make that apparent in the command name as well? (like git_diff or whatever)

dbrgn commented 8 years ago

It depends on how the workflow should be. I thought that a refactoring would simply reorder the AST and then rebuild the entire source. But for some clients it could indeed be simpler to apply a diff.

If you choose a diff format, I'd go with a unified diff (diff -u), it can be applied using patch.

dbrgn commented 8 years ago

Another option would simply be to send line numbers and the corresponding content for all lines that changed...

pigmej commented 8 years ago

for integrating with let's say editors, probably the best would be diff -u format. It's also pretty much "standard" format.

purpleP commented 8 years ago

@davidhalter OK, I've started to look at the code a little and while I haven't understood how your code is working right now I can say that generally my thoughts on API are this.

First let's say what is Refactoring? I think about refactoring as function of type (using haskell-like notation)

Refactoring :: FilePointer -> Parser -> [Change]

so it takes file pointer, parser and produces change

where FilePointer (should be a better name) - is a pointer to an element of a file that we want to refactor. Refactoring should then use Parser to convert this pointer to some language element, like a function, class of variable etc and after some work that depends on actual refactoring return Changes.

Change should contain the state of file before change, after and the beetween them. Or maybe it would be better to have a function that can generate diff beetween two states.

I think this api is pretty simple and all we can discuss here is what Change should contain. Do we only want the diff or also the state before and after the refactoring.

Also it is probably important how we represent a file pointer. I mean should we load the file in memory as a string so that nobody could temper with it while we working on it or just use something like with open(file)... etc.

So what do you think about that?

davidhalter commented 8 years ago

@purpleP

so it takes file pointer, parser and produces change

Well, IMO Jedi's refactoring shouldn't produce the change. The IDE itself should be doing the change, because I think file management is the responsibility of the IDE.

Do you agree? I think your whole post was about file management, which I don't intend to do.

purpleP commented 8 years ago

@davidhalter No, of course Jedi doesn't need to change file. I think it should provide information in the form of Change object about what files should be changed to perform refactoring and how exactly they should be changed.

davidhalter commented 8 years ago

So what's your proposal? What do you think should be different?

purpleP commented 8 years ago

Well, as I see now the approach you have now is very similar to what I've been proposing. Other than I call Refactoring a function that does refactoring and not the result that is returned. While I see you reasoning for that naming probably "function return refactoring to be performed by editor" I think slightly different about that "jedi performs refactoring and returns changes to be applied by editor". The main question to me right now (I've looked at the code of parser a little, but haven't understood api yet) is how to use parser. Maybe you could provide or point to some examples on that?

purpleP commented 8 years ago

@davidhalter I see now that parser isn't used in refactorings so far, so I can't find any examples yet. Also I can't understand how tests are working.

# --- simple
def test1():
    #? 7 blabla
    test1()
    AssertionError
    return test1, test1.not_existing
# +++
def blabla():
    blabla()
    AssertionError
    return blabla, blabla.not_existing

This and lines in extract.py inline.py and rename.py look like some mumbo jumbo. PyCharm even shows syntax errors on that files. Also what is not clear to me is how rename can be done without parser, because you need to understand what should be renamed and what isn't

purpleP commented 8 years ago

@davidhalter also I don't see requirements.txt or any other form of dependencies management (not listed in setup.py).

purpleP commented 8 years ago

@davidhalter Ok, I understand now that Script (not a good name imho) have parser for some reason. and redirects calls to user_stmt to parser. But the parser doesn't return most specific statement for position, it returns the whole expression for me in case

  1. z = foo() + 1 and position 6,9 . Is that intentional? I've been expecting it to return something like FunctionApplicaton Node with name foo and arguments etc.
davidhalter commented 8 years ago

I call Refactoring a function that does refactoring and not the result that is returned.

Well, we could still implement that as well. I think that would even be a good idea, but for now I don't think it's that important. Once we need it, we just add a method Refactoring.refactor.

@davidhalter also I don't see requirements.txt or any other form of dependencies management (not listed in setup.py).

Jedi doesn't have any dependencies.

davidhalter commented 8 years ago

This and lines in extract.py inline.py and rename.py look like some mumbo jumbo.

True, it looks a bit weird, however the tests are actually quite straightforward. If it's syntactically incorrect, that is something that should change, of course. The # --- NAME and # +++ are delimiters to separate tests.

--- starts a test and the code after +++ is the result of the test. #? 25 x is the line where the test starts (comment line + 1, indent 25). x would then be the rename.

However feel free to create your own test suite. If you like it better to test with py.test directly, I'm very happy with that as well. I don't like these tests as much anymore as I did when I originally created them.

The main question to me right now (I've looked at the code of parser a little, but haven't understood api yet) is how to use parser. Maybe you could provide or point to some examples on that?

Well, don't get too confused about the user_parser module. It's something that you only need if you work with completions and that kind of user interaction (not refactoring, though).

First I would just run the necessary calls (like jedi.usages for rename) and then you just work with that. Try to inspect Definition._definition a bit. You will see that you have access to the name positions (and many times also the parser)

You can also get the current module directly by using module = Script._parser.module(). (Don't care about protected names, yet). You can e.g. call module.name_for_position than you modify the Name.value or Name.prefix and once you have done that with all of your refactoring you call module.get_code() to retrieve the changed file.

purpleP commented 8 years ago

@davidhalter isn't pytest a dependency?

davidhalter commented 8 years ago

pytest is a dev dependency and by that the only one (maybe tox as well). I would not want everyone to install it.

BTW: I updated my comment above, please read :)

purpleP commented 8 years ago

@davidhalter I agree with that pytest is a dev dependency and I don't like dev dependencies to be installed with setup.py install, but it's actually the way to do it in python. So people could run tests with setup.py test. I see you use travis ci to run tests, right? Haven't used it yet. Can people that install this library run tests somehow before installing?

Anyway, I finally came up with testing scheme. You can see it in latest commit in my fork. So what I do is for every refactoring I will create two folders in tests folder for fixtures. One that is state of files before refactoring and one that is reflects state that should be after refactoring. This folders can have subfolders etc. Every file in this folders should have a unique name so that file from input folder could be matched to zero or one file in the output folder. Then I'm just generating change dict from pairs (input_file, output_file).

Right now this approach can't deal with renaming of files during refactoring, but that's fairly easy. I think that convention that if file should be renamed after refactoring then in output folder we should create a file with name that have name of the input file as a substring

for for example input file is 'foo.py' and after refactoring it should be renamed. We create a file 'new_foo.py' and in tests pass a new name of the file as 'new_foo'.

I hope that this is clear enough.

So after I've created first test with this scheme rename refactoring is failing when I'm renaming function that's used in multiple files. I will fix this some time later.

davidhalter commented 8 years ago

So after I've created first test with this scheme rename refactoring is failing when I'm renaming function that's used in multiple files. I will fix this some time later.

Nice. It might also be an issue with Jedi's usages itself. That method is definitely not bug free.

Are you developing on master branch? If you do, please merge with linter first (or at least dev, but better linter), otherwise we will have very nasty merges.

I agree with that pytest is a dev dependency and I don't like dev dependencies to be installed with setup.py install, but it's actually the way to do it in python. So people could run tests with setup.py test.

You're right, sorry. I learned something again - thanks!

I see you use travis ci to run tests, right? Haven't used it yet. Can people that install this library run tests somehow before installing?

Travis-ci is only for continous integration. It's not for local testing. You can see the results online when you push. If you want to test multiple versions locally, use tox. It's awesome.

purpleP commented 8 years ago

@davidhalter Oh, I've somehow haven't seen that you use master, dev etc flow (like git-flow) and I've branched from master. I'll merge with dev because it's git-flow's way AFAIK. You can see my last commit. I've rewritten rename refactoring and it passes my test now. Previous version somehow haven't changed all files as I've already said. So what is not clear for me now is the api. I don't know how jedi integrates with editors, so I don't really understand what should be returned from refactoring function so that editor could use it to change the source code.

Right now I can see that the 'real' api is change dictionary (I don't understand why we need Refactoring class and new_files, old_files methods) that contains for every file it's path after refactoring and content before and after refactoring. Can you explain how they used?

I can't remember refactoring for which this wouldn't be enough. I mean I can't remember refactoring that for example removes a file (and even then we could represent the deletion with None instead of new path). So let's think together is this format is enough?

If it's enough then I have a question why dictionary instead of tuple of tuples. Is indexing by key is used somewhere?

Sorry if it's annoying that I questioning everything, But I never take anything for granted and always trying to understand design choices and think whether they can be improved or not etc. That's just how my brain works I guess.

davidhalter commented 8 years ago

Right now I can see that the 'real' api is change dictionary (I don't understand why we need Refactoring class and new_files, old_files methods) that contains for every file it's path after refactoring and content before and after refactoring. Can you explain how they used?

If it's enough then I have a question why dictionary instead of tuple of tuples. Is indexing by key is used somewhere?

The questions are very fine. In the first post you can see a sketch of the API. I intentionally left the intialization black, because it does not matter how its implemented. The logic you can see in Jedi right now is very old and I was confused when I looked at it myself. It was probably written the way it was written, because the old refactoring code was very complicated (it had to, because of the old parser). Just change the initialization and the storage to whatever you want - I'm very fine with that.

purpleP commented 8 years ago

@davidhalter I don't understand why returning tuple (file_name, (new_file_name, old_content, new_content)) isn't sufficient. Also in your extract function

script._parser.user_stmt()

If I'm trying to go to definition of user_stmt it gives me 'Can't find any definitions for this' While if I'm trying to do the same thing in PyCharm it gives me list of possible candidates. And AFAIK it suggest valid candidates.

screenshot 2016-01-08 17 00 43

You once said that your parser is better than in PyCharm and if someone find occasions when PyCharm parser is better they should say so, so this is what I'm doing.

Anyway what 'user_stmt' function is doing?

davidhalter commented 8 years ago

@purpleP

I don't understand why returning tuple (file_name, (new_file_name, old_content, new_content)) isn't sufficient.

I don't understand what you're asking me.

script._parser.user_stmt()

This works for me. Cannot reproduce. If you think this is a serious issue post it with a direct Jedi call as a new issue, then I will investigate again.

purpleP commented 8 years ago

@davidhalter I'm asking what is the purpose of Refactoring class. Why not just return list of tuples?

davidhalter commented 8 years ago

Because it's a horrible API :) Seriously, there's a very good reason no well known 3rd party API in Python just returns complicated tuples: Because nobody understands that. Classes are very well understood and can be documented.

purpleP commented 8 years ago

@davidhalter I agree with what you say. But still don't understand why we need object in this case. We can define for example

class FileState(object):

    def __init__(self, path, lines):
        self.path = path
        self.lines = lines

class FileChange(object):

    def __init__(self, initial_state, resulting_state):
        self.initial_state = initial_state
        self.resulting_state = resulting_state

and return list of FileChange objects

davidhalter commented 8 years ago

Because the object allows for more flexibility. It allows for diffs for example.

purpleP commented 8 years ago

@davidhalter I think it just clutters the api. I think API will probably be used by going though the list of changes and applying them. So if we return the list of changes there would be no need to type something like rename(s, pos).changes it would be just rename(s, pos) You're talking about diffs, but what diff should return in case api will return the object that is incapsulating the list of changes? Diffs for every file?

I don't understand why not just use something like that

def possible_usage_of_refactoring(s, pos):
    [apply_change(ch) for ch in rename(s, pos)]

def apply_change(ch):
   d = diff(ch) # or ch.diff 
   do_something_specific_to_editor_with_diff(d)

If you don't agree with that please explain what editor would need exactly to apply refactoring? Just new file name and diff? What is diff actually? Or new name and new content? I don't understand that.

Also I've opened the issue about goto_definitions, I would appriciate if you would look into it

davidhalter commented 8 years ago

You're talking about diffs, but what diff should return in case api will return the object that is incapsulating the list of changes? Diffs for every file?

Not diffs for every file. One diff. This obviously includes all the different files. See it's not just about applying a refactoring. There are multiple possible tasks. Someone maybe wants to display the differences first in a diff -u format, someone wants a different format. Someone wants to directly apply the refactoring. There's a lot of different possible use cases. To allow all of them a class is the way to go.

Also I've opened the issue about goto_definitions, I would appriciate if you would look into it

I've looked into it quickly and I cannot seem to find the issue.

purpleP commented 8 years ago

@davidhalter And what this diff for all files should return?

davidhalter commented 8 years ago

The diff.

purpleP commented 8 years ago

@davidhalter Can you give and example, maybe? Because 'Diff should return the diff' isn't explaining anything. At least for me.

I'm trying to work on extract refactoring now. Is there an api to get the 'element' (I don't know how to name that properly) for position? I see that you've used get_statement_for_position, but it return the whole statement, not the current element (which we want to extract, like a function call, or variable name etc).

davidhalter commented 8 years ago

http://unix.stackexchange.com/questions/73771/what-does-u-of-diff-really-do

This is a diff (diff -u):

$ diff -U 2 file1.txt file2.txt 
--- file1.txt   2013-04-26 09:39:13.496835363 -0400
+++ file2.txt   2013-04-26 09:38:04.838299195 -0400
@@ -1,6 +1,7 @@
 The Rain in Spain by
+added extra line here
 Servants Poor Professor Higgins!
 Poor Professor Higgins! Night and day
 He slaves away! Oh, poor Professor Higgins!
-All day long On his feet; Up and down until he's numb;
+All day long On his feat; Up and down untile he's numb;
 Doesn't rest; Doesn't eat;

I'm trying to work on extract refactoring now. Is there an api to get the 'element' (I don't know how to name that properly) for position?

No not really, I guess it's easiest to just create it for that use case.

purpleP commented 8 years ago

@davidhalter Yeah, I get what diff returns. But my question was what Refactoring.diff should return? A list of tuples (filename, diff)? If so, than I don't see benefits in that. I don't see why we need to return something that would suite everybody's needs. I think we should return something that have all info needed to perform edits for refactoring. And everybody can do what they need with this info. For what I know right now Refactoring class is redundant entity.

davidhalter commented 8 years ago

Yeah, I get what diff returns. But my question was what Refactoring.diff should return? A list of tuples (filename, diff)? If so, than I don't see benefits in that.

Of course not. It will return the diff for all the files. It returns a STRING. Of course we don't return a tuple. Because when we return a tuple we have the same old issues again that we don't want.

I have tried enough now. I know you think differently. I will not discuss this any further. I have made enough experiences to know that this class will be needed (and also that tuple returns are a very bad idea). We have different opinions - that's ok, but I have to maintain the software for ages, so it's also going to be my problem.

purpleP commented 8 years ago

@davidhalter it's not about 'opinions'. I just don't understand what you really want. You want Refactoing.diff to return a string, OK. What the content of this string should be? What I'm trying to get from you is exactly what information refactoring function should provide? Not the format of this information, but what information exactly?

purpleP commented 8 years ago

@davidhalter Oh, and I've forgot to ask what version of python do jedi use?

purpleP commented 8 years ago

@davidhalter Ok, I've looked closely to your Refactoring class and I understand what diff method is doing, but I have two questions. if old_path is None what unified_diff would return? And why concatenate diffs in one string? Is this format used by some editor? Just curious.

davidhalter commented 8 years ago

Python 2.6/2.7 and Python 3.2+ (3.5 will be working once someone tackles the issue).

Don't look at my Refactoring code there. As I've mentioned before

And why concatenate diffs in one string? Is this format used by some editor?

It's not used by editors, it's the way how git patches work and how a lot of humans like to read "changes". e.g. git also will give you a diff of multiple files. That's very intentional.

purpleP commented 8 years ago

@davidhalter OK, That's what I wanted to know! I've made some changes to the inner api. What do you think about them? I've merged with dev, so just see refactoring.py. I've renamed Refactoring to Changes and refactored it a bit, but the api is the same. I haven't done regression test, but it should work. I think that FileState probably should be merged with Content class by adding path to Content maybe, but still have doubts about it. Also function change_for_refactoring have been added to incapsulate that for all refactoring we should generate changes and then pass them to the Changes constructor.

davidhalter commented 8 years ago

What do you think about them?

Could you create a PR? It's way easier to look at the changes there. Because of diffs ;-)

purpleP commented 8 years ago

@davidhalter I can, but only renaming works now. I'm working on extract right now (the old version doesn't work) and I haven't tested inline yet.

davidhalter commented 8 years ago

It's ok, just put a WIP in the title. Pull requests are supposed to be work in progress.

purpleP commented 8 years ago

@davidhalter I've opened another pull request to merge with dev as you've asked. But I have another set of questions.

So I need to somehow refactor get_statement_for_position method of the BaseNode class. First thing I've noticed is that you just doing dfs and checking some condition and returning first match.

  1. My first question is probably stupid, but still. Why not use some graph (or more specific tree) library, that already have many of such functions, dfs, bfs, traversals etc? Your node classes could remain mostly intact, just all methods of Node that involves children could be function on graph.

This is probably a bad idea I think, but still it would interesting to hear your opinion on that.

  1. So while refactor to using graph library is probably a bad idea in the first place I've still recognised the pattern of dfs searching on condition. And what I need to implement extract refactoring is the same search just with different condition. This leads to adding this functions.
def find(iterable, predicate):
    return next((x for x in iterable if predicate(x)), None)

def find_node(start_node, predicate):
    return find(traverse(start_node), predicate)

def traverse(start_node):
    return chain([start_node], *map(traverse, iter(start_node.children)))

I've kind of mocked node in this test, so this should work on you tree, because interface node.children is the same.

class MockNode(object):

    def __init__(self, _id, children=()):
        self.children = list(children)
        self._id = _id

@fixture
def tree():
    level2_0 = MockNode(_id='2_0')
    level2_1 = MockNode(_id='2_1')
    level1_0 = MockNode(_id='1_0', children=(level2_0,))
    level1_1 = MockNode(_id='1_1', children=(level2_1,))
    return MockNode(_id='0', children=(level1_0, level1_1))

def test_traversal(tree):
    ids = [n._id for n in traverse(tree)]
    assert ids == ['0', '1_0', '2_0', '1_1', '2_1']

def test_find_node(tree):
    assert find_node(tree, lambda n: len(n.children) == 1)._id == '1_0'
    assert find_node(tree, lambda n: len(n.children) == 3) is None

And what I need now is to refactor get_statement_for_position method to use this functions and then to add new search function somewhere that would find 'element' for position. How this 'element' properly called by the way?

  1. This question is about refactoring get_statement_for_position
if c.type not in ('decorated', 'simple_stmt', 'suite') \
                        and not isinstance(c, (Flow, ClassOrFunc)):
                    return c
                else:
                    try:
                        return c.get_statement_for_position(pos)
                    except AttributeError:
                        pass  # Must be a non-scope

First. I know only one legitimate use for isinstance check in OO code - custom equality checks, but finding node of specific type in AST is probably another one (at least I can't imagine any other way right now, besides Visitor pattern of course, but it would just be more code I think without any benefits), but why nodes have both type field if they have python type? Second. What condition should I use to find 'element' for position? This question probably follows from me not understanding what this 'element' is named properly.

PS Oh, and I forgot to ask. I haven't found test for get_statement_for_position. Could you provide at least a template for such test. It should probably involve constructing AST either by parsing some file or manually I suppose.

PSS Also I'm not sure what does attribute error doing there and how should I refactor it with my find_node functions. I mean I can define function that would check for existence of method in object, but that's just seems wrong. It should be enough to filter by type really. Maybe this code lack some base class or something, I don't know.

purpleP commented 8 years ago

@davidhalter I've started to look at your AST parser and while I don't understand why the code is doing what it's doing yet I think it would be beneficial to refactor it a little.


def parents(node):
    parent = node.parent
    if not parent:
        return []
    else:
        return chain([parent], *parents(parent))

def parent_until(node, classes=(), including_start=True):
    ps = list(parents(node))
    if not including_start:
        ps = ps[1:]
    predicate = partial(isinstance, cls=classes)
    return find(parents, predicate)

def find_root(node):
    return list(parents(node))[-1]

def parent_until_or_root(node, classes=(), including_start=True):
    ps = list(parents(node))
    if not including_start:
        ps = ps[1:]
    # is this function really used in this way?
    # Why not just have two functions?
    if not classes:
        return ps[-1]
    else:
        predicate = partial(isinstance, cls=classes)
        return find(parents, predicate)

def defitinitions(node):
    def predicate(node, parent):
        return (isinstance(node, (Node, Name)) and
                node.type == 'testlist_comp' and
                parent.type != 'simple_stmt' and
                len(node.children) >= 1)

    ps = [node] + list(parents(node))

    possible_defs = ((scope.children[1] for scope, parent in zip(ps, ps[1:])
                      if predicate(scope, parent)))
    return find(possible_defs, partial(isinstance, cls=CompFor))

def parent_scope(node, include_flows):
    is_flow = partial(isinstance, cls=Flow)

    def is_scope(node):
        return node.is_scope()
    if include_flows:
        predicate = is_flow
    else:
        predicate = is_scope
    return find(parents(node), predicate)

Basically I've tried to refactor while loops into more functional style, which removed code duplication in parent-finding iteration.

I haven't rewritten the methods themselves because I don't see any test, so I can't really be sure that I've refactored this methods correctly. So I would like if you could provide guidance on what this methods really mean and how to test them.

dbrgn commented 8 years ago

@purpleP you should probably provide refactorings of unrelated parts of the code as a separate pull request, so that they can be reviewed independently of the refactoring code. Makes it much easier for @davidhalter to review.

purpleP commented 8 years ago

@dbrgn Yes, obviously. I just wanted to give a taste what I'm going to do, before I'll do that. The problem is - this code isn't tested AFAIK, and to write the test I will probably have to understand all parsing code.

davidhalter commented 8 years ago

So I need to somehow refactor get_statement_for_position method of the BaseNode class.

No please don't. I don't think you need to refactor it. What I suggested is that you need a similar function. But this function doesn't need to live in the syntax tree business logic.

My first question is probably stupid, but still. Why not use some graph (or more specific tree) library, that already have many of such functions, dfs, bfs, traversals etc? Your node classes could remain mostly intact, just all methods of Node that involves children could be function on graph.

It's not a graph, it's a tree. Therefore we have something incredibly unspectacular to work with. DFS is something you can write in like 2s. Jedi has very strong performance requirements in terms of memory and performance. A library is therefore probably a bad idea. I think that trees (contrary to graphs) are so easy to work with that you really don't need a library. However it really doesn't matter, since we're stuck with what we're having anyways.

And what I need now is to refactor get_statement_for_position method to use this functions and then to add new search function somewhere that would find 'element' for position. How this 'element' properly called by the way?

This element can be called like 50 different ways. Therefore it will be more important to work correctly with positions. It's probably better to do a different kind of search: Search the leaf under the cursor (careful it might be whitespace, in that case take the next leaf?) and then you look for parents until you are not allowed to anymore (this is limited by a certain set of node names, see https://docs.python.org/3/reference/grammar.html).

I don't think a variation of get_statement_for_position is a good solution.

but why nodes have both type field if they have python type?

Well, because not all nodes have a Python type. Also you seem to come from a very "healthy" world. You talk about legitimate usages and so on. Don't underestimate historical issues. A lot of the things in Jedi are not ideal, because at some point we had to refactor. The refactorings were not always perfect so you have certain issues within the code. The confusion you're talking about is a real one and it's a historic one. I have been wanting to switch to mostly .type lookups, but the isinstance logic is just still everywhere (and sometimes I actually prefer it).

PS Oh, and I forgot to ask. I haven't found test for get_statement_for_position. Could you provide at least a template for such test. It should probably involve constructing AST either by parsing some file or manually I suppose.

There's like 100 indirect tests. If you mess up the user_stmt some test will start to fail (within Jedi). You don't need to test it separately, I know that's not optimal, but a lot of things aren't.

.PSS Also I'm not sure what does attribute error doing there and how should I refactor it with my find_node functions. I mean I can define function that would check for existence of method in object, but that's just seems wrong. It should be enough to filter by type really. Maybe this code lack some base class or something, I don't know.

I'm pretty sure this AttributeError is just there for checking if it's a Leaf. However you don't have to refactor that function. If you do that's ok. But I will check the complexity of that function and the general time behavior. I don't want these things to get a lot slower.

Basically I've tried to refactor while loops into more functional style, which removed code duplication in parent-finding iteration.

Please don't. I don't like partial and I'm not willing to merge that. I'm all for simplification but not if it goes that way. If you need new functions define them.

e.g.

def find_root(node):
   return list(parents(node))[-1]

Is not something I want there. The parser is already slow enough (and the parser tree logic). If you need this function please call it get_module and use a simple while to traverse.

Also you renamed some of the functions (like get_parent_until). I can understand that, but please don't merge that with your current refactoring code. Like @dbrgn said it will get extremely messy and it's not possible to review it that way.

purpleP commented 8 years ago

@davidhalter Hi again. I had some time to work on you tree parsing code.

def parents(node):
    parent = node.parent
    while parent:
        yield parent
        parent = parent.parent

def find_root(node):
    return last(parents(node))

def last(iterable):
    last = next(iterable)
    for last in iterable:
        pass
    return last

Maybe I don't know python good enough, but AFAIK this code shouldn't take more time or space than simple while loop. At the same time I've already managed to decrease the code size in functions that I've rewritten by ~10-25 percents. I've also probably found a bug.

    def get_previous(self):
        """
        Returns the previous leaf in the parser tree.
        """
        node = self
        while True:
            c = node.parent.children
            # if we are in the first leaf we will have blabla not in list exeption instead of indexerror because we comparing with self instead of using child, node in while loop
            i = c.index(self)
            if i == 0:
                node = node.parent
                if node.parent is None:
                    raise IndexError('Cannot access the previous element of the first one.')
            else:
                node = c[i - 1]
                break

        while True:
            try:
                node = node.children[-1]
            except AttributeError:  # A Leaf doesn't have children.
                return node

The method with most reduced code size right now is this one

    def get_definition(self):
        scope = self
        while scope.parent is not None:
            parent = scope.parent
            if scope.isinstance(Node, Name) and parent.type != 'simple_stmt':
                if scope.type == 'testlist_comp':
                    try:
                        if isinstance(scope.children[1], CompFor):
                            return scope.children[1]
                    except IndexError:
                        pass
                scope = parent
            else:
                break
        return scope

And after refactoring

            for scope, parent in nwise_iter(chain((self,), self.parents, (None,)), 2):
            if (not parent or not isinstance(scope, (Node, Name)) or
                parent.type == 'simple_stmt'):
                return scope
            if parent.type == 'testlist_comp':
                next_s = next_sibling(scope)
                if isinstance(next_s, CompFor):
                    return next_s

Almost 50% less code. It's all generator based, so no unnecessary memory consumption.

Talking about slowness of parser.

    def assignment_indexes(self):
        indexes = []
        node = self.parent
        compare = self
        while node is not None:
            if is_node(node, 'testlist_comp', 'testlist_star_expr', 'exprlist'):
                for i, child in enumerate(node.children):
                    if child == compare:
                        #insert to list is O(n) operation, so this code have quadratic running time.
                        indexes.insert(0, int(i / 2))
                        break
                else:
                    raise LookupError("Couldn't find the assignment.")
            elif isinstance(node, (ExprStmt, CompFor)):
                break

            compare = node
            node = node.parent
        return indexes

If I understand correctly this code had quadratic running time.

    def assignment_indexes(self):
        """
        Returns an array of tuple(int, node) of the indexes that are used in
        tuple assignments.

        For example if the name is ``y`` in the following code::

            x, (y, z) = 2, ''

        would result in ``[(1, xyz_node), (0, yz_node)]``.
        """
        def is_list(node):
            return is_node(
                node, 'testlist_comp', 'testlist_star_expr', 'exprlist'
            )
        pairs = takewhile(
            lambda (ch, p): is_list(p) or not isinstance(p, (ExprStmt, CompFor)),
            nwise_iter(chain((self,), self.parents), 2)
        )
        indices = [(index_in_parent(child, parent) / 2, parent)
                   for child, parent in pairs if is_list(parent)]
        indices.reverse()
        return indices

This code on the other hand is just using one reverse call and generators so it's O(n). And it's almost 40% less code. I don't know the semantics of code, so some function names is probably not very good, but still I believe it's more readable.

I will rewrite a few other functions and after that I will try to test performance of old code vs new code. If my changes would make your parser less performant then I would probably abandon my attempt at reducing the code size or it would live in fork only just for my fun. But I doubt that as I only use generators everywhere it's possible. And push this changes with refactoring code would be bad idea indeed obviously, because for now it's completely unrelated. So If all goes well performance wise I will create pull request with refactored code and then another one with refactoring.

davidhalter commented 8 years ago

Sorry for not answering earlier. I'm very busy at the moment and I find it hard to take out time for Jedi, especially if it's not clearly defined questions. This is a good example: You post a bit of code, but I'm not sure what you actually want me to do. Clearly it's awesome if your code is faster and has less code (if it's still readable). But that's not the point.

If you have something that you would like to get reviewed, just post it in a PR. That would really help. There we could easily discuss the good and bad points. There's just not so much I can say here, because I don't have a very good comparison.