eliben / pss

pss is a power-tool for searching inside source code files.
Other
327 stars 46 forks source link

Add support for ack's --type-add / --type-set or similar #4

Closed benhoyt closed 8 years ago

benhoyt commented 11 years ago

Some projects I work on use a bunch of not-very-standard file extensions, such as .tmpl for mixed HTML templates. With ack I could use --type-add html=.tmpl to include this.

It would be great to have this feature in pss. It doesn't need to be permanent, because I always run pss via a customized script with my own options anyway (essentially an ackrc replacement).

One can kind of achieve this behaviour with a combination of -a and -G, but then it only searches .tmpl files, whereas I want to include them in all my searches by default. Basically --type-add would add to TYPE_MAP.

Let me know if you'd like a patch for this. I think it'd be pretty straight-forward to add these options and have them adjust TYPE_MAP.

eliben commented 11 years ago

I see, so the idea is to have something like:

pss --type-add=tmpl=.tmpl --cpp --tmpl FOO

Right?

Sounds good to me. Something the patch would have to consider though: nice way to add both extension and regex new types (or just regexes? since extensions can be done with them - i.e. see --scons)

benhoyt commented 11 years ago

Yeah, that's right. Though --cpp is already searched by default, right? In my case I'd probably just do:

pss --type-add html=.tmpl FOO

Regexen for types -- good point. I don't think ack supports that, so that'd be a bonus pss feature. :-) Perhaps "regex literal syntax" ala Perl and JavaScript:

pss --type-add scons=/(SConstruct|.scons$)/
eliben commented 11 years ago

Yes, if you want to search all known types then the above should work.

As for the regex syntax, I think quotes are a better (for shell healthiness and consistency with other pss capabilities) wrapper. Also, pss uses optparse so it will probably be easier to do something like: --type-add=typename=regex (note the extra =).

benhoyt commented 11 years ago

Optparse actually works fine with or without the extra =. For example, pss currently understands both --context 1 and --context=1 as equivalents. I'm not 100% sure where this is documented, but it definitely works.

As for regex syntax, aren't quotes worse in terms of "shell healthiness", because then you get into double-quoting and escaping issues? What syntax are you thinking of exactly?

eliben commented 11 years ago

Yes, you're right about not needing = with optparse.

As for the syntax, I just mean that the shell already processes quotes, so that's the natural thing to use if spaces may appear in the regex. This is how the other regex passing capabilities in pss work now.

$ pss -g "filenamewith spaces.txt"

Anyhow, feel free to send a pull request or patch that implements this.

benhoyt commented 11 years ago

Oh, okay, yeah, shell quoting is fine for filenames with spaces in them. I was thinking you were asking how to distinguish regex TypeValue.PATTERN from string TypeValue.EXTENSION in the --type* commands, and was thinking the /REGEX/ syntax might work for that.

However, I realize that it's a bit harder than that, because what if someone adds a plain extension to a PATTERN, or vice versa? Maybe they should all be converted to regex patterns, so that EXTENSION .html becomes PATTERN .*\.html

To add an extension, you use ack syntax:

--type-add html=.tmpl

To add a pattern, you include the slahes for regex syntax:

--type-add html=/FUNKY_REGEX/

What do you think?

eliben commented 11 years ago

For consistency with -g and -G, there should be no special adornments to regexes. Moreover, I suggest to only support regexes here and be done with it. A regex for an extension is a couple of extra characters, and most uses of --type-add will be scripted anyways.

pss doesn't expose the "extension / regex" dichotomy to users too much, and I don't think there's a good reason to do this here.

benhoyt commented 11 years ago

Fair enough, and KISS is good. My only problem with that is that then it's not at all compatible with ack (but the same command-line option, so could be quite confusing). What about the slightly fuzzy "if it starts with a dot it's an extension, otherwise it's a regex"? None of the PATTERNs in driver.py start with a dot, and I don't really see that being useful.

Alternatively we could go with a different option name, but that's less useful as an ack replacement.

eliben commented 11 years ago

Being exactly compatible with ack isn't really a goal of pss, at least not at this point. I see no reason to use both, and as for switching from one to another - IMHO it's nice that the high-level features are compatible, but once you get into the nitty details differences are OK. In fact pss already has a number of places where it deviates from ack. So I'd favor internal consistency within pss here.

benhoyt commented 11 years ago

Okay, I've done this now in two commits over at the benhoyt/pss fork:

What do you think?

I guess now I need to add some tests and make them pass! :-)

eliben commented 11 years ago

The separation of extensions and regexes is there for performance reasons. To make the first change, you'll have to convince me performance remains the same (or almost the same). Even though pss is usually fast, speed is a very important feature for it and searching large code bases is common.

benhoyt commented 11 years ago

Heh, fair call. I'm not sure yet, but I have a hunch that the all-regex approach may be faster, as it's essentially a single call to re.search (which is written in C), whereas the extension approach requires a call to os.path.splitext(), which is a fair bit of Python code. I'll test it, though!

benhoyt commented 11 years ago

Okay. I was dead wrong, sorry. I didn't realize how the | regex operator worked -- when you do A|B|C|... it loops through each regex from left to right until one matches. I was thinking it'd compile that down to a state machine and hence be at least as fast as the O(1) set lookup. The regex version is several to many times slower as a result (though quite variable, I guess because re releases the GIL and the OS often does other stuff.) So back to the drawing board on that first commit.

I think what I'd want to do is have each type spec be a list of extensions and a list of regexen, either of which could be empty. The problem with the existing approach is that you can't add a regex --type-add to an extension type, and vice versa.

eliben commented 11 years ago

I think you're right about the implementation of | in Python. Oh, this is hugely disappointing. I'll ask around about it (maybe the new regex module will fix it). But yeah, extensions really help speed ATM.

benhoyt commented 11 years ago

Okay, I reverted those changed in my fork, and I've fixed this a better way in this commit that allows the type specs to have both extensions and patterns. Again, it simplifies the code in driver.py a fair bit.

Then I've re-added --type-add and --type-set in this commit.

Thoughts?

benhoyt commented 11 years ago

Heh, I realized that because I'm running pss via a batch script that already uses python -c ..., pss already "has support" for adding and modifying types:

@python -c "from psslib.pss import main; from psslib.driver import TYPE_MAP; TYPE_MAP['html'] = whatever_i_want; main()" %*

Not very nice though. :-)

eliben commented 11 years ago

Just a heads up that this will take me some time to get to - overbooked with other stuff ATM.

eliben commented 11 years ago

Pulled the refactoring from https://github.com/benhoyt/pss/commit/e8aec55b8adc9c946024aab9f681990ce2aac340

eliben commented 11 years ago

I thought about this a bit and I still prefer to simplify, by only allowing a regex there. It's simpler in a number of ways, at the expense of maybe ack compatibility (which is not hugely important) and slightly shorted typing. But these new options are likely to be useful in scripting anyway. Besides, who knows, maybe someone wants to specify a regex for a file that begins with "any letter" (the dot). The inability to do so in the current scheme is ugly.

Would you like to submit a (single commit) pull request with tests for this?

benhoyt commented 11 years ago

Thanks for pulling the refactoring.

I don't love the "regex only" as supporting both is not really complicated, and \.ext$ is just a nasty way to have to write the extension .ext (I don't use regexen daily, and always forget whether $ means end of string or start of string). But it's your tool, so I'll run with it. :-)

I'm okay with breaking ack compatibility, but I think it would be wise to give the option a different name in that case. What about either:

1) Add a single --type-pattern TYPE=PATTERN option to add regex PATTERN to a given TYPE. If the type name doesn't exist, add it. As I see it, there's no real need for type options with ack's --type-set behaviour anyway.

2) Keep ack-like --type-set and --type-add, but make the regex syntax /surrounded-by-slashes/ ala JavaScript/Perl instead of "if it doesn't start with a dot". This is a strict superset of ack (a file extension can't start with a slash) and avoids the issue of someone wanting to specify a regex for a file that begins with a regex dot (any letter).

I'm for option 2, which is only a few lines of code different from my existing patch, but I'll run with options 1 if you feel strongly about that.

eliben commented 11 years ago

(1) SGTM. Perhaps the option can be named --type-add-pattern and the behavior of creating the type if it didn't previously exists makes sense too.

benhoyt commented 11 years ago

Okay, cool -- I'll see about adding writing code and tests for that in the next few days.

benhoyt commented 11 years ago

--type-add-pattern added in this commit. I had a bit of trouble quickly adding tests for it, but let me know if you'd like, and I can spend a bit more time on that.

eliben commented 10 years ago

Ben, sorry for being quiet for a while. If you're still interested in this and have time to prepare a complete pull request (with tests, etc), please go on

benhoyt commented 10 years ago

Hi eliben, I'm sorry -- I'm in the middle of moving half way around the world and won't be getting to this anytime soon. I'm using it locally and it's great, but won't have time for doing this right now.

eliben commented 10 years ago

It's OK, @benhoyt, no worries. Have a safe move.

eliben commented 8 years ago

Hope your move went well. Still interested in putting the finishing touch here?

benhoyt commented 8 years ago

Sorry, this isn't an issue for me anymore and I'm kinda working on other projects now, so I have no real desire to get this into shape now. Sorry!

eliben commented 8 years ago

OK, no worries