Closed benhoyt closed 8 years ago
I see, so the idea is to have something like:
pss --type-add=tmpl=.tmpl --cpp --tmpl FOO
Right?
Sounds good to me. Something the patch would have to consider though: nice way to add both extension and regex new types (or just regexes? since extensions can be done with them - i.e. see --scons)
Yeah, that's right. Though --cpp
is already searched by default, right? In my case I'd probably just do:
pss --type-add html=.tmpl FOO
Regexen for types -- good point. I don't think ack supports that, so that'd be a bonus pss feature. :-) Perhaps "regex literal syntax" ala Perl and JavaScript:
pss --type-add scons=/(SConstruct|.scons$)/
Yes, if you want to search all known types then the above should work.
As for the regex syntax, I think quotes are a better (for shell healthiness and consistency with other pss capabilities) wrapper. Also, pss uses optparse
so it will probably be easier to do something like: --type-add=typename=regex
(note the extra =
).
Optparse actually works fine with or without the extra =
. For example, pss currently understands both --context 1
and --context=1
as equivalents. I'm not 100% sure where this is documented, but it definitely works.
As for regex syntax, aren't quotes worse in terms of "shell healthiness", because then you get into double-quoting and escaping issues? What syntax are you thinking of exactly?
Yes, you're right about not needing =
with optparse.
As for the syntax, I just mean that the shell already processes quotes, so that's the natural thing to use if spaces may appear in the regex. This is how the other regex passing capabilities in pss
work now.
$ pss -g "filenamewith spaces.txt"
Anyhow, feel free to send a pull request or patch that implements this.
Oh, okay, yeah, shell quoting is fine for filenames with spaces in them. I was thinking you were asking how to distinguish regex TypeValue.PATTERN from string TypeValue.EXTENSION in the --type* commands, and was thinking the /REGEX/
syntax might work for that.
However, I realize that it's a bit harder than that, because what if someone adds a plain extension to a PATTERN, or vice versa? Maybe they should all be converted to regex patterns, so that EXTENSION .html
becomes PATTERN .*\.html
To add an extension, you use ack syntax:
--type-add html=.tmpl
To add a pattern, you include the slahes for regex syntax:
--type-add html=/FUNKY_REGEX/
What do you think?
For consistency with -g
and -G
, there should be no special adornments to regexes. Moreover, I suggest to only support regexes here and be done with it. A regex for an extension is a couple of extra characters, and most uses of --type-add
will be scripted anyways.
pss doesn't expose the "extension / regex" dichotomy to users too much, and I don't think there's a good reason to do this here.
Fair enough, and KISS is good. My only problem with that is that then it's not at all compatible with ack (but the same command-line option, so could be quite confusing). What about the slightly fuzzy "if it starts with a dot it's an extension, otherwise it's a regex"? None of the PATTERNs in driver.py start with a dot, and I don't really see that being useful.
Alternatively we could go with a different option name, but that's less useful as an ack replacement.
Being exactly compatible with ack isn't really a goal of pss, at least not at this point. I see no reason to use both, and as for switching from one to another - IMHO it's nice that the high-level features are compatible, but once you get into the nitty details differences are OK. In fact pss already has a number of places where it deviates from ack. So I'd favor internal consistency within pss here.
Okay, I've done this now in two commits over at the benhoyt/pss fork:
--type-set
and --type-add
options. I decided to make it support extensions anyway partly so it's compatible with ack, but mainly because it's just a much nicer UI (scripted or not) to say --type-add html=.tmpl
than --type-add html=.+\.tmpl$
. If the pattern starts with a dot, it's an extension, otherwise it's a regex.What do you think?
I guess now I need to add some tests and make them pass! :-)
The separation of extensions and regexes is there for performance reasons. To make the first change, you'll have to convince me performance remains the same (or almost the same). Even though pss is usually fast, speed is a very important feature for it and searching large code bases is common.
Heh, fair call. I'm not sure yet, but I have a hunch that the all-regex approach may be faster, as it's essentially a single call to re.search
(which is written in C), whereas the extension approach requires a call to os.path.splitext()
, which is a fair bit of Python code. I'll test it, though!
Okay. I was dead wrong, sorry. I didn't realize how the |
regex operator worked -- when you do A|B|C|...
it loops through each regex from left to right until one matches. I was thinking it'd compile that down to a state machine and hence be at least as fast as the O(1)
set lookup. The regex version is several to many times slower as a result (though quite variable, I guess because re
releases the GIL and the OS often does other stuff.) So back to the drawing board on that first commit.
I think what I'd want to do is have each type spec be a list of extensions and a list of regexen, either of which could be empty. The problem with the existing approach is that you can't add a regex --type-add
to an extension type, and vice versa.
I think you're right about the implementation of |
in Python. Oh, this is hugely disappointing. I'll ask around about it (maybe the new regex
module will fix it). But yeah, extensions really help speed ATM.
Okay, I reverted those changed in my fork, and I've fixed this a better way in this commit that allows the type specs to have both extensions and patterns. Again, it simplifies the code in driver.py a fair bit.
Then I've re-added --type-add
and --type-set
in this commit.
Thoughts?
Heh, I realized that because I'm running pss via a batch script that already uses python -c ...
, pss already "has support" for adding and modifying types:
@python -c "from psslib.pss import main; from psslib.driver import TYPE_MAP; TYPE_MAP['html'] = whatever_i_want; main()" %*
Not very nice though. :-)
Just a heads up that this will take me some time to get to - overbooked with other stuff ATM.
Pulled the refactoring from https://github.com/benhoyt/pss/commit/e8aec55b8adc9c946024aab9f681990ce2aac340
I thought about this a bit and I still prefer to simplify, by only allowing a regex there. It's simpler in a number of ways, at the expense of maybe ack compatibility (which is not hugely important) and slightly shorted typing. But these new options are likely to be useful in scripting anyway. Besides, who knows, maybe someone wants to specify a regex for a file that begins with "any letter" (the dot). The inability to do so in the current scheme is ugly.
Would you like to submit a (single commit) pull request with tests for this?
Thanks for pulling the refactoring.
I don't love the "regex only" as supporting both is not really complicated, and \.ext$
is just a nasty way to have to write the extension .ext
(I don't use regexen daily, and always forget whether $
means end of string or start of string). But it's your tool, so I'll run with it. :-)
I'm okay with breaking ack compatibility, but I think it would be wise to give the option a different name in that case. What about either:
1) Add a single --type-pattern TYPE=PATTERN
option to add regex PATTERN to a given TYPE. If the type name doesn't exist, add it. As I see it, there's no real need for type options with ack's --type-set
behaviour anyway.
2) Keep ack-like --type-set
and --type-add
, but make the regex syntax /surrounded-by-slashes/
ala JavaScript/Perl instead of "if it doesn't start with a dot". This is a strict superset of ack (a file extension can't start with a slash) and avoids the issue of someone wanting to specify a regex for a file that begins with a regex dot (any letter).
I'm for option 2, which is only a few lines of code different from my existing patch, but I'll run with options 1 if you feel strongly about that.
(1) SGTM. Perhaps the option can be named --type-add-pattern
and the behavior of creating the type if it didn't previously exists makes sense too.
Okay, cool -- I'll see about adding writing code and tests for that in the next few days.
--type-add-pattern
added in this commit. I had a bit of trouble quickly adding tests for it, but let me know if you'd like, and I can spend a bit more time on that.
Ben, sorry for being quiet for a while. If you're still interested in this and have time to prepare a complete pull request (with tests, etc), please go on
Hi eliben, I'm sorry -- I'm in the middle of moving half way around the world and won't be getting to this anytime soon. I'm using it locally and it's great, but won't have time for doing this right now.
It's OK, @benhoyt, no worries. Have a safe move.
Hope your move went well. Still interested in putting the finishing touch here?
Sorry, this isn't an issue for me anymore and I'm kinda working on other projects now, so I have no real desire to get this into shape now. Sorry!
OK, no worries
Some projects I work on use a bunch of not-very-standard file extensions, such as
.tmpl
for mixed HTML templates. With ack I could use--type-add html=.tmpl
to include this.It would be great to have this feature in pss. It doesn't need to be permanent, because I always run pss via a customized script with my own options anyway (essentially an ackrc replacement).
One can kind of achieve this behaviour with a combination of
-a
and-G
, but then it only searches .tmpl files, whereas I want to include them in all my searches by default. Basically--type-add
would add toTYPE_MAP
.Let me know if you'd like a patch for this. I think it'd be pretty straight-forward to add these options and have them adjust TYPE_MAP.