mbroedl opened this issue 6 years ago
This code is a bit nicer. Now keeping line breaks.
// ignoring comments, quotes, and fenced code, not ignoring punctuation
let scopes = ['comment.*', 'quote.*', '!punctuation.*', 'fenced.code.*'];
let editor = atom.workspace.getActiveTextEditor();
let ScopeSelector = require('first-mate').ScopeSelector;
// build the selector only from scopes not deactivated with a leading '!'
let selector = new ScopeSelector(scopes.filter(s => !s.startsWith('!')).join(', '));
// drop every token matching the selector, then reassemble the text line by line
let text = editor.displayBuffer.tokenizedBuffer.tokenizedLines.map(
  line => line.tokens.filter(
    token => !selector.matches(token.scopes)
  ).map(
    token => token.value
  ).join('')
).join('\n');
// guard against match() returning null on an empty buffer
console.log((text.match(/\S+/g) || []).length);
As the scope selector functions like CSS selectors (joined with commas), there are two ways to ignore something (or deactivate it temporarily, see above): either filter it out of the list beforehand (done here via an exclamation mark), or prefix it with a random word such as not, so that 'not punctuation.*' can never match and is thus ignored. The first option is cleaner, the second one easier.
@mbroedl Also, for testing my particular case (usage of this plugin with Fountain), I wrote a bit of code which excludes things that should not be considered in the time evaluation of .fountain files, to see how it could be implemented. It could surely be cleaner with more advanced regexes, but it gives an idea:
stripText: (text, editor) ->
  ...
  arr = text.split('\n')
  out = []
  # keep only lines containing letters, dropping synopses (=) and sections (#)
  for i of arr
    if arr[i].match(/[a-z]/gmu) && !arr[i].match(/^[=|#]/gmu)
      out.push arr[i]
  out = out.join('\n')
  # strip boneyard comments (/* ... */) and any stray // remainders
  text = out.replace(/(\/\*[^*]*\*\/)|(\/\/[^*]*)/g, '')
  text
So here is my conclusion: having to maintain a separate fork of the plugin just to add a minimal code snippet is unfortunate. Filters right in the settings would be good, but the possibility to extend the filtering with a custom code snippet in the init.coffee file would be better. It would allow complex functions, and restricting them to certain grammars.
I think this concept of hook is called Service Hub. This is how I extended some Pigments functions.
Here is the doc: https://github.com/atom/service-hub
Or maybe just Services: https://flight-manual.atom.io/behind-atom/sections/interacting-with-other-packages-via-services/
(I don't know if there is a difference.)
@X-Raym Although I prefer out-of-the-box packages, I see the point of your suggestions. But could you explain how you would want to use services here?
@mbroedl The point would be to let the user write his own text-filtering functions in his init.coffee file. I know some packages allow such features, like pigments with its custom named colors. Though, I don't know how to implement that.
But regarding your proposition: it gets the syntax of each line, so it is not just text filtering, right?
In that case, we could imagine a filter at this level, and a filter at the text variable level (filtering there doesn't give access to the original syntax).
I see, yeah. I haven't implemented something like that yet, but it shouldn't be too bad?
Here's the logic in pigments, I suppose: https://github.com/abe33/atom-pigments/blob/master/package.json#L20 https://github.com/abe33/atom-pigments/blob/master/lib/pigments.coffee#L121 https://github.com/abe33/atom-pigments/blob/master/lib/pigments-api.coffee
Yes, my proposition gets the tokenised syntax (i.e. what your grammar defines, be it fountain, markdown, asciidoc, or whatever) and then filters everything required. So everything you see when syntax highlighting (and whatever is defined but not highlighted).
So comment.* would filter anything that the currently active grammar tags as a comment, for example comment.md, comment.fountain, or comment.py (don't know if the latter exists?). And then you can extend it or make it more specific: comment.fountain, for example, would exclude only fountain comments, but not markdown comments, from the word count. And speaker.annotation.fountain (I made this one up) could ignore every speaker declaration...
So I suppose if your fountain grammar includes all the regexes already and tokenises the text properly, there might not even be a need for a service?
Although there might be cases where you would want to ignore stop words in the word count, or prepositions, but not nouns... ok, that's getting well advanced.
@mbroedl I have to admit having tokenized syntax would make filtering far easier, as I would not have to reimplement regexes at the filter level. So yes, this could be a very good solution too! I would only have to put a list of desired scopes in a text field in the settings, right? That would be very quick to set up :) Maybe one text field for including scopes, and one for excluding some (depending on how many scopes you need, one or the other would be quicker to set up).
ok, that's getting well advanced.
It's getting interesting :D Having a tokenized-syntax filter would satisfy my immediate needs; a text filter is really for advanced usage.
I envisioned something like that on right-click, although this'd be a lot of effort to implement.
They'd probably be stored as a list, which should be changeable in the settings as well:
comments.*, !*.todo, quote.*, punctuation.*, fenced.code.*, !header.*
What do you mean by 'a text field for adding syntax'? Like a regex to further exclude something that is not defined in the grammar?
@mbroedl oh no, I imagine simple text input fields in the word count settings with a CSV list of scopes to keep or exclude. But your solution is by far the most user-friendly and flexible, I like it!
Yeah, I suppose your suggestion would be a first implementation. And only then comes the UI for it, when it's needed and someone has time to implement it. But that can probably wait much longer.
I wouldn't actually keep two fields, because of the excessive number of grammars. There'd be too much that's in neither list.
So I would suggest having only a field to exclude grammars?
At least in my case that would work well, I don't know about you. In markdown there would be no grammar that would ideally be included. As in: everything is text.md, but normal text has no extra specification apart from that, and then everything is included, just as it would be normally. (Even within fenced code blocks, the text.md grammar is still present!)
What do you think? Do you have any use cases for when it would be reasonable to do an inclusion list instead? (Apart from the activation-grammar list that already exists?)
@mbroedl in fountain there are only a few fields which are valuable for the word count: dialogue text, description, and transition. The rest has to be excluded. But fortunately there aren't too many syntax elements, so it would be possible to do this with an exclusion list. Longer, but possible.
But I think an inclusion list is the safest choice. It gives fewer surprises. "Just put there what you want": no surprise from a scope which should have been excluded but which you didn't notice. The counter simply displays what you want. (I don't know if I explained that clearly, let me know ^^)
Mhmmm... do all the things you want to exclude have some kind of grammar structure in common, so that it'd be easy to exclude them?
For Markdown (and other grammars I am aware of) there is no such thing as: 'this is actual text without any fuss', but the whole document carries the same label, even if it's links, markup, etc. So at least in the case of markdown it would actually be impossible to create an inclusion list. And I guess usually they are written (and intended) to mark out things that are noteworthy, such as highlights, links, headings, etc.
Is the whole fountain document tokenised in depth? (You can check by running 'Editor: Log Cursor Scope' from the command palette at any given position and seeing whether the output differs between dialogue text, description, etc.) When you do that in a markdown document, you just get a main category text.md that accounts for every single character in the document. (see the example below)
I'll give you an example. The file:
# Hello
This is an _example_.
Translates to this hierarchical syntax tree with the language-markdown package:
text.md
|— heading.markup.md, heading-1.md
|    |— (#) punctuation.md
|    |— (Hello) [no further tokens]
|— (This is an ) [no further tokens]
|— emphasis.italic.markup.md
|    |— (_) punctuation.md
|    |— (example) [no further tokens]
|    |— (_) punctuation.md
|— (.) [no further tokens]
So to count all words in this case I would need to exclude all punctuation below the document level (text.md). To count everything that is in italics, I need to only count emphasis.italic.markup.md but may still want to exclude punctuation.md... in this case there would be no way to do this only via exclusion, because for example "This is an " does not have any tokens that the italics part does not have...
So apart from your example (and the hypothetical one I made) I don't think there are many cases that require only certain scopes to be counted.
But I start to think it's reasonable to have the option (for whatever use case).
I mean, programmatically it would not be any real extra effort to allow both including and excluding.
But only listing inclusions might create issues in your case as well (as explained above), because for example the punctuation.* tokens (markers for bold, italics, etc.) would sit on a level below the dialogue. So they would be included if you just did inclusion.
Looks like we're starting to understand each other!
I just realised that if you go via inclusion you could only ever count words in your fountain files (assuming other grammars do not have dialogue.* markers, for example)? Would that be acceptable?
@mbroedl I see the inclusion problem. Not sure how to tackle it.
Here is the scope log on my Fountain file, for dialogue name, dialogue text, and description (the things I want in my count):
entity.name.character.fountain
string.other.dialogue.fountain
meta.paragraph.text
It does indeed seem to have generic text too.
We could possibly just pass on whatever is in the text field to the scope selector? https://github.com/atom/first-mate/blob/master/spec/scope-selector-spec.coffee#L33
So I guess in your case you could have (haven't trialled it):
dialogue | character | paragraph
if you don't care about any other word count, whereas if you would also like to count in Markdown, then possibly:
(-fountain & *) | (fountain & (...))
Forwarding the text field into the scope selector would certainly be most powerful, but it might be harder to debug for the user (I cannot find a good scope selector documentation), and it would also remove any possibility to manage that conveniently in a context menu (if users apply complex patterns, that is).
Maybe one could use the comma to delineate blocks to turn on and off? (Commas and disjunctions (OR -> |) are both allowed in grouped brackets, but that should be manageable?)
EDIT: When using a comma with a positive (inclusive) match, excluding comments AND quotes would in my case need to be -comment & -quote, whereas with a negative (exclusive) match it would be easier: comment, quote. It might therefore be better in a UI sense? Then I assume you would need to wrap your statement into a -(...) and it should all be good?
I tend towards an exclusive option because this is likely to be the use case for most people?
@mbroedl Don't you think the syntax for the field could simply be a CSV with Atom scope names, like:
entity.name.character.fountain, string.other.dialogue.fountain, meta.paragraph.text
?
Yes, but:
a) then you need two fields, one for count-only and one for exclude-from-count
b) you will not be able to count anything else but your fountain files
I believe (b) is highly problematic and would require a huge warning, which shouldn't be the aim of a seemingly simple word count package. It could be resolved by adding extra scopes, etc. And both (a) and (b) could be resolved by using the ScopeSelector. So why not use it?
Most people will in fact be able to use it like you suggested, because the ScopeSelector would function exactly this way (although I'd like to invert it, as said), as they want to exclude things (punctuation, language, grammar, etc.) from their count, not only count certain things.
What could be a compromise is to forward the field directly to the scope selector, and then have a checkbox to invert it:
A few hours later... (I really shouldn't have time for this, haha.) Cropping the tokenised text to the selection was really tough, but now it should all work well.
@X-Raym : Could you try whether mbroedl/atom-wordcount works for you?
I believe you want to use
entity.name.character.fountain, string.other.dialogue.fountain, meta.paragraph.text
in the text field and then deactivate the Negate Scope Selector box.
@mbroedl Hmm, not sure how to install this. I downloaded and extracted the zip of your fork and put it with the other atom packages, but it says "Cannot find module 'first-mate'".
What am I missing to cleanly install a package from a forked repo?
In the folder of that package, go to the command line and run apm install to install that missing module.
Otherwise, next time you could try this: apm install mbroedl/atom-wordcount
(Note: neither of these ways links the package to the development mode in atom, so make sure you keep a backup!)
hmmm, apm install mbroedl/atom-wordcount doesn't work; it just kills the cmd.exe window without installing anything, right after a git cloning message.
apm install from the downloaded, extracted zip also breaks:
> oniguruma@6.2.1 install E:\Bureau\atom-wordcount-master\node_modules\oniguruma
> node-gyp rebuild
E:\Bureau\atom-wordcount-master\node_modules\oniguruma>if not defined npm_config_node_gyp (node "C:\Users\USER\AppData\Local\atom\app-1.27.2\resources\app\apm\node_modules\npm\bin\node-gyp-bin\\..\..\node_modules\node-gyp\bin\node-gyp.js" rebuild ) else (node "C:\Users\USER\AppData\Local\atom\app-1.27.2\resources\app\apm\bin\\..\node_modules\node-gyp\bin\node-gyp.js" rebuild )
Building the projects in this solution one at a time. To enable parallel build, please add the "/m" switch.
MSBUILD : error MSB3428: Could not load the Visual C++ component "VCBuild.exe". To fix this, 1) install the .NET Framework 2.0 SDK, 2) install Microsoft Visual Studio 2005 or 3) add the location of the component to the system path if it is installed elsewhere. [E:\Bureau\atom-wordcount-master\node_modules\oniguruma\build\binding.sln]
wordcount@3.0.0 E:\Bureau\atom-wordcount-master
+-- es5-ext@0.10.45 extraneous
+-- lodash@3.10.1
`-- word-regex@0.1.2
npm WARN deprecated coffee-script@1.12.7: CoffeeScript on NPM has moved to "coffeescript" (no hyphen)
gyp ERR! build error
gyp ERR! stack Error: `C:\Windows\Microsoft.NET\Framework\v4.0.30319\msbuild.exe` failed with exit code: 1
gyp ERR! stack at ChildProcess.onExit (C:\Users\USER\AppData\Local\atom\app-1.27.2\resources\app\apm\node_modules\node-gyp\lib\build.js:276:23)
gyp ERR! stack at emitTwo (events.js:106:13)
gyp ERR! stack at ChildProcess.emit (events.js:191:7)
gyp ERR! stack at Process.ChildProcess._handle.onexit (internal/child_process.js:215:12)
gyp ERR! System Windows_NT 10.0.17134
gyp ERR! command "C:\\Users\\USER\\AppData\\Local\\atom\\app-1.27.2\\resources\\app\\apm\\bin\\node.exe" "C:\\Users\\USER\\AppData\\Local\\atom\\app-1.27.2\\resources\\app\\apm\\node_modules\\node-gyp\\bin\\node-gyp.js" "rebuild"
gyp ERR! cwd E:\Bureau\atom-wordcount-master\node_modules\oniguruma
gyp ERR! node -v v6.9.5
gyp ERR! node-gyp -v v3.4.0
gyp ERR! not ok
npm ERR! Windows_NT 10.0.17134
npm ERR! argv "C:\\Users\\USER\\AppData\\Local\\atom\\app-1.27.2\\resources\\app\\apm\\bin\\node.exe" "C:\\Users\\USER\\AppData\\Local\\atom\\app-1.27.2\\resources\\app\\apm\\node_modules\\npm\\bin\\npm-cli.js" "--globalconfig" "C:\\Users\\USER\\.atom\\.apm\\.apmrc" "--userconfig" "C:\\Users\\USER\\.atom\\.apmrc" "install" "--runtime=electron" "--target=1.7.15" "--arch=x64"
npm ERR! node v6.9.5
npm ERR! npm v3.10.10
npm ERR! code ELIFECYCLE
npm ERR! oniguruma@6.2.1 install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the oniguruma@6.2.1 install script 'node-gyp rebuild'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the oniguruma package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs oniguruma
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls oniguruma
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! E:\Bureau\atom-wordcount-master\npm-debug.log
Is there another way to install it?
I also tried to simply copy-paste the modified files, but there is still this first-mate error.
Try apm rebuild. I'm not an expert in Windows, nor with apm, though. The oniguruma package is a dependency of the wordcount package, so this is not related to my changes. :/
@mbroedl I tried apm rebuild both before and after apm install, without success.
How did you make it work on your side?
I just reinstalled it to check; in my case I just go into the directory of the package and type apm install, and then it works. :/
Sorry I cannot help more... :/
You could try npm install instead, but I don't know if that would make a difference, as apm just forwards all commands...
@mbroedl when you say the directory of the package, that can be anywhere the zip has been extracted, right? Even simply E:\Desktop\atom-wordcount-master? apm install is supposed to work from anywhere?
Ok, so you would execute that in the folder of the package:
E:\Desktop\atom-wordcount-master> apm install
to install all the modules needed for the package.
Then you would link it with apm link if you want it always activated, or apm link --dev if you want it activated in dev mode, which is usually recommended.
If you linked it to dev mode, you need to start atom as such (e.g. atom --dev).
From apm help install etc.:
Install the given Atom package to ~/.atom/packages/. If no package name is given then all the dependencies in the package.json file are installed to the node_modules folder in the current working directory.
@mbroedl still no luck. I think there is something I am missing here. I will not bother you more with that, so you can focus on actual feature development (which should be way more stimulating), and I'll try to see if I can find some documentation somewhere.
I hope other people will join the discussion and have more luck with the installation!
@mbroedl I can't test, but I can still provide feedback on integration ideas! 😄
@mbroedl Have you tested your mod with files other than fountain? How well does it perform?
Also, as far as I can see, this will not solve every problem. I think having a function to modify the text variable output could still be nice, in case there are things to exclude which are not marked by a syntax (you spoke about links in Markdown?).
@mbroedl - are you still working on this? Looks like a solid approach and would add tremendous value. Let us know if you need help. I'd be happy to install and test as well.
@davidlday I haven't been working on it since my last post here, but the approach as above is still up and running and I'd say fully functional. I just rebased it on the master branch.
You should be able to use this: mbroedl/atom-wordcount
I don't know if this is of any use to you, but I currently strip these scopes for the word count: comment, fenced.code, punctuation, destination.link, front-matter.yaml.source.md
Let me know what you think!
PS. I don't know if the package will work with the tree-sitter parsers or only with the traditional regex one, as I'm only using the latter. The approach should be the same, but I believe the interface to the tokenised buffer is different.
EDIT: Note that I've implemented a settings-only approach, and the fancy graphics I made above never became reality...
@mbroedl I'll see if I can do some testing later today.
@mbroedl I'm on macOS Mojave and I'm getting the following when running apm install:
xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance
xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance
Traceback (most recent call last):
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/gyp_main.py", line 16, in <module>
sys.exit(gyp.script_main())
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py", line 545, in script_main
return main(sys.argv[1:])
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py", line 538, in main
return gyp_main(args)
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py", line 523, in gyp_main
generator.GenerateOutput(flat_list, targets, data, params)
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/generator/make.py", line 2170, in GenerateOutput
part_of_all=qualified_target in needed_targets)
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/generator/make.py", line 795, in Write
self.Pchify))
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/generator/make.py", line 1190, in WriteSources
cflags = self.xcode_settings.GetCflags(configname)
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/xcode_emulation.py", line 551, in GetCflags
archs = self.GetActiveArchs(self.configname)
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/xcode_emulation.py", line 420, in GetActiveArchs
xcode_archs_default = GetXcodeArchsDefault()
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/xcode_emulation.py", line 118, in GetXcodeArchsDefault
xcode_version, _ = XcodeVersion()
File "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/gyp/pylib/gyp/xcode_emulation.py", line 1265, in XcodeVersion
version = re.match(r'(\d\.\d\.?\d*)', version).groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'
gyp ERR! configure error
gyp ERR! stack Error: `gyp` failed with exit code: 1
gyp ERR! stack at ChildProcess.onCpExit (/Applications/Atom.app/Contents/Resources/app/apm/node_modules/node-gyp/lib/configure.js:305:16)
gyp ERR! stack at emitTwo (events.js:126:13)
gyp ERR! stack at ChildProcess.emit (events.js:214:7)
gyp ERR! stack at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
gyp ERR! System Darwin 18.2.0
gyp ERR! command "/Applications/Atom.app/Contents/Resources/app/apm/bin/node" "/Applications/Atom.app/Contents/Resources/app/apm/node_modules/.bin/node-gyp" "rebuild"
gyp ERR! cwd /Users/dday/Projects/github.com/mbroedl/atom-wordcount/node_modules/oniguruma
gyp ERR! node -v v8.9.3
gyp ERR! node-gyp -v v3.4.0
gyp ERR! not ok
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! oniguruma@7.0.2 install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the oniguruma@7.0.2 install script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
I'll try on one of my Linux boxes in a bit and let you know if I have any luck.
EDIT: Installs without problem on my Linux laptop (Ubuntu). Will do a little testing / validation as I have time over the next day or so.
EDIT2: And I managed to get it installed on my Mac as well. Problem with xcode-select. Nothing to do with your code.
@davidlday Thanks for sorting these issues out! The oniguruma package (which uses xcode) often causes trouble for me as well. Looking forward to your tests!
@davidlday Thanks for reviving this. Are you coordinating the merge of this? Please let me know if you need any more access/publish rights.
@OleMchls More than happy to help! Yes, I can coordinate merging but might tag you if I'm uncertain on something. Will also let you know if I hit any access issues.
@mbroedl Accessing tokenizedBuffer (lines 94 & 99 in wordcount-view.coffee) is an undocumented part of the API. I was going over the API docs this morning and think there might be another way of getting tokens.
Grammar provides two methods for tokenizing: tokenizeLines(text) and tokenizeLine(line, ruleStack, firstLine).
Do you think something like this would work instead of editor.tokenizedBuffer.tokenizedLines?
editorGrammar = editor.getGrammar()
editorGrammar.tokenizeLines(editor.getText())
There's no equivalent tokenizeLinesForRows(), but maybe looping over the lines and using the documented Grammar.tokenizeLine() would work?
Take a look and let me know what you think.
@mbroedl I did a little poking around in the console, and it looks like this definitely won't work for tree-sitter grammars. getGrammar() returns a TreeSitterGrammar, which doesn't provide any tokenization methods. I was able to emulate it by doing:
let tokenLines = [];
for (let i = 0; i < editor.getScreenLineCount(); i++) {
  tokenLines[i] = editor.tokensForScreenRow(i);
}
However, editor.tokensForScreenRow() is undocumented as well. I discovered it in atom's tree-sitter-language-mode-spec.js. This probably doesn't help much other than to capture the situation.
Thanks for doing some digging, @davidlday!
The editor.tokensForScreenRow() method exists for the regex parsers as well, so this might be a good start?
My suggestion would be to just implement it like that, with a fallback option that counts the words without relying on tokens. In case editor.tokensForScreenRow == undefined, one could then add a warning icon with a brief explanation next to the word counts? Maybe clicking on that icon could create a new issue (if one doesn't exist yet) or so?
For later stages, I assume that if Grammar.tokenizeLine gets documented, it might be implemented for the TreeSitter grammars as well. So we could always revisit in half a year whether things have changed and whether we would be more comfortable with documented functions (which wouldn't help at the moment with the tree-sitter grammars anyway).
The main reason why I implemented it using the undocumented API (which I had found in other packages) is that I was worried about the overhead of tokenizing the editor once for rendering and then tokenizing it again for counting words. I thought this would cause too much unnecessary overhead.
I think tokensForScreenRow() with the fallback option sounds like a good start, and we can adjust as the API for grammars matures. From reading the docs and discussions, I'm not really clear on their direction or why these APIs aren't being made public yet. This issue on the atom repo has a little more info, although it's geared toward writing tests, not packages.
Regarding the undocumented API vs parsing twice, I don't really feel strongly one way or the other, but using undocumented APIs feels a little more risky since there's no contract around their behavior. I'd defer to @OleMchls on a final call, but really, either implementation is a step forward.
I tagged it in an issue thread on linter-languagetool, which is looking to do something similar. You might peek at that thread to see if there's anything of use that I might have missed. There might even be an opportunity for a base atom package that natural-language extensions could use to cut out the noise of computer symbols.
Thanks for picking this back up, btw!
I see your concerns about the undocumented API. In my own repository I feel alright about it, but the atom directory is a different thing altogether.
There were quite a few modifications necessary to adjust the parser to the tokensForScreenRow API (different names, wrapping of screen rows, scopes as CSS classes instead of first-mate scopes, etc.). Have a look at the commit for details! I think it should be good to go like this. What I am wary about, though, is whether it makes sense to keep the scopes in the first-mate format, or to go for a different library which can match CSS classes instead, or maybe even to implement one ourselves. That would allow for less complexity than the current approach.
Thanks for linking the thread! I don't really have time to write a base atom package (nor the knowledge of how to do it), but having the tokens exposed in an API (maybe including selectors) would be great functionality which I'm sure many people would make use of! Might look into that once I'm a bit less busy.
I've also changed the default settings to work well with generic markup documents (i.e. ignoring comments and punctuation). It would be great if you could test the most recent version in my fork!
I'd defer to @OleMchls on a final call, but really, either implementation is a step forward.
That's a tough call. I would be ok with using the API; there is a way to define private functions. So, for the time being, we can use that; ideally, we could reach out to the atom team (maybe in the linked issue) and gather their opinion. If the API gets removed someday, someone will open an issue here.
I want to take the opportunity to second @davidlday on this:
I think tokensForScreenRow() with the fallback option sounds like a good start,
👍
Resulting from a discussion with @X-Raym in the recently accepted PR #94, I have been wondering why it should not be possible to ignore scopes rather than re-inventing regexes that are already implemented as grammars.
Taking the tokenised lines is theoretically possible. Together with a scope selector, it is possible to filter (positively or negatively) all text elements that match certain scopes. This would further allow ignoring punctuation (much requested, although now dealt with in a new word count regex), different grammars, etc. (see e.g. #55 and #65).
See the minimal example at the top of this thread, which filters out all scopes that are comments, quotes, or punctuation in their respective grammar. It should be copyable 1:1 into the console of atom.
At the time it reduced all lines into a single buffer to make looping easier. That destroys line breaks and may thus not be desirable, but I thought that as a proof of concept it would be sufficient, even though it changes the final word count a little bit. It is no difficulty to instead loop through the tokens and keep only those that do not match the ignored selectors (the updated snippet at the top does exactly that).
Maybe, instead of the many current settings, one could have one text box where all to-be-ignored scopes are listed, and then a little pop-up menu on right-click on the word count (similar to the one for the minimap) would allow activating/deactivating certain scopes in the filtering. This would reduce bulk in the settings (see discussions elsewhere in this package).
Disclaimer: (A) The tokenised lines are not documented and thus subject to change without further notice, although the current selector seems to have been stable for at least two years. (B) I do not know how much the speed of calculation would suffer from this way of filtering (compared to regex, and compared to not filtering).
Looking forward to any discussions!