emacs-languagetool / flycheck-languagetool

Flycheck support for LanguageTool
GNU General Public License v3.0
54 stars 8 forks source link

Add ability to filter markup before passing to LT #23

Closed kennyballou closed 1 year ago

kennyballou commented 1 year ago

LanguageTool (LT) can be picky about markup, which increases false positives. I think for now, I can probably ignore it, but I wanted to create an issue to document what I've tried and see if there's any interest in pursuing a better solution (as my solution is neither good nor complete).

Authoring papers, I would like to be able to integrate LT into my workflow, of the options, I like the flycheck version the most. However, since I often author papers in \LaTeX or similar, I would prefer to remove the markup before passing the text to LT. To that effect, I've created a function, detex-filter:

(defun detex-filter (contents)
  "Return buffer contents without (La)TeX commands via `detex'."
  (with-temp-buffer
    (goto-char (point-min))
    (insert contents)
    (call-process-region (point-min) (point-max) program t t)
    (buffer-substring-no-properties (point-min) (point-max))))

And I modified the POST parameters in flycheck-languagetool--start:

- ("text" . ,(buffer-substring-no-properties
+ ("text" . ,(detex-filter (buffer-substring-no-properties

And this "works" for false alarms like \textit{e.g.,} for example. However, there are two issues that this raises: one, how to map points back to the real source? For example, detex filters out the correct expressions in a sentence, transforming something like "\acl{AI}\cite{reference} supports many tasks." to "AI supports many tasks". But now the warning about "many" not being the right quantifier maps onto \cite, which is the wrong point within the original buffer. It is likely my ignorance, but I'm not aware of the correct way of handling these kinds of mapping issues.

Two, extensibility. LT is helpful in more than just \LaTeX modes. Therefore, there needs to be a way to select an appropriate (set of) filter(s) for the current mode. Also likely a lack of knowledge, so there may be solid existing solutions to this, but my limited look at other flycheck checkers didn't turn up anything.

Thoughts?

mavit commented 1 year ago

For LaTeX, have you investigated YaLafi, as mentioned in the LanguageTool documentation? I haven't tried it, but I imagine you should be able to set flycheck-languagetool-server-command to python -m yalafi.shell --server my --as-server 8082 --lt-directory /path/to/LT, flycheck-languagetool-server-port to 8082, and away you go.

In the general (non-LaTeX) case, this is essentially the same problem as #22.

kennyballou commented 1 year ago

@mavit: thank you for the suggestion, I didn't quite see how to mix the two, and you have shown a clear path forward. I'll give YaLafi a try. But I wonder, would it not also be susceptible to the same mapping issue described above?

andreyorst commented 1 year ago

In the general (non-LaTeX) case, this is essentially the same problem as https://github.com/emacs-languagetool/flycheck-languagetool/issues/22.

This is still a pain point for other markup-based modes, like Markdown or Org Mode. And in cases of code blocks, nested in such files, the number of false positives can be daunting.

Ideally, I think, it would be reasonable to replace the contents of all code blocks, verbatims, and links with an equal amount of spaces before sending.

In other words, I'm suggesting that this Org buffer:

Here's a variable =foobar=:

#+begin_src emacs-lisp
(defvar foobar nil)
#+end_src

We can use it as...

should be sent to LT as:

Here's a variable         :

We can use it as...

This would preserve offset calculation, though there might be some false positives on consecutive whitespace.

kennyballou commented 1 year ago

@mavit's suggestion works great for LaTeX mode; no mapping issue, just works(tm). However, I'm going to close this in favor of #22 since YaLafi won't solve checking in general. Specifically, the Org example provided by @andreyorst is particularly relevant. LanguageTool does not handle source blocks well. My approach above should work if I inject an equal amount of space, but it quickly becomes an issue of correct assignment of responsibility.

I can't remember the thread, but I have been looking for examples of how other checkers might solve this issue and I recall a reference to web-mode which inverts the issue; instead of the checker filtering the buffer itself and discerning what's relevant, the major mode is smart enough to only pass certain buffer regions to the relevant checkers. This approach may be more beneficial to Org mode explicitly, but maybe not to the code comments of #22. In either case, this checker would need to be able to parameterize a region instead of sending the whole buffer:

("text" . ,(buffer-substring-no-properties
            (point-min) (point-max))))