Open dennwc opened 5 years ago
Currently we "detect" the language using bblfsh, so we need to parse the code to get the lang.
I proposed to use js-based detection before for gitbase-web but it was discarded. In theory we can "parse" in background on paste event without updating UAST on the right side.
We can also expose a language detection API on the bblfshd side to make this happen.
How would it work @dennwc ?
Should it try to guess the lang every time an onpaste event is detected?
Could it lead to weird language detections?
Should it only work when the whole input area is completely replaced with text from the clipboard? (I'm not sure if it can be done without workarounds)
What if the whole input area is completely replaced by a key press or by typing new code? Should it guess the language again? Example:
- language is set to auto, and guessed as Java
- user selects the whole input text
- user types #include <stdio.h>
→ should it be guessed that it is C, or only if it were pasted from the clipboard?
I think detecting on input will be a bit too extreme. onpaste looks practical enough, I guess.
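To make the rule being discussed concrete, here is a minimal sketch of the decision, assuming re-detection should happen only when the selector is on "auto" and the paste replaces the entire input. All names (`LanguageMode`, `PasteInfo`, `shouldRedetectOnPaste`) are hypothetical and not taken from the playground code.

```typescript
// Hypothetical types and names, for illustration only.
type LanguageMode = "auto" | "manual";

interface PasteInfo {
  mode: LanguageMode;      // current value of the language selector
  previousLength: number;  // length of the input content before the paste
  selectionStart: number;  // start of the range the paste replaces
  selectionEnd: number;    // end of the range the paste replaces
}

// Re-detect only when the selector is on "auto" and the paste replaces
// everything that was in the input (an empty input counts as replaced).
function shouldRedetectOnPaste(info: PasteInfo): boolean {
  if (info.mode !== "auto") {
    return false;
  }
  return info.selectionStart === 0 && info.selectionEnd === info.previousLength;
}
```

In a browser, these values could be read inside a paste handler from the textarea's `selectionStart`, `selectionEnd`, and `value.length` before the default action runs. Typing over a selection would not trigger this path, which matches the "onpaste only" preference.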
How well will the detection we currently support work with just plain text and no filename?
It almost doesn't work at all. Bblfsh uses enry to detect the language, and enry detects it based on the filename.
@smacker Enry also uses other heuristics. A filename is only one of them.
It does, but the filename seems to be a very important one. So my question was real: I am not sure how well Enry will do in the case where no filename is available.
@dennwc it's true. But the last time I checked, it could almost never guess the lang correctly without a filename. You can see it in gitbase-web. We actually use enry there. But you still have to choose the language manually in 99.99% of cases.
I tried with linguist and it couldn't recognize the language by default either. But calling it with a list of candidates ("Go", "Python", "JavaScript", "Ruby", "Java", aka the supported languages from bblfshd) worked for all my examples.
The same trick with enry didn't work, most probably because the content classifier in enry is very outdated.
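The candidate-list trick can be illustrated independently of any particular detector. The sketch below assumes a content classifier that yields a per-language score; that shape is an assumption made for illustration and is not how linguist or enry expose their results. `bestAmongCandidates` is a hypothetical helper.

```typescript
// Hypothetical helper: given classifier scores (language -> score) and a
// list of allowed languages, return the best-scoring allowed language.
function bestAmongCandidates(
  scores: Record<string, number>,
  candidates: string[],
): string | undefined {
  let best: string | undefined;
  let bestScore = -Infinity;
  for (const lang of candidates) {
    const score = scores[lang];
    // Skip candidates the classifier produced no score for.
    if (score !== undefined && score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return best;
}

// Languages supported by bblfshd, as listed in the comment above.
const supported = ["Go", "Python", "JavaScript", "Ruby", "Java"];
```

Restricting the choice this way means an unsupported language with a higher score cannot win, which may be why the candidate list helped in the linguist experiment.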
We should return to this issue when enry is updated.
I'm wondering if it's possible to refresh the detected language when the code is pasted and the language is set to "Auto"?