Auto-detect language when code is pasted

bblfsh / web

Web client for Babelfish server

http://dashboard.bblf.sh

GNU General Public License v3.0


Auto-detect language when code is pasted #190

Open dennwc opened 5 years ago

dennwc commented 5 years ago

I'm wondering if it's possible to refresh the detected language when the code is pasted and the language is set to "Auto"?

smacker commented 5 years ago

Currently we "detect" the language using bblfsh, so we need to parse the code to get the language.

I previously proposed JS-based detection for gitbase-web, but the idea was discarded. In theory we could "parse" in the background on a paste event without updating the UAST on the right side.
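A minimal sketch of the "parse in the background on paste" idea, assuming the client keeps the selected language as a string where "auto" means Auto. The names `onPasteDetection` and `DetectRequest` are hypothetical and do not come from the bblfsh/web code base; the sketch only decides whether a background parse should be requested and marks it so the UAST pane is left alone.

```typescript
// Hypothetical sketch: none of these names come from bblfsh/web.
interface DetectRequest {
  code: string;        // editor content after the paste
  updateUast: boolean; // false: use the parse response only for its language
}

// Decide what to do when a paste event fires.
// Returns a background-parse request, or null when nothing should happen.
function onPasteDetection(
  languageSetting: string,
  codeAfterPaste: string
): DetectRequest | null {
  // The user picked a language explicitly: do not override it.
  if (languageSetting !== "auto") {
    return null;
  }
  // Nothing to parse.
  if (codeAfterPaste.trim() === "") {
    return null;
  }
  return { code: codeAfterPaste, updateUast: false };
}
```

The caller would send `code` to the server as usual and, because `updateUast` is false, read only the detected language from the response.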

dennwc commented 5 years ago

We can also expose language detection API on bblfshd side to make this happen.

dpordomingo commented 5 years ago

How would it work @dennwc ?

Should it try to guess the language every time an onpaste event is detected?

Could it lead to weird language detections?

Should it only work when the whole input area is completely replaced with text from the clipboard? (I'm not sure if it can be done without workarounds)

What if the whole input area is replaced by a key press or by typing new code? Should it guess the language again? Example:

The language is set to Auto and is guessed as Java.

The user selects the whole input text.

The user types #include <stdio.h>

→ Should it be guessed as C, or only if the text were pasted from the clipboard?
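One way to answer the "whole input area replaced" question, sketched under the assumption that the editor exposes textarea-style selection offsets (as `selectionStart` and `selectionEnd` do on a plain textarea). The actual editor component may expose the selection differently, and `pasteReplacesWholeInput` is a hypothetical helper name.

```typescript
// Hypothetical helper: returns true when a paste would replace everything
// currently in the input, given the selection offsets captured before the paste.
function pasteReplacesWholeInput(
  currentValue: string,
  selectionStart: number,
  selectionEnd: number
): boolean {
  // An empty input is trivially replaced as a whole.
  if (currentValue.length === 0) {
    return true;
  }
  // Otherwise the selection must span the entire content.
  return selectionStart === 0 && selectionEnd === currentValue.length;
}
```

With a check like this, re-detection could be limited to pastes that replace the entire content, while partial pastes keep the previously guessed language.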

dennwc commented 5 years ago

I think detecting on input will be a bit too extreme. onpaste looks practical enough, I guess.

creachadair commented 5 years ago

How well will the detection we currently support work with just plain text and no filename?

smacker commented 5 years ago

It hardly works at all. Bblfsh uses enry to detect the language, and enry detects it based on the filename.

dennwc commented 5 years ago

@smacker Enry also uses other heuristics. A filename is only one of them.

creachadair commented 5 years ago

> @smacker Enry also uses other heuristics. A filename is only one of them.

It does, but the filename seems to be a very important one. So my question was real—I am not sure how well Enry will do in the case where no filename is inferred.

smacker commented 5 years ago

@dennwc it's true. But the last time I checked, it could almost never guess the language correctly without a filename. You can see it in gitbase-web: we actually use enry there, but you still have to choose the language manually in 99.99% of cases.

smacker commented 5 years ago

I tried with linguist, and it couldn't recognize the language by default either. But calling it with a list of candidates ("Go", "Python", "JavaScript", "Ruby", "Java", i.e. the languages supported by bblfshd) worked for all my examples.

The same trick with enry didn't work, most probably because the content classifier in enry is very outdated.

We should return to this issue when enry is updated.
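To illustrate why restricting the candidate list helps, here is a toy sketch of candidate-restricted detection. The marker patterns and the `detectAmong` function are made up for this example; this is not how linguist or enry classify content, only a demonstration that scoring against a short list of supported languages removes most sources of confusion.

```typescript
// Illustrative only: these patterns are invented for the sketch and are
// far cruder than a real content classifier.
const MARKERS: Record<string, RegExp[]> = {
  Go: [/^package\s+\w+/m, /\bfunc\s+\w+\(/],
  Python: [/^def\s+\w+\(.*\):/m, /^import\s+\w+/m],
  JavaScript: [/\bfunction\s+\w+\(/, /\b(const|let|var)\s+\w+\s*=/],
  Ruby: [/^def\s+\w+/m, /^end$/m],
  Java: [/\bpublic\s+class\s+\w+/, /\bSystem\.out\.println\(/],
  C: [/^#include\s+<\w+\.h>/m],
};

// Score the code only against the given candidates and return the best match,
// or null when no candidate matches any marker.
function detectAmong(code: string, candidates: string[]): string | null {
  let best: string | null = null;
  let bestScore = 0;
  for (const lang of candidates) {
    const patterns = MARKERS[lang] ?? [];
    const score = patterns.filter((p) => p.test(code)).length;
    if (score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return best;
}

// Candidate list taken from the comment above.
const SUPPORTED = ["Go", "Python", "JavaScript", "Ruby", "Java"];
```

Languages outside the candidate list are never returned, so a snippet that matches none of the supported languages yields null and the user can fall back to manual selection.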