Open pombredanne opened 5 years ago
Hi there! We could also use regular expressions to detect the programming language a file is written in. I am working on this right now and will upload the source code soon.
Some recent examples of errors for Programming Language (pygments):
@mjherzog thanks. I pushed an updated pygments library in 4aaec8c58d6813f7428b010bae1494fdf45ac5c8 but this is only a first baby step
I don't know how (or if) this factors into a solution, but I would say that false positives are the main concern. It would be better for a .rST file to be reported as "No Value Detected" for Programming Language than to be falsely reported as VB.Net.
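One way to favor "No Value Detected" over a false positive is to let the detector abstain when its best guess is weak or ambiguous. The snippet below is only a sketch of that idea: `pick_language`, the score dictionary, and both thresholds are hypothetical and are not part of ScanCode or Pygments.

```python
def pick_language(scores, min_score=0.8, min_margin=0.2):
    """Return the best-scoring language, or None to abstain.

    `scores` maps a language name to a confidence in [0, 1].
    Both thresholds are illustrative assumptions, not tuned values.
    """
    if not scores:
        return None
    # Rank candidates from most to least confident.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_language, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    # Abstain when the winner is weak or barely ahead of the runner-up.
    if best_score < min_score or best_score - runner_up_score < min_margin:
        return None
    return best_language


# A weak, ambiguous guess yields None instead of a wrong label.
print(pick_language({"VB.net": 0.45, "reStructuredText": 0.40}))
print(pick_language({"Python": 0.95, "Ruby": 0.10}))
```

A caller would report a `None` result as "No Value Detected".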
Description
ScanCode programming language detection is not as accurate as it could be, and it is important to get this right in order to drive further automation. We also need to automatically classify each file into facets when possible.
The goal of this ticket is twofold. First, improve the quality of programming language detection, which uses only Pygments today and could use another tool (e.g. a Bayesian classifier like GitHub Linguist, or enry?). Second, create and implement a flexible framework of rules to automate assigning files to facets, which could use some machine learning and a classifier.
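To make the "framework of rules" idea concrete, here is a minimal sketch of rule-based facet assignment using glob patterns. The facet names, the patterns, and the `assign_facets` helper are all assumptions chosen for illustration; they do not describe an existing ScanCode API.

```python
from fnmatch import fnmatchcase

# Hypothetical (facet, glob pattern) rules, checked in order.
FACET_RULES = [
    ("tests", "*/tests/*"),
    ("docs", "*.rst"),
    ("docs", "*.md"),
    ("dev", "*/Makefile"),
]


def assign_facets(path, rules=FACET_RULES, default="core"):
    """Return the facets whose patterns match `path`, or [default]."""
    matched = [facet for facet, pattern in rules if fnmatchcase(path, pattern)]
    # Drop duplicates while keeping rule order.
    unique = list(dict.fromkeys(matched))
    return unique or [default]


for path in ("src/tests/test_scan.py", "README.rst", "src/main.c"):
    print(path, assign_facets(path))
```

A classifier could later replace or supplement the static rule list while keeping the same function signature.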
See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode
Here are some actual tools for general file type and programming language detection:
In use today:
(We also use a Shannon entropy detector and binaryornot to detect binaries.)
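For readers unfamiliar with the entropy check, the following is a generic sketch of Shannon entropy over bytes. It is not ScanCode's actual implementation. High values (close to 8 bits per byte) usually indicate compressed, encrypted, or otherwise binary data.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


print(shannon_entropy(b"aaaaaaaa"))        # a single repeated byte
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Any cutoff used to call a file "binary" would be a heuristic that needs tuning on real files.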
Things to look at could include:
See also: #1036 #1012 and #426 #1355 #1201