aboutcode-org / typecode

Incorrect Pygments lexer guessed #8

Open pombredanne opened 3 years ago

pombredanne commented 3 years ago

See https://github.com/pygments/pygments/issues/1563

With the attached file (renamed to use an uppercase .JAVA extension), the lexer is incorrectly guessed as Python. See Logger.JAVA.txt

When the extension is lowercased to '.java', the lexer is guessed correctly from the filename. In all cases, the content-based guess is incorrect.

$ python
Python 3.6.10 (default, Jun 13 2020, 08:53:46) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pygments
>>> pygments.__version__
'2.7.1'
>>> from pygments import lexers 
>>> fn='Logger.JAVA'
>>> code=open(fn).read()
>>> lexers.get_lexer_for_filename(fn, code)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/lib/python3.6/site-packages/pygments/lexers/__init__.py", line 210, in get_lexer_for_filename
    raise ClassNotFound('no lexer for filename %r found' % _fn)
pygments.util.ClassNotFound: no lexer for filename 'Logger.JAVA' found
>>> lexers.guess_lexer(code)
<pygments.lexers.Python2Lexer>
>>> fn='Logger.java'
>>> lexers.get_lexer_for_filename(fn, code)
<pygments.lexers.JavaLexer>

thatch commented 3 years ago

The heuristics behind content-only guess_lexer need a complete rewrite. For speed, it does not even try to tokenize the source, and although each lexer can return a float score (which is then ranked), the scores are not balanced against each other, because the lexers were written by different people.
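
To illustrate why unbalanced scores lead to wrong guesses, here is a toy sketch. It is not Pygments code: the two scoring functions, their regexes, and the values 0.3 and 0.2 are all made up for illustration. It only mimics the general idea of ranking per-language float scores.

```python
import re


# Hypothetical scorers (NOT taken from Pygments). Each returns a float
# confidence, chosen independently by an imaginary lexer author.
def score_python(text):
    # Generous score for a keyword that Java source also contains.
    return 0.3 if re.search(r"\bimport\b", text) else 0.0


def score_java(text):
    # More specific pattern, but a more conservative score.
    return 0.2 if re.search(r"\bpublic\s+class\b", text) else 0.0


def guess(text, scorers):
    """Return the name whose scorer gives the highest float for text."""
    return max(scorers, key=lambda name: scorers[name](text))


java_source = "import java.util.logging.Level;\npublic class Logger {}\n"
best = guess(java_source, {"python": score_python, "java": score_java})
print(best)
```

Both scorers match the Java snippet, but the Python scorer wins only because its author picked a larger number, not because its evidence is stronger.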

I would suggest lowercasing the filename before calling get_lexer_for_filename as a very reasonable workaround.
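
A minimal sketch of that workaround follows. The helper names lowercase_extension and get_lexer_ignoring_extension_case are hypothetical and belong to neither Pygments nor typecode. The sketch also makes two assumptions beyond the suggestion above: it lowercases only the extension, and it tries the original filename first.

```python
import os


def lowercase_extension(filename):
    """Return filename with only its extension lowercased."""
    root, ext = os.path.splitext(filename)
    return root + ext.lower()


def get_lexer_ignoring_extension_case(filename, code=None):
    """Look up a lexer by filename, retrying with a lowercased extension.

    Hypothetical helper: tries the filename as given first, then falls
    back to the lowercased extension if Pygments finds no lexer.
    """
    # Imported lazily so lowercase_extension works without Pygments.
    from pygments import lexers
    from pygments.util import ClassNotFound

    try:
        return lexers.get_lexer_for_filename(filename, code)
    except ClassNotFound:
        # Raises ClassNotFound again if the lowercased name is unknown too.
        return lexers.get_lexer_for_filename(lowercase_extension(filename), code)
```

Trying the original name first keeps case-sensitive patterns working, since names such as Makefile or an uppercase .S extension can map to a lexer as given. Lowercasing only the extension leaves the base name untouched.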

ashiscs commented 3 years ago

Hello, I am new to this community. Can I work on this issue?