aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Incorrect programming language reported for a config.h.in file #605

Open pombredanne opened 7 years ago

pombredanne commented 7 years ago
FileName: ./config.h.in
IsSource: True
IsScript: False
ProgLanguage: JavaScript+Lasso

From https://raw.githubusercontent.com/vysheng/tg/master/config.h.in. If I upgrade pygments to 2.2.0, the programming language is reported as "Rexx". We need to supplement this with a file name and extension registry: after all, config.h.in is a well-known autotools pattern.

This is also closely related to #426

Reported by tglx

pombredanne commented 6 years ago

This is now reported as "ASCII text" which is quite acceptable given that this is a generated file.

D-lang14 commented 3 years ago

Hey!! @pombredanne I want to help you with this issue. May I know what I should do about that? Give me a chance that I could understand the system and the issue.

pombredanne commented 3 years ago

@D-lang14 sorry for the late reply. The key issue is that we are putting too much emphasis on using Pygments and libmagic to detect programming languages. A config.h should ALWAYS be reported as C/C++. So this would consist of:

  1. study the way we detect and report types in typecode
  2. in particular, check out how we use a registry of extensions
  3. evolve the way we report programming languages to work first (and only?) from the registry for some cases
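
The registry-first idea in step 3 could look roughly like the sketch below. This is only an illustration under stated assumptions: `FILENAME_REGISTRY`, `detect_language` and the `fallback` callable are hypothetical names, the entries are made up, the "C" label is a placeholder for whatever typecode would report, and none of this is typecode's actual code or data.

```python
import fnmatch
import os

# Hypothetical registry of file name globs mapped to a language label.
# Illustrative entries only; this is not typecode's real registry.
FILENAME_REGISTRY = [
    ("config.h.in", "C"),
    ("*.h.in", "C"),
    ("*.h", "C"),
    ("*.c", "C"),
]


def detect_language(path, fallback=None):
    """Return the registry language for path, else ask the fallback detector."""
    name = os.path.basename(path)
    for pattern, language in FILENAME_REGISTRY:
        # Case-sensitive glob match against the base file name only.
        if fnmatch.fnmatchcase(name, pattern):
            return language
    # Content-based detection (a stand-in for Pygments or libmagic) is
    # consulted only when the registry has no answer.
    return fallback(path) if fallback else None
```

With this ordering, a well-known file name such as config.h.in gets its answer from the registry, and a content-based guess like "JavaScript+Lasso" or "Rexx" is never consulted for it.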

Mitrajit commented 3 years ago

Hello @pombredanne, I have studied the typecode library thoroughly and made some changes to pygments_lexers_mapping.py in site-packages, and I am getting the desired results. However, it seems that Lib\site-packages is git-ignored. Where should I make the changes so that they are reflected after configuring? I aspire to join AboutCode in GSoC-21.

Mitrajit commented 3 years ago

@pombredanne A slight change in typecode can give the correct results by changing

'CLexer': ('typecode._vendor.pygments.lexers.c_cpp', 'C', ('c',), ('*.c', '*.h', '*.idc'), ('text/x-chdr', 'text/x-csrc')),

to

'CLexer': ('typecode._vendor.pygments.lexers.c_cpp', 'C', ('c',), ('*.c', '*.h', '*.idc','config.h.in'), ('text/x-chdr', 'text/x-csrc')),
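
To see why the extra pattern would matter, here is a small standard-library sketch. It assumes the lexer file name patterns are compared against the base file name with fnmatch-style globbing, which is how Pygments is understood to match file names; the `matches` helper is hypothetical and not part of typecode or Pygments.

```python
import fnmatch

# File name patterns of the 'CLexer' entry before and after the proposed change.
old_patterns = ("*.c", "*.h", "*.idc")
new_patterns = ("*.c", "*.h", "*.idc", "config.h.in")


def matches(name, patterns):
    """Return True if the base file name matches any glob pattern."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


print(matches("config.h.in", old_patterns))  # '*.h' requires the name to end in '.h'
print(matches("config.h.in", new_patterns))  # the literal pattern matches this one name
print(matches("other.h.in", new_patterns))   # other '.h.in' templates are still not covered
```

The literal 'config.h.in' pattern covers only that exact name. A broader glob such as '*.h.in' would be needed to cover other autotools header templates, which is a design choice left open here.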