fossology / atarashi

Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.
http://fossology.github.io/atarashi
GNU General Public License v2.0

Comment extraction not working on curling quotes #21

GMishx closed this issue 6 years ago

GMishx commented 6 years ago

Curly quotes (such as “, ”, ‘ and ’) are not filtered by the comment extractor, which leads to incorrect results.

A more extensive listing of problematic word characters:

Character  Code point  ASCII  Name
–          \u2013      -      En dash
—          \u2014      -      Em dash
―          \u2015      -      Horizontal bar
‘          \u2018      '      Left single quotation mark
’          \u2019      '      Right single quotation mark
‚          \u201a      ,      Single low-9 quotation mark
‛          \u201b      '      Single high-reversed-9 quotation mark
“          \u201c      "      Left double quotation mark
”          \u201d      "      Right double quotation mark
„          \u201e      "      Double low-9 quotation mark
…          \u2026      ...    Horizontal ellipsis
′          \u2032      '      Prime
″          \u2033      "      Double prime
©          \u00a9      (c)    Copyright sign
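
The mapping in the table above could be applied as a normalization step before comment extraction. The following is only a minimal sketch under assumptions: the names `ASCII_REPLACEMENTS` and `normalize_punctuation` are hypothetical and are not taken from the atarashi code base, and the actual fix may look different.

```python
# Hypothetical sketch: map the characters listed in the table to ASCII.
# The names below are illustrative and not part of atarashi.
ASCII_REPLACEMENTS = {
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2015": "-",    # horizontal bar
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201a": ",",    # single low-9 quotation mark
    "\u201b": "'",    # single high-reversed-9 quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2026": "...",  # horizontal ellipsis
    "\u2032": "'",    # prime
    "\u2033": '"',    # double prime
    "\u00a9": "(c)",  # copyright sign
}

# str.maketrans accepts a dict with single-character keys;
# the values may be strings of any length.
TRANSLATION_TABLE = str.maketrans(ASCII_REPLACEMENTS)


def normalize_punctuation(text: str) -> str:
    """Return text with the listed Unicode characters replaced by ASCII."""
    return text.translate(TRANSLATION_TABLE)


if __name__ == "__main__":
    sample = "\u201cLicensed under GPL\u201d \u2013 \u00a9 2018\u2026"
    print(normalize_punctuation(sample))
```

Using `str.translate` keeps the replacement to a single pass over the text, and characters that are not in the mapping are left untouched.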
amanjain97 commented 6 years ago

@GMishx Thanks for the issue. Please check #22