aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

add a copyright year value to JSON output #1185

Open jhgoebbert opened 6 years ago

jhgoebbert commented 6 years ago

In a CI scenario, one might want to check whether a file's copyright notice includes the year of the last commit that changed the file. To support this, ScanCode Toolkit could provide extra "copyright year values" in the JSON output (from_year, to_year).

Simple example for a pre-commit hook to check the year: http://damien.lespiau.name/2013/01/a-git-pre-commit-hook-to-check-year-of.html

pombredanne commented 6 years ago

@jhgoebbert good idea! I could see two different ways to do this:

  1. Since returning years as part of copyright detection was removed internally not long ago (it was not used anywhere), you could add it back and update the API accordingly. This is likely to be a tad involved and touches several internals.

  2. You could create a plugin that works on the detected copyright and does a simpler detection of years only (which may be easier since it would work only from a copyright statement, without any noise).

In either case, I think a new command line option (e.g. --copyright-year) to return this would be best.
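As a rough illustration of the second option, the sketch below pulls years out of an already-detected copyright statement and reports them under the from_year/to_year names suggested in the issue. The function name, the regular expression, and the returned dictionary are assumptions made for illustration; none of this is ScanCode's actual API or output format. Abbreviated ranges such as "2010-18" are not handled.

```python
import re

# Matches a four-digit year from 1900 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def copyright_years(statement):
    """Return hypothetical from_year/to_year values for one copyright statement.

    Both values are None when the statement contains no recognizable year.
    """
    years = [int(y) for y in YEAR.findall(statement)]
    if not years:
        return {"from_year": None, "to_year": None}
    return {"from_year": min(years), "to_year": max(years)}


print(copyright_years("Copyright (c) 2010-2018 Forschungszentrum Juelich GmbH"))
```

Working from the detected statement rather than the raw file keeps the pattern simple, because unrelated four-digit numbers elsewhere in the file never reach it.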

jhgoebbert commented 6 years ago

Thanks for your comment. I did not know that ScanCode used to be able to return the year.

Just to show how the copyright year could be used, here is version 0.1 of my CI script, which checks whether each source code file carries the year of its last commit:

#!/bin/bash
source "$(cd "$(dirname "$0")" && pwd -P)/ci_funcs.sh"

# error counter; the badge is created at the end, once all checks have run
ERR=0

# settings
CHECKDIRS='src,test'
CHECKTYPES='.*\.(cpp|h)$'

REF_COPYRIGHT_HOLDERS="Forschungszentrum Juelich GmbH,Juelich Supercomputing Centre.,"
REF_LICENSE_KEY="bsd-new"
REF_LICENSE_SCORE=98.0

# loop over all directories to check
for cdir in ${CHECKDIRS//,/ }; do
  [[ ! -d ${cdir} ]] && echo "${cdir} does not exist" && continue
  echo "... checking ${cdir}"

  # collect copyright & license information
  json_allfiles=$(scancode -c -l -p -e --quiet --format json ${cdir})
  [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

  # iterate over all files of git repository
  filelist=$(git ls-tree -r --name-only "HEAD:${cdir}")
  while IFS= read -r filename; do
    if [[ ${filename} =~ ${CHECKTYPES} ]]; then

      filepath=${cdir}/${filename}
      echo "   ... checking ${filepath}"

      # extract license & copyright information of $filepath
      json_file=$(echo "${json_allfiles}" | jq -r --arg path "${filepath}" '.files[] | select(.path == $path)')
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      ##### check copyright #####

      c_statement0=$(echo $json_file | jq -r '. | {copyrights}[] | .[0] | {statements}[][]')
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      # check copyright year
      c_toyear0=$(echo ${c_statement0} | grep  -oE "[0-9]{4}" | tail -n 1)
      mod_year=$(git log -1 --format="%ad" --date=short -- ${filepath} | head -c 4)
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      if [ "${c_toyear0}" != "${mod_year}" ]; then ERR=$(($ERR+1))
        echo "          ${c_statement0}"
        echo "          last-mod-year:${mod_year}_!=_copyright-year:${c_toyear0}_"
      fi

      # check copyright holder
      c_holders0=$(echo $json_file | jq -r '. | {copyrights}[] | .[0] | {holders}[][]' |tr '\n' ',')
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      if [ "$c_holders0" != "${REF_COPYRIGHT_HOLDERS}" ]; then ERR=$(($ERR+1))
        echo "          ${c_statement0}"
        echo "          ref-copyright-holder:${REF_COPYRIGHT_HOLDERS}_!=_copyright-holder:${c_holders0}"
      fi

      ##### check license #####

      # check license key
      l_key0=$(echo $json_file | jq -r '. | {licenses}[] | .[0] | {key}[]')
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      if [ "$l_key0" != "${REF_LICENSE_KEY}" ]; then ERR=$(($ERR+1))
        echo "          ref-license-key:${REF_LICENSE_KEY}_!=_license-key:${l_key0}"
      fi

      # check license score
      l_score0=$(echo $json_file | jq -r '. | {licenses}[] | .[0] | {score}[]' | bc)
      [[ $? -ne 0 ]] && ERR=$(($ERR+1)) && continue

      if [ "${l_score0%%.*}" -lt "${REF_LICENSE_SCORE%%.*}" ]; then ERR=$(($ERR+1))
        echo "          ref-license-score:${REF_LICENSE_SCORE%%.*}_>_license-score:${l_score0%%.*}"
      fi

    fi
  done <<< "$filelist"
done

# create badge
if [[ ${ERR} == 0 ]]; then
   create_badge "${BADGE_FILENAME}" license-check "passed" --color=green
   pushbadge_exit "${BADGE_FILENAME}" 0
fi
create_badge "${BADGE_FILENAME}" license-check "${ERR}" --color=red
pushbadge_exit "${BADGE_FILENAME}" 1

pombredanne commented 6 years ago

Very nice! :+1: Let me see how I could get something started. Do you know some Python, and would you feel like working on a PR?

0xc0170 commented 2 years ago

We discussed whether ScanCode could capture the copyright year range, which is how I found this issue.

This would be useful for checking whether the files changed in a merge request also had their copyright year updated. People usually forget this and just leave the line as it is.

I assume a simple check would be enough if we can get this data from ScanCode: if the current calendar year is not in the copyright year range, report an error.
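A minimal sketch of that check, assuming the year ranges are already available as inclusive (start, end) tuples. The function names and the input shape are hypothetical, since ScanCode does not expose such data in the results discussed here.

```python
from datetime import date


def year_is_covered(year_ranges, year=None):
    """Return True if `year` falls inside any inclusive (start, end) range.

    `year` defaults to the current calendar year.
    """
    if year is None:
        year = date.today().year
    return any(start <= year <= end for start, end in year_ranges)


def check_file(path, year_ranges, year=None):
    """Return an error message for a stale notice, or None if it is current."""
    if year is None:
        year = date.today().year
    if year_is_covered(year_ranges, year):
        return None
    return f"{path}: copyright years {year_ranges} do not cover {year}"
```

In a merge-request job, `check_file` would be called once per changed file, and any non-None result would fail the job. Passing `year` explicitly keeps the check reproducible in tests.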

pombredanne commented 2 years ago

@0xc0170 The years and year ranges are collected internally here: https://github.com/nexB/scancode-toolkit/blob/28578ad660a7145eb954c4f2e2f1bc29a94ff787/src/cluecode/copyrights.py#L85 but are then discarded. It would be possible to optionally keep them and return them as a separate attribute, called something like year_ranges, on the CopyrightDetection and maybe the HolderDetection objects in https://github.com/nexB/scancode-toolkit/blob/28578ad660a7145eb954c4f2e2f1bc29a94ff787/src/cluecode/copyrights.py#L437, which could then be exposed in the results.
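To make the idea concrete, here is a sketch of what a detection object carrying a year_ranges attribute might look like. `DetectionSketch` is a standalone, hypothetical stand-in and not ScanCode's CopyrightDetection class; its other field names, the helper function, and the regular expression are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field

# A year, optionally followed by a hyphen or en dash and a second year ("2010-2015").
YEAR_RANGE = re.compile(
    r"(?<!\d)((?:19|20)\d{2})(?:\s*[-\u2013]\s*((?:19|20)\d{2}))?(?!\d)"
)


@dataclass
class DetectionSketch:
    """Hypothetical stand-in for a detection object; NOT ScanCode's own class."""
    copyright: str
    start_line: int
    end_line: int
    year_ranges: list = field(default_factory=list)


def with_year_ranges(copyright, start_line, end_line):
    """Build a DetectionSketch whose year_ranges holds inclusive (start, end) tuples."""
    ranges = []
    for first, last in YEAR_RANGE.findall(copyright):
        start = int(first)
        end = int(last) if last else start  # a lone year is a one-year range
        ranges.append((min(start, end), max(start, end)))
    return DetectionSketch(copyright, start_line, end_line, ranges)


detection = with_year_ranges("Copyright (c) 2010-2015, 2018 nexB Inc.", 1, 1)
print(detection.year_ranges)
```

Keeping ranges as a list of tuples, rather than a single from/to pair, preserves gaps such as "2010-2015, 2018", which matters if a check needs to know whether a specific year is actually covered.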