LLNL / scraper

Python library for getting metadata from source code hosting tools
MIT License
49 stars 23 forks source link

Do not redirect `stderr` to `stdout` when executing commands #73

Closed mcdonnnj closed 1 year ago

mcdonnnj commented 1 year ago

Update the execute() helper function to no longer pipe stderr to stdout when executing commands.

If cloc has an issue processing a file it will output an error message to stderr, and with stderr and stdout combined this breaks the ability to read the JSON output on stdout. Since cloc is designed to process text files formatted for human consumption it will often error when processing minified JavaScript files of enough size, as the timeout for processing is based on the number of lines in a file. The error message(s) for such files then breaks the ability to get information about the rest of the project.

On the main branch:

2023-03-15 11:47:36,546 - INFO: Connected to: https://github.com
2023-03-15 11:47:36,977 - INFO: Processing: cisagov/cset
2023-03-15 11:49:44,728 - ERROR: Error Decoding: url=https://github.com/cisagov/cset.git, out=
3 errors:
Line count, exceeded timeout:  /tmp/tmph8eqj0h3/clone-dir/CSETWebApi/CSETWeb_Api/CSETWeb_ApiCore/Diagram/src/main/webapp/js/app.min.js
Line count, exceeded timeout:  /tmp/tmph8eqj0h3/clone-dir/CSETWebApi/CSETWeb_Api/CSETWeb_ApiCore/Diagram/src/main/webapp/math/extensions/a11y/mathjax-sre.js
Line count, exceeded timeout:  /tmp/tmph8eqj0h3/clone-dir/CSETWebNg/src/app/models/xmlCompletionItemProvider.model.ts
{"header" : {
  "cloc_url"           : "github.com/AlDanial/cloc",
  "cloc_version"       : "1.94",
<omitted>
"SUM": {
  "blank": 104009,
  "comment": 101199,
  "code": 1661696,
  "nFiles": 5391} }

2023-03-15 11:49:45,427 - INFO: SLOC: 0
2023-03-15 11:49:45,427 - INFO: labor_hours: 0
2023-03-15 11:49:46,045 - INFO: Number of Projects: 1
2023-03-15 11:49:46,047 - INFO: Writing output to: code.json

On this branch:

2023-03-15 12:01:56,303 - INFO: Connected to: https://github.com
2023-03-15 12:01:56,725 - INFO: Processing: cisagov/cset
2023-03-15 12:04:23,443 - WARNING: Error encountered while analyzing: url=https://github.com/cisagov/cset.git stderr=
3 errors:
Line count, exceeded timeout:  /tmp/tmpc_di791i/clone-dir/CSETWebApi/CSETWeb_Api/CSETWeb_ApiCore/Diagram/src/main/webapp/js/app.min.js
Line count, exceeded timeout:  /tmp/tmpc_di791i/clone-dir/CSETWebApi/CSETWeb_Api/CSETWeb_ApiCore/Diagram/src/main/webapp/math/extensions/a11y/mathjax-sre.js
Line count, exceeded timeout:  /tmp/tmpc_di791i/clone-dir/CSETWebNg/src/app/models/xmlCompletionItemProvider.model.ts

2023-03-15 12:04:24,025 - INFO: SLOC: 1661696
2023-03-15 12:04:24,025 - INFO: labor_hours: 1555362
2023-03-15 12:04:24,468 - INFO: Number of Projects: 1
2023-03-15 12:04:24,468 - INFO: Writing output to: code.json
IanLee1521 commented 1 year ago

This looks good to me, thanks!

One question, in the case where there is an error, what ends up in out ? Should there be any sort of "make sure the cloc value is something reasonable, 0 maybe?"

mcdonnnj commented 1 year ago

This looks good to me, thanks!

One question, in the case where there is an error, what ends up in out ? Should there be any sort of "make sure the cloc value is something reasonable, 0 maybe?"

So if cloc crashed completely then it should have a non-0 return code which is handled by https://github.com/LLNL/scraper/blob/e19717f00578068345ac683210f2154977ed1e66/scraper/util.py#L29-L34

In the case of no output (no files found but cloc ran successfully) then it would get caught by https://github.com/LLNL/scraper/blob/e19717f00578068345ac683210f2154977ed1e66/scraper/util.py#L148-L150

Otherwise it should still produce output. As an example I created a temporary directory that only contains the <cset repository root>/CSETWebApi/CSETWeb_Api/CSETWeb_ApiCore/Diagram/src/main/webapp/js/app.min.js file from the PR description.

Directory contents:

$ ls /tmp/scraper_testing/
app.min.js

Running cloc as the scraper package would:

$ cloc --json /tmp/scraper_testing/

1 error:
Line count, exceeded timeout:  /tmp/scraper_testing/app.min.js
{"header" : {
  "cloc_url"           : "github.com/AlDanial/cloc",
  "cloc_version"       : "1.94",
  "elapsed_seconds"    : 12.0404329299927,
  "n_files"            : 1,
  "n_lines"            : 12136,
  "files_per_second"   : 0.0830534919977008,
  "lines_per_second"   : 1007.9371788841},
"JavaScript" :{
  "nFiles": 1,
  "blank": 1,
  "comment": 12135,
  "code": 0},
"SUM": {
  "blank": 1,
  "comment": 12135,
  "code": 0,
  "nFiles": 1} }

Saving stdout and stderr separately:

$ cloc --json /tmp/scraper_testing/ 2>testing.err 1>testing.out
$ cat testing.out
{"header" : {
  "cloc_url"           : "github.com/AlDanial/cloc",
  "cloc_version"       : "1.94",
  "elapsed_seconds"    : 12.0349020957947,
  "n_files"            : 1,
  "n_lines"            : 12136,
  "files_per_second"   : 0.0830916605752387,
  "lines_per_second"   : 1008.4003927411},
"JavaScript" :{
  "nFiles": 1,
  "blank": 1,
  "comment": 12135,
  "code": 0},
"SUM": {
  "blank": 1,
  "comment": 12135,
  "code": 0,
  "nFiles": 1} }
$ cat testing.err

1 error:
Line count, exceeded timeout:  /tmp/scraper_testing/app.min.js

Does that address your concerns?

IanLee1521 commented 1 year ago

Yup, thanks!