Count code from other subdirectories called in runtests.jl

j-fu commented 1 year ago

Hi, great package idea!

In several packages, e.g. VoronoiFVM.jl, I run code from an examples subdirectory and also a couple of Pluto notebooks during CI. Currently these are not counted as lines of test code.

Is there a way to handle this situation other than moving all examples to test?
If I have Pluto notebooks containing manifests for tests, how can I avoid "cheating" by having the manifests counted as code lines?

ericphanson commented 1 year ago

> Is there a way to handle this situation other than moving all examples to test?

Hm, so PackageAnalyzer only categorizes code as being in test or not in the show method; the actual Package object itself just stores a table with lines of code per file.

So one option is to just ignore what is displayed in the show method, and count the lines yourself:

julia> pkg = analyze("VoronoiFVM")
Package VoronoiFVM:
  * repo: https://github.com/j-fu/VoronoiFVM.jl.git
  * uuid: 82b139dc-5afc-11e9-35da-9b9bdfd336f3
  * version: 0.18.3
  * is reachable: true
  * tree hash: a5f4bc559684925f45104513f9abd65570be86ff
  * Julia code in `src`: 4926 lines
  * Julia code in `test`: 585 lines (10.6% of `test` + `src`)
  * documentation in `docs`: 1172 lines (19.2% of `docs` + `src`)
  * documentation in README: 10 lines
  * has license(s) in file: MIT
    * filename: LICENSE
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions

julia> PackageAnalyzer.count_julia_loc(pkg, "test") + PackageAnalyzer.count_julia_loc(pkg, "examples")
3115

However, if the goal is to communicate to other folks how much test code there is (who may not know that the code is in examples), I'm not sure what the best way to do that is. If we integrated JuliaSyntax (xref https://github.com/JuliaEcosystem/PackageAnalyzer.jl/issues/63), we could try to look at any include statements from runtests.jl and follow them. That might be the most satisfying way, and would also make sure that extraneous files in test that aren't actually run don't count.
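
A rough sketch of the include-following idea, using Base's `Meta.parseall` as a stand-in for JuliaSyntax. The helper names `literal_includes!` and `reachable_files` are made up for illustration and are not part of PackageAnalyzer. It only follows `include` calls whose argument is a string literal, so a runtests.jl that scans a directory and includes files by computed path would not be covered:

```julia
# Collect the arguments of `include("literal.jl")` calls anywhere in an expression.
# Calls with computed arguments (e.g. `include(joinpath(...))`) are skipped.
function literal_includes!(found::Vector{String}, ex)
    if ex isa Expr
        if ex.head === :call && length(ex.args) == 2 &&
           ex.args[1] === :include && ex.args[2] isa String
            push!(found, ex.args[2])
        end
        foreach(arg -> literal_includes!(found, arg), ex.args)
    end
    return found
end

# Starting from `root` (e.g. test/runtests.jl), return the set of files
# reachable through literal includes. Paths are resolved relative to the
# including file, as `include` does.
function reachable_files(root::AbstractString, visited=Set{String}())
    path = normpath(root)
    (path in visited || !isfile(path)) && return visited
    push!(visited, path)
    ex = Meta.parseall(read(path, String); filename=path)
    for inc in literal_includes!(String[], ex)
        reachable_files(joinpath(dirname(path), inc), visited)
    end
    return visited
end
```

Lines of test code could then be counted over `reachable_files("test/runtests.jl")` instead of over everything under `test`.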

> If I have Pluto notebooks containing manifests for tests, how can I avoid "cheating" by having the manifests counted as code lines?

Ideally, that would be a file with language Julia and sublanguage TOML. However, if I look at all the lines of code parsed from VoronoiFVM,

julia> DataFrame(pkg.lines_of_code)
13×7 DataFrame
 Row │ directory       language  sublanguage  files  code   comments  blanks
     │ String          Symbol    Union…       Int64  Int64  Int64     Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │ pluto-examples  Julia                      9   9435       784     473
   2 │ pluto-examples  TOML                       2   1089         1     241
   3 │ src             Julia                     19   4926       300    1033
   4 │ examples        Julia                     33   2530       919     881
   5 │ test            Julia                      9    585       123     185
   6 │ test            TOML                       1     17         0       1
   7 │ docs            Julia                      1     97        24      27
   8 │ docs            TOML                       1     13         0       1
   9 │ docs            TeX                        1    318         1      59
  10 │ docs            Markdown                  14      0       719     201
  11 │ docs            Markdown  Julia            2     13         0       0
  12 │ Project.toml    TOML                       1     44         0       2
  13 │ README.md       Markdown                   1      0        10       7

I don't see any sublanguage TOML there. I suspect that is something that would have to be improved in tokei, the program we use to count lines of code, or by switching to a different program.

j-fu commented 1 year ago

Pluto notebooks have two strings which contain the TOML contents: PLUTO_MANIFEST_TOML_CONTENTS and PLUTO_PROJECT_TOML_CONTENTS. Not sure if tokei can be taught to ignore them. The manifests in particular are quite large and would skew the picture.
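
One possible workaround outside of tokei, as a minimal sketch: count notebook lines in Julia and skip the two embedded TOML strings. This assumes each one is written as a top-level assignment that opens with a triple quote on the assignment line and closes with a triple quote at the start of a later line; `count_notebook_code_lines` is a hypothetical helper, and the comment handling is much cruder than tokei's (any line starting with `#` is treated as a comment).

```julia
# Names of the embedded TOML strings mentioned above.
const TOML_VARS = ("PLUTO_PROJECT_TOML_CONTENTS", "PLUTO_MANIFEST_TOML_CONTENTS")

# Count non-blank, non-comment lines, ignoring the embedded project/manifest.
function count_notebook_code_lines(text::AbstractString)
    count = 0
    in_toml = false
    for line in eachline(IOBuffer(text))
        if in_toml
            # The embedded TOML ends at the closing triple quote.
            startswith(line, "\"\"\"") && (in_toml = false)
            continue
        end
        if any(v -> startswith(line, v * " = \"\"\""), TOML_VARS)
            in_toml = true
            continue
        end
        stripped = strip(line)
        (isempty(stripped) || startswith(stripped, "#")) && continue
        count += 1
    end
    return count
end
```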

As for counting code in additional subdirectories, I have no idea how to catch all possible corner cases in an automated way. For example, I scan the examples subdirectory in runtests.jl (and have passed this pattern on to other authors...).

Here is what came to mind: would it make sense to have a configuration file in the repo giving more information about the package structure and the semantics of some subdirectories? Something like a possible PackageAnalyzer.toml:

[TestSubdirs]
test
examples

[SourceSubdirs]
src
assets

[DocSubdirs]
docs
examples

In my case, examples count twice: they are part of docs (via Literate.jl) and part of tests. And assets, for example, could contain JavaScript code.

Any output derived from this information, beyond the standard subdirectories, could be marked as additional info supplied by the package author.
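
To make the idea concrete, here is a sketch of how such a file might be read with the TOML stdlib. The bare directory names in the sketch above are not valid TOML as written, so this assumes each table carries a `dirs` array instead; that key name and the `categories_for` helper are invented here for illustration only.

```julia
using TOML

# Hypothetical PackageAnalyzer.toml contents, using the table names from the
# proposal and an assumed `dirs` key holding the subdirectories.
config_text = """
[TestSubdirs]
dirs = ["test", "examples"]

[SourceSubdirs]
dirs = ["src", "assets"]

[DocSubdirs]
dirs = ["docs", "examples"]
"""

config = TOML.parse(config_text)

# Return every category (table name) that lists `dir`, in sorted order.
function categories_for(config, dir)
    return sort([name for (name, section) in config if dir in section["dirs"]])
end

categories_for(config, "examples")
```

With this layout, `examples` comes back under both `DocSubdirs` and `TestSubdirs`, which matches the double counting described above.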

ericphanson commented 1 year ago

Hm, I think a config could make sense. Do you know if there are any existing formats or systems we could use?

j-fu commented 1 year ago

My suggestion is just TOML :) The parser is in the stdlib, the syntax is simple, and every package author already knows the format.

ericphanson commented 1 year ago

Ah right, I got that. I just meant that if we could opt into an existing system for the semantics of it, that might be better than inventing our own.

For example, linguist uses .gitattributes files to declare that certain files are in certain languages when autodetection fails: https://github.com/github/linguist/blob/master/docs/overrides.md. We could use that system as well, adding a syntax for declaring that certain files belong to certain categories (such as test).

ericphanson commented 1 year ago

I think using .gitattributes for this makes sense. Something like

examples/**.jl analyzer-category=test

would mean: all .jl files in examples should be assigned the "category" test. This can be coupled with git check-attr. For example, if I have a .gitattributes containing

test/**.jl analyzer-category=test

Then in the shell, I can check particular files like

❯ git check-attr analyzer-category test/runtests.jl
test/runtests.jl: analyzer-category: test

❯ git check-attr analyzer-category src/PackageAnalyzer.jl 
src/PackageAnalyzer.jl: analyzer-category: unspecified

So then the lines_of_code table can have an additional column for "category" (maybe w/ some additional logic to determine default category from the directory), and the show method can use this category to determine what to print for lines of test code vs src code.
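
A minimal sketch of that lookup, assuming the `path: attribute: value` output format shown above. All function names are hypothetical, and the fallback rule (top-level directory if it is `src`, `test`, or `docs`, otherwise `"other"`) is just one possible default:

```julia
# Ask git for the analyzer-category attribute of `path` inside `repo`.
# Returns a line such as "test/runtests.jl: analyzer-category: test".
function check_attr_line(repo::AbstractString, path::AbstractString)
    return readchomp(Cmd(`git check-attr analyzer-category -- $path`; dir=repo))
end

# Extract the attribute value, or `nothing` if git reports it as unspecified.
# Splitting from the right keeps paths that contain ": " intact.
function parse_check_attr(line::AbstractString)
    parts = rsplit(line, ": "; limit=3)
    length(parts) == 3 || error("unexpected `git check-attr` output: $line")
    value = parts[3]
    return value == "unspecified" ? nothing : String(value)
end

# Assumed default: derive the category from the top-level directory.
function default_category(path::AbstractString)
    top = first(splitpath(path))
    return top in ("src", "test", "docs") ? top : "other"
end

# The explicit attribute wins; otherwise fall back to the directory default.
function category(path::AbstractString, attr_line::AbstractString)
    return something(parse_check_attr(attr_line), default_category(path))
end
```

In use, `category(path, check_attr_line(repo, path))` would fill the extra column for each file.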

The nice thing about using tools like .gitattributes and git check-attr is that they already have a well-understood syntax (basically the same as .gitignore) and tooling that supports nested files and overrides.

E.g. you could have a .gitattributes at top-level in your package, and then override it with another one in some subfolder somewhere (and it would only override attributes in that subfolder).

Also, some repos might already have a .gitattributes file, so this would mean they wouldn't need an additional file. We also don't have to document the format ourselves, and can just link out to existing docs.

j-fu commented 1 year ago

Interesting - didn't know about this possibility. It seems that this might work well.

JuliaEcosystem / PackageAnalyzer.jl

Count code from other subdirectories called in runtests.jl #78