Merck / pkglite

Compact Package Representations
https://merck.github.io/pkglite/
GNU General Public License v3.0

Expand file extension dictionary #31

Closed nanxstats closed 2 years ago

nanxstats commented 2 years ago

This PR extends the file extension dictionary to include commonly used file extensions such as .stan, and thus fixes #20.

Impact

With this change, the regular collate(..., file_auto("inst/")) and collate(..., file_root_core()) calls should pick up the .stan files and the configuration files created by use_rstan(), and collate and tag them as text files.
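As a rough usage sketch (not taken from the patch itself), the calls above could be combined as follows. The package path is a placeholder and an assumption on my side; the block only does something when pkglite is installed and the path exists.

```r
# Placeholder path (assumption): replace with the root of a source package
# that was set up with use_rstan().
pkg <- "path/to/your/package"

if (requireNamespace("pkglite", quietly = TRUE) && dir.exists(pkg)) {
  # Collate the core root-level files plus everything under inst/
  fc <- pkglite::collate(pkg, pkglite::file_root_core(), pkglite::file_auto("inst/"))
  print(fc)

  # Check whether "stan" is part of the text extension dictionary;
  # this should hold for a pkglite version that includes this patch.
  print("stan" %in% tolower(pkglite::ext_text(flat = TRUE)))
}
```

The printed file collection should then list the .stan files with a text format tag.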

Metrics

I calculated the coverage as the percentage of files, across all source packages on CRAN, whose extension is in the dictionary (data):

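# Read the tab-separated file extensions and count how often each one occurs
x <- readLines("exts.txt")
x <- tolower(unlist(strsplit(x, split = "\t")))
y <- sort(table(x), decreasing = TRUE)
eoi <- y # keep every extension

# One row per extension: guessed MIME type and number of files
df <- data.frame(
  "ext" = names(eoi),
  "mime" = mime::guess_type(paste0(".", names(eoi))),
  "count" = as.vector(eoi)
)

# Extensions known to pkglite (text + binary), restricted to those seen on CRAN
ext_pkglite <- unique(tolower(c(pkglite::ext_text(flat = TRUE), pkglite::ext_binary(flat = TRUE))))
ext_pkglite <- ext_pkglite[!is.na(match(ext_pkglite, df$ext))]

# Coverage: share of all files whose extension is in the dictionary
sum(df[match(ext_pkglite, df$ext), "count"]) / sum(df$count)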

Before patch: 88.85%. After patch: 96.65%.
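To make the metric concrete, here is a toy version of the same calculation. The counts and the two dictionaries are made up for illustration and are not the CRAN data; `coverage()` is a hypothetical helper, not a pkglite function.

```r
# Hypothetical helper: share of files whose extension is in the dictionary
coverage <- function(dict, counts) {
  sum(counts[names(counts) %in% dict]) / sum(counts)
}

# Made-up file counts per extension (NOT the real CRAN numbers)
counts <- c(r = 50, rd = 30, stan = 15, xyz = 5)

# Made-up dictionaries before and after adding "stan"
dict_before <- c("r", "rd")
dict_after <- c("r", "rd", "stan")

cov_before <- coverage(dict_before, counts) # 80 of 100 files
cov_after <- coverage(dict_after, counts)   # 95 of 100 files

print(c(before = cov_before, after = cov_after))
```

Because the metric is weighted by file count, adding a handful of frequently used extensions can move the coverage much more than adding many rare ones.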

Next step

A more fundamental fix for such issues is to separate the file capture rules from the file type tagging rules. The former should NOT be file extension-based and should be much more generic (by updating the current file spec definitions). The latter should be universal (via the dictionary, plus tagging all unknown extensions as binary). This will be done in issue https://github.com/Merck/pkglite/issues/18.
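A minimal sketch of what the universal tagging rule could look like, assuming a plain character vector as the text dictionary. `tag_file_type()` and the example dictionary are hypothetical and only illustrate the idea; this is not how pkglite implements it, and the actual design is left to the issue above.

```r
# Hypothetical tagging rule: a file is "text" only if its extension is in the
# text dictionary; everything else, including unknown or missing extensions,
# falls back to "binary".
tag_file_type <- function(paths, text_exts) {
  ext <- tolower(tools::file_ext(paths))
  ifelse(ext %in% tolower(text_exts), "text", "binary")
}

# Example dictionary (assumption, for illustration only)
text_exts <- c("r", "md", "stan")

paths <- c("inst/stan/model.stan", "man/figures/logo.png", "data/blob.xyz", "Makefile")
tags <- tag_file_type(paths, text_exts)
print(data.frame(path = paths, type = tags))
```

With a fallback like this, tagging no longer decides which files get captured: every captured file receives a type, and a missing dictionary entry only affects how the file is encoded, not whether it is included.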