dselivanov / rscalaKafka

Tiny R kafka client with rscala

Investigate how to remove the jar from git (overwrite it each time without tracking history) #5

Open dselivanov opened 7 years ago

dselivanov commented 7 years ago

The repo is already > 50 MB because git keeps every past version of the fat jar in its history.

r2evans commented 7 years ago

Not a solution to the original repo size, but some think using git-lfs is a good option. It reduces the size of clones by downloading only the file versions needed for the checked-out commit rather than every historical version. It does require every developer on the project to install git-lfs as well.

Is it possible to split the fat jar into multiple jars? The method that rkafka uses is to have one R package with rarely-changing Java dependencies (rkafkajars), which holds the majority of the file size, so the main package only needs to contain the one or two .class files that change frequently.
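For reference, a minimal sketch of what the git-lfs setup could look like, run here in a throwaway repository. It assumes the git-lfs extension is installed and that the jars should be matched by the pattern `*.jar`; neither detail comes from this repo, so adjust as needed. It only affects files added from now on and does not shrink existing history.

```shell
#!/bin/sh
set -e

# Skip the demo when the git-lfs extension is missing.
if ! git lfs version >/dev/null 2>&1; then
  echo "git-lfs is not installed; skipping demo" >&2
  exit 0
fi

# Throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Set up the LFS filters and hooks for this repository only.
git lfs install --local

# Route every *.jar through LFS; this writes a pattern line to .gitattributes.
git lfs track "*.jar"

# .gitattributes has to be committed so that other clones apply the same rule.
git add .gitattributes
cat .gitattributes
```

Jars committed after this point are stored in git as small pointer files, while the binary content lives in LFS storage.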

dselivanov commented 7 years ago

@r2evans thanks for the suggestion. I had thought about not compiling it into a fat jar. Will try.

r2evans commented 7 years ago

It would be nice if, when building your R package, it could dynamically create the jar file instead of making you do it behind the scenes. I don't know how that can be done easily (without adding external-to-R dependencies and custom packaging).

If you use a second package (e.g., rscalaKafkaJars), you could use:

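.onLoad <- function(libname, pkgname) {
  # Collect every jar installed under the companion package's java/ directory.
  jars <- list.files(system.file("java", package = "rscalaKafkaJars"),
                     pattern = ".*\\.jar$", full.names = TRUE, recursive = TRUE)
  # Register this package with rscala, appending those jars to the classpath.
  rscala::.rscalaPackage(pkgname, classpath.appendix = jars)
}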

and that should provide the same functionality. That way, though the repo with the fat-jar would still be rather large, the rscalaKafka repo would not grow.

Caveat: short of "starting over with the repo", I think you'll always have a large repo (it will not shrink). One question/answer on StackOverflow has suggestions for removing large files from one or more commits, and it links to GitHub's "Removing sensitive data from a repository" and to another StackOverflow q/a.
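As a rough illustration of the history-rewriting route, here is a sketch using `git filter-branch` in a throwaway repository. The path `inst/java/fat.jar` and the file contents are made up for the demo and are not taken from this repo. The git documentation itself discourages `filter-branch` and points to `git filter-repo` as the preferred tool, so treat this as a demonstration of the idea rather than a recommendation. On a real repo the rewrite changes commit hashes, so it needs a force-push and fresh clones for everyone.

```shell
#!/bin/sh
set -e

# Throwaway repository so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Two commits that each store a new build of a (fake) fat jar.
mkdir -p inst/java
echo "R code" > client.R
head -c 200000 /dev/urandom > inst/java/fat.jar
git add .
git commit -q -m "first build"
head -c 200000 /dev/urandom > inst/java/fat.jar
git commit -q -am "second build"

# Rewrite every commit reachable from HEAD, dropping the jar from each tree.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  'git rm --cached --ignore-unmatch -q inst/java/fat.jar' -- HEAD

# The old objects stay reachable through the backup refs and the reflog
# until both are cleared, so the repository does not shrink before this.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now

# Count the commits that still touch the jar.
jar_commits=$(git log --oneline -- inst/java/fat.jar | wc -l | tr -d ' ')
echo "commits touching the jar: $jar_commits"
```

The cleanup after the rewrite matters: until the `refs/original/` backup and the reflog entries are gone, `git gc` keeps the old jar blobs and the size on disk stays the same.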