codemeta / codemeta.github.io

CodeMeta web site
https://codemeta.github.io

Multiple downloads of the crosswalk table #29

Open progval opened 4 years ago

progval commented 4 years ago

Hi,

Every time one runs blogdown::build_site(), https://github.com/codemeta/codemeta/raw/master/crosswalk.csv is downloaded 16 times. After several builds in a short time, GitHub rate-limits these requests, which makes the build fail.

Do you know if there is a way to make blogdown cache the crosswalk table across page builds?

cboettig commented 4 years ago

@ProgVal Yes, thanks for the ping. Simple question, many answers:

Thanks so much for all your work and contributions, it's really fantastic!

progval commented 4 years ago
  • For a simple fix, rebuilding the site with serve_site() instead of build_site() will tell blogdown not to re-render the .Rmd files in the content dir, but instead stick with the already-knitted HTML outputs from them.

Excellent!

  • Second, yeah, we could avoid having each of the crosswalk pages do its own download, or we could have R cache those downloads (e.g. by wrapping the download URL in pins::pin() or manually caching a copy), in https://github.com/codemeta/codemeta.github.io/blob/hugo/content/crosswalk/datacite.Rmd#L14

  • Zooming out, the whole design here probably needs an overhaul. As you probably know, we haven't kept up with manually adding a new .Rmd for each new source column in the crosswalk, so those crosswalks really aren't complete any more. Would love your opinion on this. Arguably it is quite useful to have a page with more background on Maven or whatnot, but this Rmd approach clearly doesn't scale well. @mbjones and I were just discussing this in the context of a larger overhaul of the codemeta.github.io website that would strip it down to something more minimal that is easier to maintain and keep current. The site today feels a bit bloated and stale to me, and not all that user-friendly.

You could also use Travis (or any other CI) to automatically build the branch with the HTML (currently master): https://docs.travis-ci.com/user/deployment/pages/ (it won't automatically rebuild on changes of crosswalk.csv, but you could set up a daily rebuild) from just the .Rmd files; and remove .md files from the hugo branch (which might need to be renamed; maybe rename it to master and rename the current master to gh-pages)

This way, humans never have to commit generated code.

Regarding the crosswalk, we could add a single script that generates them all, from a single input file. That would also mostly solve the multiple downloads issue (there'd be only this script and terms.Rmd)
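As a rough illustration of the single-script idea (a sketch only, not the project's actual code): parse the table once, then emit one page body per source column. The sample data and the column names `Property`, `DataCite` and `Maven` are a hypothetical miniature, not the real layout of crosswalk.csv, and the helper names `load_crosswalk` and `crosswalk_pages` are made up for this example.

```python
import csv
import io

# Hypothetical miniature of a crosswalk table: the first column names the
# CodeMeta property, every other column is one source vocabulary.
SAMPLE_CSV = """Property,DataCite,Maven
name,Title,name
identifier,Identifier,
license,Rights,licenses
"""


def load_crosswalk(text):
    """Parse the CSV once and return its rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def crosswalk_pages(rows, key="Property"):
    """Build one Markdown table per source column from the parsed rows."""
    if not rows:
        return {}
    sources = [col for col in rows[0] if col != key]
    pages = {}
    for source in sources:
        lines = [f"| {key} | {source} |", "| --- | --- |"]
        for row in rows:
            if row[source]:  # skip properties this source does not map
                lines.append(f"| {row[key]} | {row[source]} |")
        pages[source] = "\n".join(lines)
    return pages


rows = load_crosswalk(SAMPLE_CSV)
pages = crosswalk_pages(rows)
for source, body in pages.items():
    print(f"## {source}\n{body}\n")
```

Because every page is derived from the same parsed rows, a new source column would get a page without anyone adding a file by hand, and the table is read only once per build.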

  • Related to the last is the fact that codemeta is now really two somewhat separate projects - while we set out primarily to create a crosswalk, we now basically maintain a 'new' standard and set of supporting tools, and rather separately maintain a list of crosswalk tables from other standards (largely without a lot of supporting tools except for some special cases like R, where codemetar crosswalks a lot more terms than are listed in the R crosswalk table anyway).

Even though they don't support many package-manager/language metadata formats, AFAIK Bolognese would accept contributions in that direction.

I also wrote a tool running at Software Heritage that converts several formats to CodeMeta and stores it in our database. Its reach is limited by most languages using a script in lieu of a metadata file, and we don't want to run arbitrary code on our infrastructure (though parsing with regexps seems to work in most cases, I just didn't spend much time on it).

Some ideas on how to proceed with these two pieces (e.g. should we omit or move the crosswalk stuff off of the main codemeta website?) would be helpful.

With Travis auto-building the website, most of this would no longer be a problem. We could also make https://github.com/codemeta/codemeta a git submodule of https://github.com/codemeta/codemeta.github.io and have the build process read the local file, which would spare downloads at build time.
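A minimal sketch of that lookup order, under stated assumptions: prefer a local copy (for instance from a submodule checkout), otherwise download once and reuse the cached file for later page builds. The function name `crosswalk_path`, the directory layout and the stand-in fetcher are all hypothetical; only the URL comes from the report above.

```python
import tempfile
import urllib.request
from pathlib import Path

CROSSWALK_URL = "https://github.com/codemeta/codemeta/raw/master/crosswalk.csv"


def crosswalk_path(local_copy, cache_file, fetch=urllib.request.urlretrieve):
    """Return a path to crosswalk.csv, downloading at most once.

    local_copy: where a submodule checkout would place the file (assumed layout).
    cache_file: where a downloaded copy is kept between page builds.
    """
    local_copy, cache_file = Path(local_copy), Path(cache_file)
    if local_copy.exists():
        return local_copy  # local checkout present: no network needed
    if not cache_file.exists():
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        fetch(CROSSWALK_URL, str(cache_file))  # only the first call downloads
    return cache_file


# Demo with a stand-in fetcher so the sketch runs offline.
calls = []


def fake_fetch(url, filename):
    calls.append(url)
    Path(filename).write_text("Property,DataCite\nname,Title\n")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache" / "crosswalk.csv"
    missing = Path(tmp) / "codemeta" / "crosswalk.csv"
    for _ in range(16):  # sixteen page builds, as in the report above
        path = crosswalk_path(missing, cache, fetch=fake_fetch)
    download_count = len(calls)

print(download_count)
```

With the submodule in place the network branch would never run; without it, repeated page builds share one download instead of sixteen.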

Thanks so much for all your work and contributions, it's really fantastic!

You're welcome :) Thanks for all your work as well!

cboettig commented 4 years ago

You could also use Travis (or any other CI) to automatically build the branch with the HTML (currently master): https://docs.travis-ci.com/user/deployment/pages/ (it won't automatically rebuild on changes of crosswalk.csv, but you could set up a daily rebuild) from just the .Rmd files; and remove .md files from the hugo branch (which might need to be renamed; maybe rename it to master and rename the current master to gh-pages)

Yes, this totally should be done. It would be easiest to do so with the existing GitHub Actions script for blogdown (https://github.com/r-lib/actions/blob/master/examples/blogdown.yaml). This would avoid the extra faffing with credentials that you need to do this in Travis. A PR would be great for this; I'm juggling too many things to do this anytime soon!

Re crosswalk scripts -- yeah, definitely makes sense to automate that more, contributions welcome there too!! Though the crosswalk tables we have lack important metadata about "what" a given column actually is: a link to a homepage, an icon, a title and a description would be a big help.

Re translation, linking more of those tools would be a great addition.

Thanks again!

progval commented 4 years ago

Unfortunately I'm going to be busy with another project too, but I'll keep this issue in mind.