edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Implement dependency monitoring #86

Status: Closed (weatherpattern closed 6 years ago)

weatherpattern commented 6 years ago

Implement services to monitor for package updates, dependency conflicts, security alerts, etc.

Possible services: https://gemnasium.com/ https://libraries.io/ https://github.com/apps/greenkeeper

Should we create separate issues for each repo?

Mr0grog commented 6 years ago

Also https://dependabot.com

Mr0grog commented 6 years ago

Forget Gemnasium—they’re shutting down May 15th.

Mr0grog commented 6 years ago

Dependabot seems to be doing a reasonably good job for DB over the last couple of weeks. It lacks the great security auditing and alerting tools of Gemnasium, but the auto-pull-request workflow is super nice.

The compatibility rating feature is pretty slick, too.

Mr0grog commented 6 years ago

I forgot about this :P

We are using dependabot for -db and -versionista-scraper and it’s been pretty great. (I have it set to do batches of updates once a week, which seems like a pretty good tradeoff between manageability and freshness.)

We don’t have it set up for -ui because major dependencies (e.g. React) are way behind. I should probably set it up anyway.

We don’t have it set up for -processing because we don’t have dependency versions pinned (bad!!) so it wouldn’t do any good. We need to fix that problem there first.

danielballan commented 6 years ago

> We don’t have it set up for -processing because we don’t have dependency versions pinned (bad!!) so it wouldn’t do any good. We need to fix that problem there first.

I have been trying to figure out why scientific Python projects -- even the big, important ones -- virtually never do this. I'm not sure I have an answer yet, but it might simply be the difference between an app and a library. I think of web-monitoring-processing as a library, a set of tools for doing analysis on our data, that is also used by our app. I think the dependency pinning should happen in the deployment layer (Ansible, Kube, whatever we use), not in the requirements.txt of the library itself. In a data analysis context, it's up to the user to manage their environment, and the library shouldn't force the user to use an old version of a dependency unless there is a specific known issue.

Mr0grog commented 6 years ago

> even the big, important ones virtually never do this.

Whaaaaaaaaaaaaat 😱

> it might simply be the difference between an app and a library

I still feel like a library should set ranges. Even a library will only work with versions of a dependency that have a compatible API (in both JS and Ruby, common practice is to set ranges in your dependency list, pin specific versions in your lockfile, and ensure libraries don’t ship lockfiles). But I can see how ranges might be harder to set or less relevant for a lot of Python libraries, where semver isn’t followed as religiously as in Ruby and JS.

Side note: have you looked into pipenv? It seems nice.

> I think of web-monitoring-processing as a library, a set of tools for doing analysis on our data, that is also used by our app.

I’m sure I sound like a broken record at this point, but I still feel like this is too many concerns. I really do feel like these should at least be separate packages that live side-by-side in the repo (or maybe once we do edgi-govdata-archiving/web-monitoring-db#119 the diff server bits can be moved there).

> In a data analysis context, it's up to the user to manage their environment, and the library shouldn't force the user to use an old version of a dependency unless there is a specific known issue.

Absolutely — hence the point above about libraries not having lockfiles and only setting acceptable ranges (which is usually easy with semver: >=[version I wrote this with],<[next major version] and common enough that it has a special symbol in many package managers: ~ or ^).
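A minimal sketch of the range rule described above, assuming plain `MAJOR.MINOR.PATCH` version strings with no pre-release tags. The function names are hypothetical and are not taken from any package manager; this only illustrates the `>=[version I wrote this with],<[next major version]` idea.

```python
def parse_version(text):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def semver_range(written_with):
    """Build '>=written_with,<next major' as a specifier string."""
    major, _minor, _patch = parse_version(written_with)
    return f">={written_with},<{major + 1}.0.0"


def is_allowed(candidate, written_with):
    """Check whether a candidate version falls inside that range."""
    lower = parse_version(written_with)
    upper = (lower[0] + 1, 0, 0)  # first release of the next major version
    return lower <= parse_version(candidate) < upper


print(semver_range("1.4.2"))
print(is_allowed("1.9.0", "1.4.2"))
print(is_allowed("2.0.0", "1.4.2"))
```

A library would declare a range like this, while an app's lockfile would record the single version that was actually installed.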

danielballan commented 6 years ago

> Side note: have you looked into pipenv? It seems nice.

Have read about it, haven't used it. One downside is that it assumes it owns everything, whereas the other solutions play pretty well together, which is convenient for daily use.

> I’m sure I sound like a broken record at this point, but I still feel like this is too many concerns. I really do feel like these should at least be separate packages that live side-by-side in the repo (or maybe once we do edgi-govdata-archiving/web-monitoring-db#119 the diff server bits can be moved there).

:+1: to that. I'm against two-packages-one-repo because of versioning: I like the git tag, setup.py version, and package.__version__ to be in sync, and it's hard to follow semver if you have two different packages sharing a repo. But I agree that the diffing server more naturally belongs in with the API server.

> Absolutely — hence the point above about libraries not having lockfiles and only setting acceptable ranges (which is usually easy with semver: >=[version I wrote this with],<[next major version] and common enough that it has a special symbol in many package managers: ~ or ^).

Yeah, I am not sure why this isn't common practice in scientific Python. A lot of the core packages are pre-1.0. SciPy just tagged 1.0 this year, and pandas and scikit-learn are still pre-1.0, so pinning to <1 isn't enough to guard against API changes, and in practice <0.X would require all the packages to release every time some other major package updated. I get the sense that the web community has significantly faster release cycles. Anyway, all this is just a curiosity. I am on board with pinning dependencies and will self-assign this issue.
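To illustrate the pre-1.0 point above: a sketch of how an upper bound might be chosen when a dependency has not reached 1.0, assuming each 0.X minor release may change the API. The function name is hypothetical and the version strings are assumed to be plain `MAJOR.MINOR.PATCH`.

```python
def compatible_range(written_with):
    """Cap at the next minor for 0.x versions, else at the next major."""
    major, minor, _patch = (int(part) for part in written_with.split("."))
    if major == 0:
        # Pre-1.0: '<1' is too loose, so stop before the next 0.X release.
        upper = f"0.{minor + 1}.0"
    else:
        upper = f"{major + 1}.0.0"
    return f">={written_with},<{upper}"


print(compatible_range("0.22.0"))
print(compatible_range("1.0.0"))
```

The tight `<0.X` cap is what forces a new release of the dependent package each time the dependency ships a minor version, which is the cost described above.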

danielballan commented 6 years ago

> We are using dependabot for -db and -versionista-scraper and it’s been pretty great.

> We don’t have it set up for -processing because we don’t have dependency versions pinned (bad!!) so it wouldn’t do any good. We need to fix that problem there first.

I have pinned dependency versions (https://github.com/edgi-govdata-archiving/web-monitoring-processing/pull/205) and set up dependabot for -processing. Some PRs have started streaming in; it worked.

> We don’t have it set up for -ui because major dependencies (e.g. React) are way behind. I should probably set it up anyway.

This is the last action item before we can close this issue.

Mr0grog commented 6 years ago

Looks like -ui is working great, so I’m closing this! Thanks @danielballan!!!!!!