18F / pulse

How the federal .gov domain space is doing at best practices and policies.

Canadian Digital Service changes #775

Closed buckley-w-david closed 6 years ago

buckley-w-david commented 6 years ago

As per https://github.com/18F/pulse/pull/774, let's discuss some changes.

I'm going to just spew out some of the ideas/plans we've been discussing over here, with the understanding that most/all of it is still in very early stages and not really solidified. The scope-of-work implications of each piece have only been briefly considered, so it may be that some of these are just too much.

  1. The CLI of the data module seems a bit unintuitive and difficult to understand from simply looking at the code, with handling of different options/arguments spread throughout the codebase. We'd like to consolidate it into one central CLI handler section that is explicit in all the arguments it supports and their effect. Something to the effect of the WIP PR https://github.com/cds-snc/pulse/pull/3/files#diff-bb1a6896d9e62542cd89ee8daa479277 in our fork.

  2. Going hand in hand with that is modularizing the code in update.py and processing.py, as right now it's pretty densely packed into run methods that do a lot. In that same WIP PR I've broken things up a bit so that different functionality can be run independently of the others.

  3. No specific plans for it as of right now, but we'd like to eliminate shell_out entirely. A Python-implemented alternative to shelling out to curl (likely using something like the requests module) seems like it would fit the bill, and once domain-scan is turned into a Python package (which I see there is a PR for right now 👍 👍 👍 from us on that idea), it can be listed as a dependency and called directly instead. I think some AWS stuff is also done via shell_out to a command-line tool; I've not done much research, but I would be surprised if there were not an existing Python library for such a use case.

  4. As of right now, there's a decent chunk of Pulse that's not in scope for our work (primarily the "analytics" section); at least in this first pass, we're concerned with HTTPS and SSL. As such, we'll likely be stripping the analytics section (along with the a11y stuff) out entirely.

  5. As you would expect, our needs for domain lists will be different than yours were/are; as such, our equivalent to your meta.yml file will have to be different, along with the options passed along to domain-scan. If we go the pulse-lite approach, it might be worthwhile to look into abstracting some of the URL functionality, to avoid other users of Pulse needing to hard-code their own URLs. This could be helped by more extreme modularization of the existing steps into more pure functions; that way, lists of domains could just be fed into the functions that actually operate on them, instead of those functions fetching the domains from specific locations via things like the meta.yml file or environment variables. Ultimately the domains could still be fetched from a file or whatever is a good choice, just not fetched in the functions that do the operating (see the sketch after this list).

  6. I know this is approaching the realm of very large changes, but the thought had crossed my mind many times that I would love if there didn’t have to be so many intermediate steps of writing data to disk. I would like the database that feeds Pulse to be the only place that data sits at rest. That probably has huge implications for how caching of expensive operations is currently handled.

  7. We’re unlikely to stay with the decision to use TinyDB to feed Pulse; I haven’t put much thought into it, but something like SQLite seems appropriate.

  8. I’m not sure to what extent, but I know the designer on our team is working on some changes to the visual look.
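To make the point in 5 a bit more concrete, here's a rough sketch of that pure-function shape (all names are hypothetical, not actual Pulse code):

```python
from typing import Dict, Iterable, List

def evaluate_https(domain: str) -> Dict:
    # Hypothetical stand-in for the real per-domain processing.
    return {"domain": domain, "https": None}

def process_https(domains: Iterable[str]) -> List[Dict]:
    # Pure with respect to its input: it operates on whatever
    # domains it is handed, and never fetches them itself.
    return [evaluate_https(domain) for domain in domains]

def domains_from_csv(path: str) -> List[str]:
    # One possible source; could just as well be meta.yml or an env var.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# The caller decides where domains come from:
results = process_https(domains_from_csv("domains.csv"))
```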

As to the idea of a shared core, some kind of pulse lite, that sounds like a good idea to me. Do you have any existing ideas in that vein?

Now, after writing all that, I realize that there is a lot. Don’t feel like you need to respond specifically to all of it; this is mostly to inform you as to what we’re thinking and to feed into the discussion of a shared base.

buckley-w-david commented 6 years ago

@evadb is also a dev with CDS working on this project, just so you're aware.

evadb commented 6 years ago

Hello friends o/

konklone commented 6 years ago

Hello friends o/

Hello!

  1. The CLI of the data module seems a bit unintuitive and difficult to understand from simply looking at the code, with handling of different options/arguments spread throughout the codebase. We'd like to consolidate it into one central CLI handler section that is explicit in all the arguments it supports and their effect. Something to the effect of the WIP PR https://github.com/cds-snc/pulse/pull/3/files#diff-bb1a6896d9e62542cd89ee8daa479277 in our fork.

That would be very useful. It has evolved a bit haphazardly, and is also very tough to debug against. For example, testing out one piece of the pipeline while mocking out the prior parts leading up to it, without having to wait minutes/hours for things to complete, is difficult (and is what motivated some of the haphazard flags, to try to cut through that).
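For illustration, the kind of one-place, step-per-subcommand CLI this could become; a sketch using click, with command and option names invented rather than taken from your WIP PR:

```python
import click

@click.group()
def cli():
    """Every supported argument is declared here, in one place."""

@cli.command()
@click.option("--scanner", multiple=True, help="Scanners to run.")
@click.option("--output", default="./data", help="Where raw results land.")
def scan(scanner, output):
    """Run domain-scan and store raw results."""
    click.echo(f"scanning with {scanner} into {output}")

@cli.command()
@click.option("--input", "input_dir", default="./data", help="Raw results to process.")
def process(input_dir):
    """Process previously scanned results, without re-scanning."""
    click.echo(f"processing {input_dir}")

if __name__ == "__main__":
    cli()
```

Each step being its own subcommand would also make it much easier to test one piece with the prior parts mocked out.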

  2. Going hand in hand with that is modularizing the code in update.py and processing.py, as right now it's pretty densely packed into run methods that do a lot. In that same WIP PR I've broken things up a bit so that different functionality can be run independently of the others.

That makes sense. And this backend data loading ETL part of Pulse is really quite separate from the front-end. You could theoretically separate the web front-end and data loading back-end relatively cleanly (which might be a useful mental frame for a "pulse lite" structure).

  3. No specific plans for it as of right now, but we'd like to eliminate shell_out entirely. A Python-implemented alternative to shelling out to curl (likely using something like the requests module)

:+1: :+1:
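To spell out the swap, roughly (URL handling simplified, function names hypothetical):

```python
import subprocess

import requests

# Before: shelling out to curl and hoping for the best.
def fetch_with_curl(url):
    return subprocess.check_output(["curl", "--silent", url]).decode("utf-8")

# After: a plain requests call, with a timeout and real error handling.
def fetch(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```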

...and once domain-scan is turned into a Python package (which I see there is a PR for right now 👍 👍 👍 from us on that idea), it can be listed as a dependency and called directly instead.

:+1: :+1:

I think some AWS stuff is also done via shell_out to a command-line tool; I've not done much research, but I would be surprised if there were not an existing Python library for such a use case.

Yeah, this could be boto.
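If it's the aws CLI doing S3 copies, the boto3 equivalent is roughly this (bucket and key names hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Roughly what shell_out("aws s3 cp ...") would become:
def upload_results(local_path, bucket, key):
    s3.upload_file(local_path, bucket, key)

def download_results(bucket, key, local_path):
    s3.download_file(bucket, key, local_path)
```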

  4. As of right now, there's a decent chunk of Pulse that's not in scope for our work (primarily the "analytics" section); at least in this first pass, we're concerned with HTTPS and SSL. As such, we'll likely be stripping the analytics section (along with the a11y stuff) out entirely.

An ideal outcome here would be a data/processing.py that didn't have to carve out specific code for each type of thing being monitored. There is a bit of super-core topic-specific code (the threshold-setting for the HTTPS/HSTS/TLS fields) that could be moved elsewhere, and ideally data/processing.py (or whatever it's broken into) would become agnostic to the subject of the data it's processing.
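To sketch what "agnostic" could mean here (all names invented): topic-specific logic gets registered from outside, and the core loop never mentions HTTPS:

```python
PROCESSORS = {}

def register(topic):
    def wrap(fn):
        PROCESSORS[topic] = fn
        return fn
    return wrap

@register("https")
def process_https(row):
    # Threshold-setting for the HTTPS/HSTS/TLS fields lives here,
    # not in the core loop.
    return {"compliant": row.get("Uses HTTPS") == "True"}

def process_all(rows, topics):
    # The core loop is generic over whatever topics an instance enables.
    return [{topic: PROCESSORS[topic](row) for topic in topics} for row in rows]
```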

  5. As you would expect, our needs for domain lists will be different than yours were/are; as such, our equivalent to your meta.yml file will have to be different, along with the options passed along to domain-scan. If we go the pulse-lite approach, it might be worthwhile to look into abstracting some of the URL functionality, to avoid other users of Pulse needing to hard-code their own URLs. This could be helped by more extreme modularization of the existing steps into more pure functions; that way, lists of domains could just be fed into the functions instead of the functions fetching the domains from specific locations via things like the meta.yml file or environment variables.

Yes. One way to start might be just replacing meta.yml in the repo with meta.yml.example and expecting that the meta.yml for each instance will be unversioned. But it's likely better for PaaS-friendly hosting (e.g. Cloud Foundry) to move those out to environment variables instead.
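The fallback order could be as simple as this sketch (config keys hypothetical):

```python
import os

import yaml

def load_config(path="meta.yml"):
    # Unversioned meta.yml for local checkouts (meta.yml.example in the repo);
    # environment variables win, so a PaaS deploy needs no file at all.
    config = {}
    if os.path.exists(path):
        with open(path) as f:
            config = yaml.safe_load(f) or {}
    for key in ("DOMAINS_URL", "SCAN_OPTIONS"):
        if key in os.environ:
            config[key.lower()] = os.environ[key]
    return config
```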

  6. I know this is approaching the realm of very large changes, but the thought had crossed my mind many times that I would love if there didn’t have to be so many intermediate steps of writing data to disk. I would like the database that feeds Pulse to be the only place that data sits at rest. That probably has huge implications for how caching of expensive operations is currently handled.

Do you mean eliminating the on-disk output of domain-scan as well? Having the files be somewhere is very helpful for debugging. I reference the S3 buckets we store raw results in all the time:

https://s3-us-gov-west-1.amazonaws.com/cg-4adefb86-dadb-4ecf-be3e-f1c7b4f6d084/live/subdomains/scan/cache/pshtt/dap.digitalgov.gov.json

But one area where domain-scan would benefit is a small, opinionated option to point its --output flag at an S3 bucket instead of local disk, which (though it would be somewhat slower) could let domain-scan run without needing any local disk storage.
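One backwards-compatible shape for that option would be dispatching on an s3:// prefix; a sketch, not the actual proposal:

```python
import os
from urllib.parse import urlparse

import boto3

def write_output(output, relative_path, body):
    # --output ./results          -> local disk, as today
    # --output s3://bucket/prefix -> straight to S3, no local storage needed
    if output.startswith("s3://"):
        parsed = urlparse(output)
        prefix = parsed.path.lstrip("/")
        key = "/".join(p for p in (prefix, relative_path) if p)
        boto3.client("s3").put_object(
            Bucket=parsed.netloc, Key=key, Body=body.encode("utf-8")
        )
    else:
        with open(os.path.join(output, relative_path), "w") as f:
            f.write(body)
```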

  7. We’re unlikely to stay with the decision to use TinyDB to feed Pulse; I haven’t put much thought into it, but something like SQLite seems appropriate.

I regret choosing TinyDB - it's slow and more cumbersome than I expected. If I could go back, I'd pick SQLite or some other small thing.

I will say that having sub-objects inside documents/rows is very helpful, and potentially extra useful for a "pulse lite", where topic-specific fields don't need to be carved out in a database migration. That was part of the original motivation in choosing a schema-less database. I believe Postgres also supports things like that these days, but that may be heavier infrastructure than either of us wants.
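SQLite can keep part of that document feel by storing the topic-specific sub-objects as a JSON text column; a rough sketch:

```python
import json
import sqlite3

conn = sqlite3.connect("pulse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS domains (domain TEXT PRIMARY KEY, data TEXT)"
)

def save(domain, data):
    # Topic-specific sub-objects live inside the JSON blob, so adding
    # a new topic never requires a schema migration.
    conn.execute(
        "INSERT OR REPLACE INTO domains VALUES (?, ?)",
        (domain, json.dumps(data)),
    )
    conn.commit()

def load(domain):
    row = conn.execute(
        "SELECT data FROM domains WHERE domain = ?", (domain,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```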

  8. I’m not sure to what extent, but I know the designer on our team is working on some changes to the visual look.

I would generally expect our visual presentations to diverge somewhat, at the bare minimum for branding purposes. What would be nice to figure out upfront is if we can at least keep the innards as a shared upstream.

As to the idea of a shared core, some kind of pulse lite, that sounds like a good idea to me. Do you have any existing ideas in that vein?

Two possible places to split the app come to mind:

  1. Split the back-end (starting from "domain-scan output" and ending in "database loaded") from the front-end (the entire website that uses the database). So basically the DB would be the intermediary, and both front-end and back-end would have to agree on the DB schema and coordinate changes. This gives maximum flexibility for customizing the web front-end, and makes the core conceptually simpler and smaller (with a clear CLI-only interface). However, it forgoes sharing any front-end code, and creates a potentially awkward junction point in a shared DB schema/tool.

  2. The Pulse core contains both front-end and back-end components, but the front-end is customized per-instance on which topics are being measured, and allows substantial display customizations per-instance. There's probably some idea of "white labeling" here that adds some complexity to the front-end system and potentially restricts overall layout/theme customization, with the benefit of sharing as much code as possible, and limiting the divergence principally to front-end rendering code and some other HTML/CSS.

  3. There are more ways of doing it! Other ideas?

I think any of these likely means un-hardcoding the topics being measured and having those be generatable from separately managed configuration. So there'd be some concept of an "HTTPS" topic that specifies what fields are expected from where, how to translate those input fields into display fields, and how to render those display fields effectively. This would strip HTTPS-specific code from the data module and put it somewhere both the back-end and the front-end could pull from.
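Concretely (and hypothetically; none of these field names exist yet), a topic definition might be no more than a declarative blob both halves read:

```python
# Hypothetical topic definition, readable by both back-end and front-end.
HTTPS_TOPIC = {
    "name": "https",
    # Which scanner's output the back-end should expect, and which fields.
    "inputs": {"pshtt": ["Uses HTTPS", "HTTPS Strict Transport Security"]},
    # How raw input fields translate into display fields.
    "derive": {
        "compliant": lambda row: row["Uses HTTPS"] == "True",
    },
    # What the front-end renders, and in what order.
    "display_fields": ["compliant"],
}
```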

That's a lot on my end too. :) I'm all for just taking this in easy steps, but if there's any particularly compelling way to find a core we can all use, I'm all ears and ready to support.

buckley-w-david commented 6 years ago

Man, this should be like 5 separate conversations in different places to allow for responses that don't take 20+ minutes to write.

I'll be very brief; it's likely each of these sections can (should) be its own conversation.


CLI: How dependent on that exact interface are you guys? There are really three problems I have with it.

  1. Everything is highly coupled; it's difficult to do one thing without having to do others as well.
  2. Handling of arguments is dispersed throughout the codebase.
  3. It's not following standard Unix-style arguments. This one is probably less important, but I really like the classic command --option value --other_option other_value, whereas you guys have command --option=value --other_option=other_value.

The changes I'd want to do to fix those (mostly the third one) would break the interface.


🎉 Down with shell_out 🎉


👍 Generalizing processing.py - The model that domain-scan has for run-time-configurable scanners and gatherers is a classic; it could likely be mapped fairly easily onto "processors".
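Concretely, that could mirror domain-scan's run-time module lookup; a sketch using importlib, with the module layout invented:

```python
import importlib

def load_processor(name):
    # e.g. load_processor("https") -> processors/https.py,
    # the same way domain-scan resolves scanners and gatherers by name.
    return importlib.import_module("processors." + name)

for topic in ("https", "certificates"):
    processor = load_processor(topic)
    processor.run()  # each processor module exposes a common entry point
```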


I will say that having sub-objects inside documents/rows is very helpful,...

I do like me some document databases. I've only used Mongo, but I agree that they are very handy.


Do you mean eliminating the on-disk output of domain-scan as well?

Yeah, I'll shelve my feelings on it for now, but I feel like there's got to be a better way (some logging type stuff to handle debugging or something). Small steps first.


As a gut reaction I like completely splitting the backend ETL from the frontend, but I can see the downside of repeating ourselves a bit.

konklone commented 6 years ago
  3. It's not following standard Unix-style arguments. This one is probably less important, but I really like the classic command --option value --other_option other_value, whereas you guys have command --option=value --other_option=other_value. The changes I'd want to do to fix those (mostly the third one) would break the interface.

Standard Unix-style arguments would be preferable, and we can break the interface to do it.

Generalizing processing.py - The model that domain-scan has for run-time-configurable scanners and gatherers is a classic; it could likely be mapped fairly easily onto "processors".

That's the ticket.

Do you mean eliminating the on-disk output of domain-scan as well?

Yeah, I'll shelve my feelings on it for now, but I feel like there's got to be a better way (some logging type stuff to handle debugging or something). Small steps first.

Would extending domain-scan to support output directly to S3 instead of local disk address this? Because there's a very clear argument to do that, as filed in https://github.com/18F/domain-scan/issues/197.

I'm good breaking these into separate conversations, feel free to close this and we can advance any of them individually!

buckley-w-david commented 6 years ago

Sounds good, I will close this issue and likely open new ones for individual topics soon (sometime today).