☂ Pull Versions from IA for diffing

edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")

Creative Commons Attribution Share Alike 4.0 International

105 stars 17 forks source link

☂ Pull Versions from IA for diffing #23

Closed danielballan closed 5 years ago

danielballan commented 7 years ago

Useful links:

Mr0grog commented 7 years ago

Adding some steps here to make this more concrete:

[x] Send list of URLs to IA so they can start scraping more regularly
[x] Finalize code for ETL-ing IA data into web-monitoring-db
- [x] Test basic import algorithm
- [x] Make sure we are getting all the metadata we want (or at least an appropriate amount to start with)
- [x] Write code to do the loading for our list of URLs (edgi-govdata-archiving/web-monitoring-processing#86)
[x] Get this into production on an AWS or GCloud VM, running ~1/day, similar to Versionista
[x] Add health checks on Internet Archive (edgi-govdata-archiving/web-monitoring-processing#125)
[x] Finish UI diffs (edgi-govdata-archiving/web-monitoring-ui#121) - Note: this is done, but is in real need of ongoing improvement.
[x] Generate analyst CSVs from this data (edgi-govdata-archiving/web-monitoring-versionista-scraper#41)

See also edgi-govdata-archiving/web-monitoring-processing#3, which needs to be solved (at least in its simplest form) by the second checkbox above.

Updated 2017-09-25 to add UI and scraper changes.

Mr0grog commented 7 years ago

Assigning to me for now, if you are back in action before I get too far, @danielballan, feel free to steal this from me.

Mr0grog commented 7 years ago

@janakrajchadha it seems like you've moved on to analysis concerns, but let me know if I'm stepping on your toes here.

janakrajchadha commented 7 years ago

@Mr0grog Redirect me to the current deployment process? How are the versions pulled from versionista? I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?

Mr0grog commented 7 years ago

Redirect me to the current deployment process?

I don’t know what the current process is! If you aren’t familiar, either, we’ll have to reverse-engineer and document from the existing setup on Google Cloud.

How are the versions pulled from versionista?

That is done by an entirely different tool: web-monitoring-versionista-scraper. Deployment of it is documented here: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/deployment.md

I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?

No, I have not had a chance to think about this any more at all.

janakrajchadha commented 7 years ago

Let's think of ways to do it and have a call to discuss it? I'll be happy to take this up alone if you're already too occupied with other tasks. Perhaps I can share my ideas with you and then start working after an initial feedback from you. How does that sound @Mr0grog ?

Mr0grog commented 7 years ago

@janakrajchadha Unless you are already working ETL for Internet Archive, you should keep focused on what you’re working on.

On the subject of deployment, we should keep that as an entirely separate issue from this. (Edit: see https://github.com/edgi-govdata-archiving/web-monitoring-processing/issues/71)

janakrajchadha commented 7 years ago

I'm not sure but I think I should probably work on the ETL issue only after the completion of GSoC. Would September be too late to work on it?

Mr0grog commented 7 years ago

Yeah, I am going to try and get the majority of it done this week; don't worry about it.

janakrajchadha commented 7 years ago

I don’t know what the current process is! If you aren’t familiar, either, we’ll have to reverse-engineer and document from the existing setup on Google Cloud.

I think this is absolutely essential as we plan to add more things to the server soon and that needs to be reflected correctly on the cloud deployment.

I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?

No, I have not had a chance to think about this any more at all.

After looking at how the versionista scraper works, I think there may be a way to incorporate the processing layer in the cron script, although it may not be as easy as it sounds here. If we're looking for processing every time new versions are added to the db, I think this is the place and the way it would fit in. Is that right @Mr0grog?

Mr0grog commented 7 years ago

If we're looking for processing every time new versions are added to the db, I think this is the place and the way it would fit in.

I think that’s well out of scope for this issue. Let’s move that discussion to #62.

Mr0grog commented 7 years ago

Updates: this is significantly farther along now (see checklist). We still need to only import pages that we want from IA, which depends on edgi-govdata-archiving/web-monitoring-db#44 and possibly edgi-govdata-archiving/web-monitoring-db#128 as well (or instead).

On the processing side, this is tracked by edgi-govdata-archiving/web-monitoring-processing#86.

After that, I think it’s just about getting this into production alongside web-monitoring-versionista-scraper.

Mr0grog commented 6 years ago

Updates:

All the DB stuff necessary is done
Processing work is in PR edgi-govdata-archiving/web-monitoring-processing#174. This needs revision based on recent discussions with IA folks, but I am aiming to be done by 2018-09-12. Ideally even deployed by then, but I’m not going to make bets on that :P
After that, we need to generate more specific analyst sheets (these all come from our DB, but filter their inputs by (source_type):
- Current Versionista-based sheets
- New IA-based sheets
- Combined sheets

Mr0grog commented 5 years ago

The final version of the relevant code is up for review in edgi-govdata-archiving/web-monitoring-processing#174.

Mr0grog commented 5 years ago

edgi-govdata-archiving/web-monitoring-processing#174 has been merged! 🎉

Closing this.