Closed — danielballan closed this issue 5 years ago.
Adding some steps here to make this more concrete:
web-monitoring-db
See also edgi-govdata-archiving/web-monitoring-processing#3, which needs to be solved (at least in its simplest form) by the second checkbox above.
Updated 2017-09-25 to add UI and scraper changes.
Assigning this to me for now; if you are back in action before I get too far, @danielballan, feel free to steal it from me.
@janakrajchadha it seems like you've moved on to analysis concerns, but let me know if I'm stepping on your toes here.
@Mr0grog Redirect me to the current deployment process? How are the versions pulled from versionista? I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?
> Redirect me to the current deployment process?
I don’t know what the current process is! If you aren’t familiar, either, we’ll have to reverse-engineer and document from the existing setup on Google Cloud.
> How are the versions pulled from versionista?
That is done by an entirely different tool: web-monitoring-versionista-scraper. Its deployment is documented here: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/deployment.md
> I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?
No, I have not had a chance to think about this any more at all.
Shall we think of ways to do it and have a call to discuss? I'm happy to take this up alone if you're already occupied with other tasks. Perhaps I can share my ideas with you and start working after initial feedback. How does that sound, @Mr0grog?
@janakrajchadha Unless you are already working on ETL for the Internet Archive, you should stay focused on what you’re working on.
On the subject of deployment, we should keep that as an entirely separate issue from this. (Edit: see https://github.com/edgi-govdata-archiving/web-monitoring-processing/issues/71)
I'm not sure, but I think I should probably work on the ETL issue only after GSoC is complete. Would September be too late to start on it?
Yeah, I am going to try and get the majority of it done this week; don't worry about it.
> I don’t know what the current process is! If you aren’t familiar, either, we’ll have to reverse-engineer and document from the existing setup on Google Cloud.
I think this is absolutely essential, as we plan to add more things to the server soon, and that needs to be reflected correctly in the cloud deployment.
> I'm still thinking about including the processing layer in the deployment. Do you have any ideas in mind?

> No, I have not had a chance to think about this any more at all.
After looking at how the Versionista scraper works, I think there may be a way to incorporate the processing layer into the cron script, although it may not be as easy as it sounds. If we're looking for processing every time new versions are added to the db, I think this is the place and the way it would fit in. Is that right, @Mr0grog?
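To make the idea concrete, a processing pass could be chained after the scraper run in the same cron entry. Everything below is a hypothetical sketch — the paths and script names are illustrative, not the project's actual deployment:

```
# Hypothetical crontab entry: run the Versionista scraper hourly, then
# run a processing pass over whatever it just imported. The second step
# only runs if the scraper exits successfully.
0 * * * * /opt/web-monitoring/run-scraper.sh && /opt/web-monitoring/run-processing.sh
```

The `&&` chaining keeps the two steps loosely coupled: the processing script never sees a partially failed scrape, and either step can later be split into its own schedule.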
> If we're looking for processing every time new versions are added to the db, I think this is the place and the way it would fit in.
I think that’s well out of scope for this issue. Let’s move that discussion to #62.
Updates: this is significantly farther along now (see checklist). We still need to import only the pages we want from IA, which depends on edgi-govdata-archiving/web-monitoring-db#44 and possibly edgi-govdata-archiving/web-monitoring-db#128 as well (or instead).
On the processing side, this is tracked by edgi-govdata-archiving/web-monitoring-processing#86.
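For context, listing candidate snapshots from IA generally goes through the public Wayback Machine CDX API. Here is a minimal sketch of query construction and row parsing — the function names and the status-code filter are my own illustration, not the project's actual importer:

```python
# Sketch: build a Wayback Machine CDX API query and parse its JSON rows.
# The CDX endpoint and field layout are the public API; everything else
# (function names, the statuscode filter) is illustrative.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, from_date, to_date):
    """Build a CDX query for snapshots of page_url in a date range (YYYYMMDD)."""
    params = {
        "url": page_url,
        "from": from_date,
        "to": to_date,
        "output": "json",
        "filter": "statuscode:200",  # only keep successful captures
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_rows(rows):
    """CDX JSON output is a header row followed by data rows; zip them into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Filtering which *pages* to import (the open question in the comment above) would then be a matter of deciding which URLs ever get passed to `cdx_query_url` in the first place.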
After that, I think it’s just about getting this into production alongside web-monitoring-versionista-scraper.
Updates (regarding `source_type`):
The final version of the relevant code is up for review in edgi-govdata-archiving/web-monitoring-processing#174.
edgi-govdata-archiving/web-monitoring-processing#174 has been merged! 🎉
Closing this.
Useful links: