Closed FredericoCoelhoNunes closed 3 years ago
This has been merged. Can it be closed @hcastilho, @buedaswag @FredericoCoelhoNunes ?
Any idea on how to address "Find a way to persist this process across batches"?
Hi @cimendes, this can be closed as far as I'm concerned. I think the dev-ops team already has a plan for that last point. Let's see what they say!
Description
The Data Sources BLU is about extracting information from databases/APIs and scraping websites. Due to its nature, it contains a lot of tasks related to provisioning infrastructure and hosting services.
This issue contains a checklist of infrastructure that needs to be copied over from last year, as well as some new things I would like to try to reduce the volatility of this BLU. The main idea is to not have to rely on any external website/API and host everything ourselves, since there have been some issues with public webpages/APIs not working/changing layouts.
Checklist
Details
MoviesDb
This is the standard database we have used in the learning notebook for a few batches. Purely illustrative, no reason to change it. Last year's credentials were the following:
FifaDb
This is the database we have been using for exercises in the last couple of batches. I find that there's enough data there that we can re-use it indefinitely, just changing the questions, which is much simpler than having to find or create a new database every batch (and converting it to Postgres and SQLite). Here are last year's credentials:
Learning API
For the learning API we have been using https://punkapi.com/ , which has proved to be quite stable. However, after last year's problem with the exercise API (it stopped working 2 days before the delivery date), I think it would be a good idea to replace this with our own version.
I still don't have anything ready and since this is the most stable API, I will leave it for last.
Pokemon API
For the exercises' API we were using a Magic: The Gathering API. It stopped working very close to the exercises' delivery date, which forced us to make some last-minute changes to the exercises and the process. To keep this from happening again, I developed a very simple Pokemon API, wrote the specification in Swagger, and exported it along with a Flask code template. It should be basically ready to host; it might require at most one iteration to improve the documentation (if QA thinks it's necessary) and fix some issues.
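To give a rough idea of how small such a self-hosted API can be, here is a minimal sketch. It is not the Swagger-generated Flask template mentioned above: it uses only the Python standard library so it runs without extra dependencies, and the routes (`/pokemon`, `/pokemon/<id>`), the field names, and the two sample entries are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory data; the real API would load its own dataset.
POKEMON = {
    1: {"id": 1, "name": "bulbasaur", "types": ["grass", "poison"]},
    4: {"id": 4, "name": "charmander", "types": ["fire"]},
}


def lookup(path):
    """Map a request path to (status_code, payload). The routes are assumptions."""
    parts = [p for p in path.split("/") if p]
    if parts == ["pokemon"]:
        # List every entry, ordered by id.
        return 200, sorted(POKEMON.values(), key=lambda p: p["id"])
    if len(parts) == 2 and parts[0] == "pokemon" and parts[1].isdecimal():
        entry = POKEMON.get(int(parts[1]))
        if entry is not None:
            return 200, entry
    return 404, {"error": "not found"}


class PokemonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delegate routing to lookup() and serialise the result as JSON.
        status, payload = lookup(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=0):
    """Build a server bound to localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), PokemonHandler)
```

Calling `make_server(8000).serve_forever()` would serve it locally. Keeping the routing in a plain function (`lookup`) means the exercise answers can be checked without starting a server at all.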
Things we have to decide are:
LDSA IMDB Website
For the learning notebook, we have been scraping IMDB to find some "missing data" on the MoviesDb that we had used previously. This is pretty good because it gives a sense of cohesion to the BLU, and it's fun because the students actually get to scrape a real website rather than a local HTML file.
However, the design of the IMDB website changes every year, and it is often designed in weird ways that are complicated to scrape and to explain (last year was even worse than this year). There is also the risk that the website changes during the learning period, and it's quite a bit of work to rediscover how to scrape the website, take all the necessary screenshots, re-write the learning notebook, etc.
I propose we create our own simplified version of the IMDB page. It doesn't have to actually look like IMDB, but it would be nice if it didn't look totally awful (it should have at least a few images). I haven't had the chance to develop it yet; I will let you know when I do.
Things to decide:
Exercises Website - Bork Pawson's page
Same as above. We have been scraping quite a nice website (basically a fake book repository whose only purpose is to be scraped) but I feel like we shouldn't rely on it, and should try creating our own thing. It can be quite simple as there are only 2 exercises that use it.
Developed a very simple page for scraping.
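As a rough illustration of why a self-hosted page is easier to teach with, here is a sketch of scraping a page whose markup we control. The HTML below is made up for the example (the real page's tags, class names, and content may differ), and it uses the standard library's `html.parser` rather than whatever scraping library the notebooks actually use.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real page's structure and content may differ.
SAMPLE_PAGE = """
<html><body>
  <h1>Bork Pawson</h1>
  <ul>
    <li class="fact">Breed: Corgi</li>
    <li class="fact">Age: 4</li>
    <li class="note">Not a fact</li>
  </ul>
</body></html>
"""


class FactParser(HTMLParser):
    """Collect the text of every <li class="fact"> element."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._in_fact = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and dict(attrs).get("class") == "fact":
            self._in_fact = True
            self.facts.append("")

    def handle_data(self, data):
        if self._in_fact:
            self.facts[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_fact = False


def scrape_facts(html):
    """Return the stripped text of each fact item found in the HTML string."""
    parser = FactParser()
    parser.feed(html)
    parser.close()
    return [fact.strip() for fact in parser.facts]
```

Because the class names are ours to choose, the selectors used in the exercises stay valid from one batch to the next, which is the main point of hosting the page ourselves.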
Find a way to persist this process across batches
Finally, I think we should try to find a way to somehow "attach" the resulting infrastructure/credentials to the BLU itself, or to its release procedure, so that next batch this whole process is at least semi-automated. This will be particularly helpful if the instructor changes.
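One possible way to "attach" the infrastructure to the BLU, sketched under assumptions: keep a small manifest of the resources the BLU depends on, with credentials read from environment variables rather than stored in the repository, and run a check as part of the release procedure. The `Resource` class, the `missing_settings` helper, and every variable name below are hypothetical placeholders, not the dev-ops team's actual plan.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str        # e.g. "database", "api", "website"
    env_vars: tuple  # names of the environment variables holding host/credentials


# Hypothetical manifest; the variable names are placeholders, not the real ones.
MANIFEST = [
    Resource("MoviesDb", "database",
             ("MOVIESDB_HOST", "MOVIESDB_USER", "MOVIESDB_PASSWORD")),
    Resource("FifaDb", "database",
             ("FIFADB_HOST", "FIFADB_USER", "FIFADB_PASSWORD")),
    Resource("Pokemon API", "api", ("POKEMON_API_URL",)),
    Resource("LDSA IMDB Website", "website", ("IMDB_SITE_URL",)),
    Resource("Exercises Website", "website", ("EXERCISES_SITE_URL",)),
]


def missing_settings(manifest, env=None):
    """Return {resource name: [unset variable names]} for incomplete resources."""
    env = os.environ if env is None else env
    report = {}
    for res in manifest:
        # A variable counts as missing if it is absent or empty.
        missing = [var for var in res.env_vars if not env.get(var)]
        if missing:
            report[res.name] = missing
    return report
```

A release step could call `missing_settings(MANIFEST)` and refuse to proceed while the report is non-empty, so a new instructor sees exactly which pieces still need provisioning.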
Sorry for bringing so much work! But I feel like it will save us a lot of time in the long run :) Thanks!