Rostlab / JS16_ProjectA

In this project we will lay the foundations for our system by integrating data from multiple sources into a central database. The database will serve the apps and the visualization tool that will be developed in other projects.
GNU General Public License v3.0

Database filler & wiki scraper / Wiki too slow / Inconsistent data / #36

Closed kordianbruck closed 8 years ago

kordianbruck commented 8 years ago

@sacdallago @gyachdav

We are currently trying to scrape data off the wiki into our database but it is awfully slow and really not a practicable method. What is the status of the database dump you promised?

In addition many data fields are not consistent across individual pages - any idea how to approach this?

Issues #5, #2 and #20 depend on a fix for this.

Legenzoo commented 8 years ago

Yeah, we, group2, agree.

gyachdav commented 8 years ago

mysql dump can be fetched from https://www.dropbox.com/s/uomk7vsl94fc3b4/WoIaFDB20160228.sql.gz?dl=0

gyachdav commented 8 years ago

Did you manage to install the db? Are you done populating your db?

sacdallago commented 8 years ago

Just adding to Guy's comments: we are trying really hard to also get you the data on an instance that you can access, but it's much harder than you would think. And it's not like we can't set up a VM in 2 minutes, it's all the administrative crap. But we'll keep you updated. For now, get a Docker image of MySQL, run a local instance, copy the dump in and try to get the data from there. Sorry guys!

Legenzoo commented 8 years ago

Well, we managed to scrape the wiki pages and to fill the database with houses (without references to other entities).

@kordianbruck: Please give me the credentials for the db on your server, or run a GET on api/dbFiller/houses to fill the db that you set up online for us. A DELETE on api/dbFiller/houses clears the collection.

We really need the help of the others on this task! This is no task for only two guys.

In particular, I have already implemented a lot (all stores and controllers for the API calls, the API doc comments, work on the models, ...). Doing all the scraper and population work with only @theocheslerean on top of that is really an overload for me.

kordianbruck commented 8 years ago

@Adiolis I've just updated the repo, docs and ran the endpoint.

https://got-api.bruck.me/api/houses/ already returns a good amount of them.

I'm still unsure if this should be a public API at all. If people can come at random and delete the collection or rerun the import, then it might come close to a DDoS of my server. For now I've disabled access to the routes from the web.
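A minimal sketch of how the filler routes could be switched off through configuration. It assumes an Express-style middleware signature `(req, res, next)`, which the thread does not confirm, and the names `guardFillerRoutes` and `fillerRoutesEnabled` are hypothetical:

```javascript
// Hypothetical guard for the dbFiller routes; not taken from the project.
// Assumes Express-style middleware: (req, res, next).
function guardFillerRoutes(config) {
  return function (req, res, next) {
    const isFillerRoute = req.path.startsWith("/api/dbFiller/");
    if (isFillerRoute && !config.fillerRoutesEnabled) {
      // Refuse the request instead of running or clearing an import.
      res.status(403).json({ error: "dbFiller routes are disabled" });
      return;
    }
    next(); // every other route passes through unchanged
  };
}
```

With a flag like this, the import can still be triggered on a local instance by enabling the flag there, while the public server keeps it off.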

Legenzoo commented 8 years ago

I have now introduced caching of the wiki scraping results. Cache file: wikiData/houses.json

The code is now way smoother and faster.

sacdallago commented 8 years ago

I'm very impressed :smile: I haven't had time to look at how you implemented it, but if you haven't done this: make sure that you put a TTL on the data, so that if it ever gets updated, at some stage a request will refresh it!
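One way the TTL idea could look, as a rough sketch only: the cache file's modification time is compared against a maximum age, and the wiki is scraped again once the file is too old. The one-day TTL, the function names and `scrapeFn` are assumptions, not the project's actual code:

```javascript
const fs = require("fs");

// Assumed TTL of one day; the thread does not specify a value.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

// Returns true when the cache file exists and is younger than ttlMs.
function isCacheFresh(cacheFile, ttlMs, now = Date.now()) {
  try {
    const { mtimeMs } = fs.statSync(cacheFile);
    return now - mtimeMs < ttlMs;
  } catch (err) {
    return false; // a missing or unreadable file counts as stale
  }
}

// scrapeFn is a hypothetical async function that scrapes the wiki
// and resolves to JSON-serializable data.
async function loadWithCache(cacheFile, scrapeFn, ttlMs = CACHE_TTL_MS) {
  if (isCacheFresh(cacheFile, ttlMs)) {
    return JSON.parse(fs.readFileSync(cacheFile, "utf8"));
  }
  const data = await scrapeFn();
  fs.writeFileSync(cacheFile, JSON.stringify(data));
  return data;
}
```

Called as `loadWithCache("wikiData/houses.json", scrapeHouses)`, where `scrapeHouses` stands in for the real scraper, this would only hit the wiki again after the cache file has expired.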

Legenzoo commented 8 years ago

@sacdallago : I will implement it =)

sacdallago commented 8 years ago

Niceee!!!!! :) :+1: @Adiolis

Legenzoo commented 8 years ago

Regions, characters, episodes and houses fillers are implemented =D

@kordianbruck Please run the fillers on your server ;) (For the routes, please look into routes.js.) Characters will take really long (> 10 min) because there are > 2400 of them. Also check the new cfg property ;)

Further properties still have to be implemented, such as all references to other entities and some that the scraper is not yet handling.

sacdallago commented 8 years ago

Btw #38 just for the sake of it. And I'm gonna un-assign me, otherwise I get crazyyyyy :D :D

kordianbruck commented 8 years ago

Episodes gives me the following error:

Problem:ValidationError: CastError: Cast to Date failed for value "June 15th, 2014" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "April 19th, 2015" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "May 4th, 2015" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "April 12th, 2015" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "April 26th, 2015" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "May 10th, 2015" at path "airDate"
Problem:ValidationError: CastError: Cast to Date failed for value "May 17th, 2015" at path "airDate"

kordianbruck commented 8 years ago

Never mind, the characters worked now after the third try. All imports should be done now. The error above only affected a few episodes.

sacdallago commented 8 years ago

@kordianbruck it is always a good idea to put the parsing/casting of dates in a try/catch, as no one follows standards, ever :wink: It is worse to have the server crash than to have a null object in a document!
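A small sketch of that advice applied to the values in the error log: the ordinal suffix ("15th") is stripped before the date is built, and anything that does not match yields null instead of an error. The helper name `parseAirDate` is hypothetical, and treating air dates as UTC midnight is an assumption:

```javascript
// Hypothetical helper, not taken from the project: turns wiki date strings
// such as "June 15th, 2014" into a Date, or null when they cannot be parsed.
const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

function parseAirDate(raw) {
  if (typeof raw !== "string") return null;
  // Expected shape: "<Month> <day><optional st/nd/rd/th>, <year>"
  const match = /^\s*([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\s*$/.exec(raw);
  if (!match) return null;
  const month = MONTHS.indexOf(match[1].toLowerCase());
  if (month === -1) return null;
  const day = Number(match[2]);
  const year = Number(match[3]);
  // Assumption: air dates carry no time zone, so UTC midnight is used.
  const date = new Date(Date.UTC(year, month, day));
  // Reject days that overflow into the next month, e.g. "June 31st".
  if (date.getUTCDate() !== day) return null;
  return date;
}
```

A filler could then store `parseAirDate(episode.airDate)` and end up with a null field for the odd malformed page rather than a failed import.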

Legenzoo commented 8 years ago

airDate is newly introduced by the scraper. The filler is not yet transforming the date into the required type, and it is not ignoring it either, because the property is in the model. This still needs to be done.

sacdallago commented 8 years ago

@adiolis can you open up an issue for this if you haven't done so already? :)

Legenzoo commented 8 years ago

@sacdallago Done. I think I have finished filling the characters with all details.

@kordianbruck Please clear the characters collection and start filling it again ;)

gyachdav commented 8 years ago

Please load it into your public API. I wanna take a look.

On Mar 5, 2016, at 12:20 PM, Michael Legenc notifications@github.com wrote:

@sacdallago Done. I think, i finished the characters filling with all details.

kordianbruck commented 8 years ago

@gyachdav done, just updated the server.

Legenzoo commented 8 years ago

We still need the others to help us with this task... @kordianbruck @togiberlin @boriside

theocheslerean commented 8 years ago

The wiki scraper is all done.