MusicConnectionMachine / StructuredData

In this project we will be scanning structured online resources such as DBpedia, WorldCat, MusicBrainz, IMSLP, and other databases.
GNU General Public License v3.0

Create the CLI to extract data and populate local DB #49

ShilpaGhanashyamGore closed this issue 7 years ago

ShilpaGhanashyamGore commented 7 years ago

Have an entry-point script that takes the source as an input parameter, extracts the data, and populates a Postgres DB (for now it shall be a local DB).
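A minimal sketch of what such an entry point could look like, assuming a pg client and one scraper module per source; the scrapers/ module path, table name, and connection string are hypothetical, not the actual repo layout:

```js
#!/usr/bin/env node
// Hypothetical sketch of the entry point described above: take a source name,
// run its scraper, and write the results into a local Postgres DB.
// The scraper modules and table layout are assumptions, not the actual repo code.
const { Client } = require('pg');

const source = process.argv[2]; // e.g. "dbpedia" or "musicbrainz"
if (!source) {
  console.error('Usage: node cli.js <source>');
  process.exit(1);
}

async function main() {
  const scrape = require(`./scrapers/${source}`); // hypothetical per-source module
  const records = await scrape();                 // assumed to return an array of rows

  const client = new Client({ connectionString: 'postgres://localhost/mcm' });
  await client.connect();
  try {
    for (const r of records) {
      await client.query(
        'INSERT INTO artists (name, source) VALUES ($1, $2)', // assumed table
        [r.name, source]
      );
    }
  } finally {
    await client.end();
  }
}

main().catch(err => { console.error(err); process.exit(1); });
```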

kordianbruck commented 7 years ago

Ideally you would use the same parameters as Group 2: https://github.com/MusicConnectionMachine/UnstructuredData/issues/106

nbasargin commented 7 years ago

@kordianbruck the -d and -t parameters could be the same, but -b is only needed by group 2
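For illustration, a small sketch of how the shared flags could be read with a minimist-style parser; the exact semantics of -d and -t are defined in Group 2's issue above, and the meanings in the comments are assumptions:

```js
// Hypothetical flag parsing aligned with Group 2's CLI (UnstructuredData#106).
// Assumption: -d is the database connection string; -t is left opaque here.
// -b is intentionally not handled, since it is only needed by group 2.
const minimist = require('minimist');

const argv = minimist(process.argv.slice(2), {
  string: ['d', 't'],
  alias: { d: 'database' }
});

if (!argv.d) {
  console.error('Usage: node cli.js -d <database connection string> [-t <value>]');
  process.exit(1);
}

console.log('db:', argv.d, '-t:', argv.t);
```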

@ShilpaGhanashyamGore tested your cli.js on a local DB today, works well 👍
One minor thing is the missing error checks (see gitter). I wanted to push a small improvement but don't have write permissions here T_T
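For reference, a rough sketch of the kind of error checks meant here (the gitter discussion is not reproduced in this thread), assuming a pg connection; connectOrExit is a hypothetical helper:

```js
// Hypothetical sketch of the missing error checks: fail fast on bad arguments
// and on an unreachable database instead of crashing later with an unhandled rejection.
const { Client } = require('pg');

async function connectOrExit(connectionString) {
  if (!connectionString) {
    console.error('No database connection string given');
    process.exit(1);
  }
  const client = new Client({ connectionString });
  try {
    await client.connect();
  } catch (err) {
    console.error('Could not connect to the database:', err.message);
    process.exit(1);
  }
  return client;
}

module.exports = { connectOrExit };
```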

nbasargin commented 7 years ago

@ShilpaGhanashyamGore one more thing: api/api/dsap/artists.js -> findAllArtists(): one of the attributes (picture) does not exist as a column in the DB, at least when the DB is populated by StructuredData/cli.js.
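An illustrative sketch of the mismatch, assuming a Sequelize-style model on the api side; the model and attribute names are hypothetical:

```js
// If the artists table written by StructuredData/cli.js has no "picture"
// column, selecting it in findAllArtists() fails. Restricting the query to
// columns the import actually creates avoids that (names are illustrative).
function findAllArtists(models) {
  return models.Artist.findAll({
    attributes: ['id', 'name'] // no 'picture' until the schema provides it
  });
}

module.exports = { findAllArtists };
```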

ShilpaGhanashyamGore commented 7 years ago

@nyxathid: yes, picture is not part of the data we populate in our database. My pull request https://github.com/MusicConnectionMachine/api/pull/76 has the final schema. I will submit it by EOD; some Travis-related failures still need to be fixed.

TimHenkelmann commented 7 years ago

Hey @nyxathid, our cli.js still needs some work. These are the issues we will solve next.

TimHenkelmann commented 7 years ago

@kordianbruck @sacdallago Regarding the spawning of worker threads: this does not really benefit our scraping, because the CPU isn't going to be our limiting factor, the throttling of the webpages is. So multiple threads would only help for populating the database, right? Also, our cli is going to be executed on a separate VM, so we can/should always make use of all available CPUs, I guess?

kordianbruck commented 7 years ago

@TimHenkelmann Yes, the limiting factor will be how fast the websites allow you to access them. But you can surely run all four scrapes, one for each of the databases, in parallel on a thread/core each. => https://github.com/MusicConnectionMachine/StructuredData/issues/64
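A minimal sketch of that idea, one child process per source so each scrape gets its own core; it assumes cli.js accepts a single source name, which may differ from the actual interface:

```js
// Run all four scrapes in parallel, one child process per source.
// The source list is taken from the thread; error handling is minimal.
const { fork } = require('child_process');

const sources = ['dbpedia', 'musicbrainz', 'worldcat', 'imslp'];

const children = sources.map(source =>
  new Promise((resolve, reject) => {
    const child = fork('./cli.js', [source]); // own process => own core
    child.on('exit', code =>
      code === 0 ? resolve(source) : reject(new Error(`${source} exited with ${code}`))
    );
  })
);

Promise.all(children)
  .then(done => console.log('finished:', done.join(', ')))
  .catch(err => { console.error(err); process.exit(1); });
```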

TimHenkelmann commented 7 years ago

@kordianbruck Yep, just implemented that in #67. Threading for populating the db still needs to be done once we have implemented this correctly (with entities, relations, etc.).

TimHenkelmann commented 7 years ago

Implemented point 3 ("entry in entities table first") in #69
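A hypothetical sketch of the "entry in entities table first" idea using plain pg queries; the table and column names are assumptions, not the actual schema from #69:

```js
// Create the entities row first, then the artist row that references it,
// inside one transaction so a failed insert leaves no orphaned entity.
async function insertArtist(client, artist) {
  await client.query('BEGIN');
  try {
    const { rows } = await client.query(
      'INSERT INTO entities (type) VALUES ($1) RETURNING id',
      ['artist']
    );
    await client.query(
      'INSERT INTO artists (entity_id, name) VALUES ($1, $2)',
      [rows[0].id, artist.name]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}

module.exports = { insertArtist };
```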

TimHenkelmann commented 7 years ago

Implemented point 2 ("fix musicbrainz scripts") in #60 and #72

TimHenkelmann commented 7 years ago

Implemented point 3 ("adjust population step") in #73.

Sandr0x00 commented 7 years ago

What's the status on that? Can I already populate things into the db with node cli.js -s dbpedia? As far as I can see, it only creates JSON files, and it takes forever on my local machine... Do you have a DB dump I can import into my local DB? (just like @vviro asked in your channel)

TimHenkelmann commented 7 years ago

@Sandr0x00 Yes, as soon as @ShilpaGhanashyamGore, @angelinrashmi2000, or @LukasNavickas have reviewed, approved, and merged #73, you can use it to populate the db. At the moment the cli does not take multiple websites as input; this still needs to be fixed, so you would need to execute dbpedia and musicbrainz separately. The Worldcat script still needs to be added, as in #74. Yes, the scraping takes quite some time due to throttling issues... Unfortunately, I don't have a DB dump.

TimHenkelmann commented 7 years ago

@Sandr0x00 Just letting you know that I'm currently running the scraping, so the population should hopefully be finished by tomorrow morning... Btw, as soon as #78 is merged, the input parameters for the cli change and one can scrape multiple webpages by executing, e.g.,

node cli.js musicbrainz dbpedia
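A rough sketch of how several positional sources could be handled after such a change; whether #78 implements it exactly this way is not shown in this thread, and the scrapers/ paths are hypothetical:

```js
// Parse one or more positional source names and run their scrapers in order.
const requested = process.argv.slice(2); // e.g. ['musicbrainz', 'dbpedia']
const known = ['dbpedia', 'musicbrainz', 'worldcat', 'imslp'];

const unknown = requested.filter(s => !known.includes(s));
if (requested.length === 0 || unknown.length > 0) {
  console.error(`Usage: node cli.js <source...>  (one or more of: ${known.join(', ')})`);
  process.exit(1);
}

(async () => {
  for (const source of requested) {
    const scrape = require(`./scrapers/${source}`); // hypothetical module path
    await scrape(); // could also run in parallel, as sketched earlier
  }
})();
```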

Sandr0x00 commented 7 years ago

Thanks for the update 👍 Sounds good.

kordianbruck commented 7 years ago

So this is done?

Closing.