Closed: ShilpaGhanashyamGore closed this issue 7 years ago
Ideally you would use the same parameters as Group 2: https://github.com/MusicConnectionMachine/UnstructuredData/issues/106
@kordianbruck the `-d` and `-t` parameters could be the same, but `-b` is only needed for Group 2
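To illustrate the shared flags, here is a minimal sketch of how a CLI could parse `-d` and `-t` (plus the Group-2-only `-b`). The actual flag semantics are defined in the linked Group 2 issue; this only shows a generic parsing shape, not the real `cli.js` implementation.

```javascript
// Minimal flag parser sketch: collects "-x value" pairs from argv.
// Flag meanings (-d, -t, -b) are defined elsewhere; this is generic.
function parseArgs(argv) {
  const opts = {};
  for (let i = 0; i < argv.length; i++) {
    if (argv[i].startsWith('-')) {
      opts[argv[i].slice(1)] = argv[i + 1]; // flag name -> next token
      i++; // skip the consumed value
    }
  }
  return opts;
}

// e.g. node cli.js -d mydb -t 4
console.log(parseArgs(['-d', 'mydb', '-t', '4']));
```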
@ShilpaGhanashyamGore tested your `cli.js` on a local DB today, works well 👍
One minor thing: error checks are missing (see Gitter). I wanted to push a small improvement but don't have write permissions here T_T
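As an illustration of the kind of error checks meant here, the sketch below guards a DB result before using it. The row shape (`name` field) is hypothetical; the real checks would depend on what `cli.js` actually queries.

```javascript
// Hedged sketch of defensive checks on a DB result:
// tolerate "no rows" gracefully, but fail loudly on a malformed row
// instead of crashing later on an undefined property.
function firstArtistName(rows) {
  if (!Array.isArray(rows) || rows.length === 0) {
    return null; // empty result is not an error worth crashing on
  }
  const row = rows[0];
  if (!row || typeof row.name !== 'string') {
    throw new TypeError('unexpected row shape from DB');
  }
  return row.name;
}

console.log(firstArtistName([{ name: 'Mozart' }])); // -> Mozart
console.log(firstArtistName([])); // -> null
```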
@ShilpaGhanashyamGore one more thing: in `api/api/dsap/artists.js` -> `findAllArtists()`, one of the attributes (`picture`) does not exist as a column in the DB, at least when the DB is populated by StructuredData's `cli.js`.
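One way to avoid this class of mismatch is to intersect the requested attributes with the columns that actually exist before querying. The column list below is purely an example, not the real artists schema.

```javascript
// Sketch: drop requested attributes that are not real DB columns,
// so a query like findAllArtists() does not break when e.g. `picture`
// is missing from the schema. Column list here is hypothetical.
const artistColumns = ['id', 'name', 'dateOfBirth']; // example columns

function safeAttributes(requested, available) {
  return requested.filter(attr => available.includes(attr));
}

console.log(safeAttributes(['id', 'name', 'picture'], artistColumns));
// -> ['id', 'name'] (picture is skipped)
```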
@nyxathid: yes, `picture` is not part of the data we populate in our database. My pull request https://github.com/MusicConnectionMachine/api/pull/76 has the final schema. I will submit it by EOD. Some Travis-related failures still need to be fixed.
Hey @nyxathid, our cli.js still needs some work. Those are the issues we will solve next
@kordianbruck @sacdallago regarding the spawning of worker threads: this does not really benefit our scraping, because the CPU isn't going to be our limiting factor; the throttling by the webpages is. So multiple threads would only help for populating the database, right? Also, our CLI is going to be executed on a separate VM, so we can/should always make use of all available CPUs, I guess?
@TimHenkelmann yes, there will be a limiting factor in how fast the websites allow you to access them. But you can surely run all four scrapes in parallel, each on its own thread/core, writing to the databases. => https://github.com/MusicConnectionMachine/StructuredData/issues/64
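The parallel shape suggested above can be sketched as follows. In the real `cli.js` each scrape would run in its own process or worker; here each scraper is a stub so only the concurrency pattern is visible. Only the three sources named in this thread are listed; the fourth is not specified here.

```javascript
// Sketch: kick off all scrapes concurrently and wait for all of them.
// `scrape` is a stand-in for the real, rate-limited scraper per source.
const sources = ['musicbrainz', 'dbpedia', 'worldcat'];

function scrape(source) {
  // real version: fork a process/worker per source and stream to the DB
  return Promise.resolve(`${source}: done`);
}

Promise.all(sources.map(scrape)).then(results => console.log(results));
```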
@kordianbruck Yep, just implemented that in #67. Threading for populating the DB still needs to be done once we've implemented this correctly (with entities, relations, etc.)
Implemented point 3 ("entry in entities table first") in #69
Implemented point 2 ("fix musicbrainz scripts") in #60 and #72
Implemented point 3 ("adjust population step") in #73.
How is the status on that?
Can I already populate things into the DB with `node cli.js -s dbpedia`?
As I see it, it only creates JSON files, and it takes forever on my local machine...
Do you have a DB dump I can import into my local DB? (just like @vviro asked in your channel)
@Sandr00 Yes, as soon as @ShilpaGhanashyamGore, @angelinrashmi2000 or @LukasNavickas have reviewed, approved and merged #73, you can use it to populate the DB. At the moment the CLI does not take multiple websites as input; this still needs to be fixed, so you would need to execute dbpedia and musicbrainz separately. The Worldcat script still needs to be added, as in #74. Yes, the scraping takes quite some time due to throttling issues... Unfortunately, I don't have a DB dump
@Sandr00 Just letting you know that I'm currently running the scraping, so the population should hopefully be finished by tomorrow morning... Btw, as soon as #78 is merged, the input parameters for the CLI change and one can scrape multiple webpages by executing e.g. `node cli.js musicbrainz dbpedia`
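The new invocation style with positional source names could look like the sketch below. The source-to-scraper mapping is hypothetical; it just shows dispatching each positional argument to its scraper.

```javascript
// Sketch of the post-#78 CLI shape: positional source names instead
// of a -s flag. The scraper functions here are illustrative stubs.
const scrapers = {
  musicbrainz: () => 'scraping musicbrainz',
  dbpedia: () => 'scraping dbpedia'
};

function run(argv) {
  // keep only arguments that name a known source, then dispatch each
  const sources = argv.filter(s => scrapers[s]);
  return sources.map(s => scrapers[s]());
}

// e.g. node cli.js musicbrainz dbpedia
console.log(run(['musicbrainz', 'dbpedia']));
```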
Thanks for the update 👍 Sounds good.
So this is done?
Closing.
Have an entry-point script which takes the source as an input parameter, extracts the data, and populates a Postgres DB (for now it shall be a local DB)