MusicConnectionMachine / StructuredData

In this project we will be scanning structured online resources such as DBPedia, Worldcat, MusicBrainz, IMSLP and other databases
GNU General Public License v3.0

Data Maintenance #79

Open kordianbruck opened 7 years ago

kordianbruck commented 7 years ago

Hey guys,

one thing that might still be open: can the scraping CLI command be run multiple times without inserting duplicates?

We want to maintain the data for at least the coming 2-3 years. So running the CLI multiple times should check whether the rows are already in the database and update the existing entities accordingly.

We want to put this into a cronjob and run it every week or month.

Thanks

sacdallago commented 7 years ago

http://docs.sequelizejs.com/en/latest/api/model/#upsertvalues-options-promisecreated might be a place to start: substitute upserts for the inserts, but only if you have logic in place to match two documents. Eventually you might need to define a distance between two objects, dictated by the objects' fields (or features), and a threshold by which you consider two objects similar :) But for now, I would say that upsert instead of create should do, based on name & birthdate matching?
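
To make the name & birthdate idea concrete, here is a rough sketch of what an idempotent import could look like. It uses a plain in-memory `Map` instead of the real database so it runs on its own; `matchKey`, `upsert`, `store` and the sample records are all made up for illustration and are not from the project. With Sequelize, `Model.upsert` would presumably play the role of `upsert` here, assuming the table has a unique constraint on the matching columns.

```javascript
// Hypothetical sketch: idempotent import keyed on name + birthdate.
// `store` stands in for a database table; none of these names come from the project.

// Build a normalized match key so that differences in case and whitespace
// do not produce two rows for the same person.
function matchKey(artist) {
  const name = artist.name.trim().toLowerCase().replace(/\s+/g, " ");
  return `${name}|${artist.birthdate}`;
}

// Insert the record if its key is new, otherwise merge the new fields into
// the existing row. Returns true when a row was created.
function upsert(store, artist) {
  const key = matchKey(artist);
  const existing = store.get(key);
  store.set(key, { ...existing, ...artist });
  return existing === undefined;
}

// Illustrative sample data, as two scrapers might deliver it.
const store = new Map();
const scraped = [
  { name: "Johann Sebastian Bach", birthdate: "1685-03-31", source: "dbpedia" },
  { name: "  johann sebastian  bach ", birthdate: "1685-03-31", place: "Eisenach" },
];

// Simulate the cronjob: run the same import twice.
for (let run = 0; run < 2; run++) {
  for (const artist of scraped) upsert(store, artist);
}

// Number of rows after two runs over two matching records.
console.log(store.size);
```

Both records collapse into one row and the second run adds nothing, which is the behaviour the cronjob needs. Exact key matching like this would still miss spelling variants; that is where the distance + threshold approach would come in later.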