Fix data consistency

lukasstreit commented 7 years ago

What does this guy want?

When working with your data I noticed several issues. So far we (group 2) are mostly filtering out bad entries in your data but there's some stuff that seems like it could easily be fixed and save other groups from a lot of trouble.

The problems

1. Duplicate terms: There are currently lots of duplicates in the db, in particular for instruments. This was already addressed in #75.

2. "-tar" and "*" in instruments: These two terms occur in instruments. They are just the ones I saw; there might be more nonsense terms like that. Would be great if you guys could filter for that.

3. Lists of instruments in instruments: Instruments contains entries that are basically lists of instruments, such as "Synthesizers, drum machine, electric violin, keyboards, guitar, steel guitar, Transicord". Maybe you could check for commas?

4. "dbr:" prefix in works: Example: "dbr:Mirrorcle World". There are lots of terms like that in the works table; an easy fix would be to filter for this. Maybe this is just for one particular data source?
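To make the four points concrete, here is a rough sketch of the kind of cleanup pass they describe. It is not the fix that went into the repo. The function name `clean_terms`, the `split_commas` flag, and the choice to strip a leading "-" (so "-tar" becomes "tar") are all assumptions made for illustration.

```python
DBR_PREFIX = "dbr:"  # prefix seen in the works table (point 4)


def clean_terms(raw_terms, split_commas=True):
    """Return cleaned, de-duplicated terms in first-seen order.

    Hypothetical helper: names and rules are illustrative assumptions.
    """
    seen = set()
    cleaned = []
    for raw in raw_terms:
        # Point 3: an entry may really be a comma-separated list of terms.
        # Splitting is optional because titles of works can contain commas.
        parts = raw.split(",") if split_commas else [raw]
        for part in parts:
            term = part.strip()
            # Point 4: drop the "dbr:" prefix if present.
            if term.startswith(DBR_PREFIX):
                term = term[len(DBR_PREFIX):]
            # Point 2: strip stray "*" and "-" characters around the term;
            # an entry that is only such characters becomes empty.
            term = term.strip("*- \t")
            if not term:
                continue
            # Point 1: treat terms differing only in case as duplicates.
            key = term.casefold()
            if key not in seen:
                seen.add(key)
                cleaned.append(term)
    return cleaned
```

For instruments one would call `clean_terms(entries)`; for works, `clean_terms(titles, split_commas=False)` keeps titles with commas intact. Whether case-insensitive matching is the right duplicate rule depends on the data and is only a guess here.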

Prioritization

Sorry for packing this much stuff in one issue but I wanted to share the things that I noticed somewhere. We are going to start our large processing run very soon, but it would be great if you guys could address some of these problems. For data quality and processing performance in our group it would be good if points 3 and 4 could be fixed ASAP. Useless terms just slow down our computation which basically translates to unnecessary monetary cost. Other groups would most likely benefit from some fixes here, too.

TimHenkelmann commented 7 years ago

Hi and sorry for the late reply, I was out of town. Just fixed all named issues in #88. Regarding point 2: "tar" is an instrument, therefore I only filtered the "*" character for now.

kordianbruck commented 7 years ago

@TimHenkelmann so this is done then? How do we update the data to include those fixes?

TimHenkelmann commented 7 years ago

Yep, this is done! But I'm not sure how to update Group 1's data without interfering with the data that the other teams added...

kordianbruck commented 7 years ago

@sacdallago ideas?

sacdallago commented 7 years ago

@kordianbruck not really. There's no mechanism for upserting stuff as of now, so this is 💩. BUT, AFAIK the prod db has not been populated with unstructured data or relationships data, right @MusicConnectionMachine/group-2, @MusicConnectionMachine/group-3 and @MusicConnectionMachine/group-4? Please give an answer soon, so we can think of repopulating. In which case:

1. Dump the db (mcmprod AND MAKE SURE IT'S MCMPROD and not MCM**!!!!!!).
2. Populate the db from scratch running all scrapers.

PLEASE @TimHenkelmann MAKE THIS HAPPEN AFTER MY PR IS MERGED!
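For readers unfamiliar with the term, the upsert mechanism mentioned as missing above could look roughly like the sketch below. The thread does not say which database engine mcmprod runs on; SQLite is used here only because it ships with Python, and its `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer. The table layout, the `source` column, and the function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical instruments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instruments ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT UNIQUE NOT NULL, "
        "source TEXT)"
    )
    return conn


def upsert_instrument(conn, name, source):
    """Insert a row, or update the existing row that has the same name."""
    conn.execute(
        """
        INSERT INTO instruments (name, source) VALUES (?, ?)
        ON CONFLICT(name) DO UPDATE SET source = excluded.source
        """,
        (name, source),
    )
    conn.commit()
```

Because the existing row is updated in place rather than deleted and re-inserted, its `id` stays the same, so rows in other tables that point at it would keep working.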

TimHenkelmann commented 7 years ago

sooo... what are the next steps on this one?

kordianbruck commented 7 years ago

So maybe we will just make a new mcmprod called mcmproduction and rescrape it into there. We have to import the new data from G2 anyways, so might as well do this step again? @gyachdav @pfent

gyachdav commented 7 years ago

Not sure I understand what needs to be decided here. G2 already processed the data based on the list provided by G1. We're not going to rerun the G2 pipeline. So what are we doing here?

(Sent by email in reply to Kordian Bruck's comment of May 2, 2017, 2:26 PM.)

felixschorer commented 7 years ago

Our data links to the entities in the mcmprod DB via a foreign key... so... @kordianbruck

kordianbruck commented 7 years ago

Ok, then just leave it at that. In case we need to rerun the scraping, the modifications will be in. No need to mess around with this at this point.

MusicConnectionMachine / StructuredData

Fix data consistency #86
