MusicConnectionMachine / RelationshipsG4

In this repository we will try to build and determine relationships between composers
GNU Affero General Public License v3.0

How do we rank links? #9

Open Sandr0x00 opened 7 years ago

Sandr0x00 commented 7 years ago

What are trustworthy links? Why do we trust some links more than others? Who says that musicbrainz.org is more trustworthy than mymusicblog.wordpress.com? Do we say that? (For thousands of links?)

sacdallago commented 7 years ago

The correct term here is reputability.

  1. https://zeltser.com/lookup-malicious-websites/
  2. https://www.edb.utexas.edu/petrosino/Legacy_Cycle/mf_jm/Challenge%201/website%20reliable.pdf
  3. http://journalism.about.com/od/reporting/a/Eight-Ways-To-Tell-If-A-Website-Is-Reliable.htm

pfent commented 7 years ago

One of the most commonly taught algorithms is HITS, which assigns each page an "authority" value. It is not strictly a "trustworthiness" value, but it might be an indicator for it.
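
For reference, a minimal sketch of the HITS iteration in plain Python, assuming the link graph is given as a list of (source, target) pairs. The function names `hits` and `normalize` and the example node names are made up for illustration; this is not code from the project.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dict to unit Euclidean length (unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking to a node.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the fresh authority scores of all pages a node links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two blogs link to musicbrainz, one blog links to the other.
example_edges = [
    ("blog_a", "musicbrainz"),
    ("blog_b", "musicbrainz"),
    ("blog_b", "blog_a"),
]
hub_scores, authority_scores = hits(example_edges)
```

In this toy graph the node that both blogs link to ends up with the highest authority score, which is the sense in which authority could serve as a rough proxy for reputability.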

simonzachau commented 7 years ago

Maybe "an indicator for it" is the best we can get. How big would our focused subgraph be for the HITS algorithm? For example, suppose we want to score only the following relationship: (Beethoven - inspired by - Mozart). Then the sources stored at 'Beethoven' and 'Mozart' by Project A could be the limited root set. However, when we judge a source as a whole, are all sources in the database our root set? If so, would we need to scrape one level deeper for the base set?
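
To make the root set / base set question concrete, here is a rough sketch of expanding a root set by following outgoing links. `build_base_set` and the `outgoing_links` callback are hypothetical names; the callback stands in for whatever scraper would return the links found on a page. Note that the original HITS formulation also adds pages that link into the root set, which needs a backlink source and is left out here.

```python
def build_base_set(root_set, outgoing_links, depth=1):
    """Expand root_set by following outgoing links for `depth` levels.

    outgoing_links is a callable (assumed, e.g. a scraper) that maps a URL
    to an iterable of URLs it links to.
    Returns (base_set, edges) where edges is a set of (source, target) pairs.
    """
    base = set(root_set)
    frontier = set(root_set)
    edges = set()
    for _ in range(depth):
        next_frontier = set()
        for url in frontier:
            for target in outgoing_links(url):
                edges.add((url, target))
                if target not in base:
                    base.add(target)
                    next_frontier.add(target)
        # Only newly discovered pages are expanded in the next level.
        frontier = next_frontier
    return base, edges


# Hypothetical link data standing in for scraped pages.
fake_links = {"a": ["b", "c"], "b": ["d"], "c": []}
base_set, link_edges = build_base_set({"a"}, lambda url: fake_links.get(url, []))
```

With `depth=1` this corresponds to scraping one level beyond the root set; the size of the result is what would determine how large the focused subgraph gets.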

pfent commented 7 years ago

If we're using an optimized library (NetworkX is pretty good, but that's Python…), with all that power iteration eigenvector calculation magic, we can probably go pretty large with the focused subgraph, maybe even just use all relevant sources. But it's hard to tell without any measurements.

@sacdallago maybe you know a JavaScript library similar to NetworkX?

sacdallago commented 7 years ago

@pfent I unfortunately don't :( But some NPM digging might make pretty things surface. One week ago I found two groups attempting to write CNNs in JS, so I'm fairly sure there's a package for everything :D :D

simonzachau commented 7 years ago

Our current idea:

  1. get all links in database (root set)
  2. scrape outgoing links (for base set)
  3. generate network of all database links
  4. HITS: our plan is to try to connect NetworkX to Node.js
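
Steps 1 to 3 above could be sketched roughly as follows. This assumes the scraped data is available as a mapping from each database link to the links found on that page, and it collapses pages to hostnames so that a source is scored as a whole; both the `scraped` shape and the host-level collapsing are assumptions for illustration, not something decided in this thread.

```python
from urllib.parse import urlsplit


def host_edges(scraped):
    """Collapse page-level links into a set of (source_host, target_host) edges.

    scraped: dict mapping a page URL to the list of URLs found on that page
    (an assumed shape for the scraper output).
    """
    edges = set()
    for page, targets in scraped.items():
        src = urlsplit(page).hostname
        for target in targets:
            dst = urlsplit(target).hostname
            # Skip URLs without a host (mailto:, relative paths) and same-site links.
            if src and dst and src != dst:
                edges.add((src, dst))
    return edges


# Hypothetical scraper output for a single database link.
example_scraped = {
    "https://mymusicblog.wordpress.com/post": [
        "https://musicbrainz.org/artist/x",
        "https://mymusicblog.wordpress.com/about",
        "mailto:me@example.com",
    ],
}
network = host_edges(example_scraped)
```

The resulting edge set is the kind of input a HITS implementation would take in step 4, whichever library ends up being used.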

vviro commented 7 years ago

Regarding the ranking, see also my comment at https://github.com/MusicConnectionMachine/RelationshipsG3/issues/5#issuecomment-284272220. This would be a cool thing to try out, but ideally the time to approach it is once we see that we really need this refinement, and we need to get to that point first.

sacdallago commented 7 years ago

@simonzachau Spawning child processes and assigning them jobs in other languages is always a bit of overkill! Avoid that as much as possible, and really only do it if there is no other way.

FelsyWaschbaer commented 7 years ago

Maybe we can use one of these: https://github.com/graphology/graphology-hits, https://www.npmjs.com/package/ngraph.hits

see experimental branch

sacdallago commented 7 years ago

https://www.npmjs.com/package/graphology-hits was last published 2 weeks ago. What that tells me is that someone else is trying to do something similar and hasn't found a solution either, and that the package is being maintained (as opposed to the year-old one).

simonzachau commented 7 years ago

@sacdallago thank you for reviewing our findings! That's why we also opted to try graphology-hits rather than the unmaintained ngraph.hits.