Pipeline Architecture - Githubissues

MusicConnectionMachine / RelationshipsG3

In this repository we will try to build and determine relationships between composers

GNU Affero General Public License v3.0


Pipeline Architecture #10

Closed by Henni 7 years ago

Henni commented 7 years ago

Idea: Build our application as a pipeline. It would look as follows:

   get sentences from database
-> run relationship extraction
-> [classify relationships]
-> calculate page rank and reputability
-> store result in database

Notes:

Sentences have to be stored in the database. Otherwise we have to add an additional step to extract them from the given URLs.
Relationship classification could already be done by the relationship extraction algorithm, which is why that step is optional.
Page rank and reputability idea: factor in how much we trust the page a sentence came from. For example, Facebook will probably return worse results than Wikipedia.
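
A minimal sketch of how the stages above could be chained, assuming each stage is a plain function over a list. Every name here (`Relationship`, `TRUST`, `run_pipeline`, the toy extractor and the trust values) is a hypothetical placeholder, not code from this repository, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class Relationship:
    subject: str
    verb: str
    obj: str
    source_url: str
    category: str = "unclassified"
    score: float = 0.0


# Assumed trust values per host; real values would have to be agreed on.
TRUST: Dict[str, float] = {"en.wikipedia.org": 0.9, "www.facebook.com": 0.3}
DEFAULT_TRUST = 0.5


def get_sentences(db: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Stage 1: read (sentence, url) pairs from the (stand-in) database."""
    return list(db)


def extract(sentences: List[Tuple[str, str]]) -> List[Relationship]:
    """Stage 2: toy extractor that only understands '<A> <verb> <B>.'."""
    found = []
    for text, url in sentences:
        words = text.rstrip(".").split()
        if len(words) == 3:
            found.append(Relationship(words[0], words[1], words[2], url))
    return found


def classify(rels: List[Relationship]) -> List[Relationship]:
    """Stage 3 (optional): map verbs to coarse relationship categories."""
    categories = {"taught": "teacher-student", "admired": "influence"}
    return [replace(r, category=categories.get(r.verb, "unclassified")) for r in rels]


def score(rels: List[Relationship]) -> List[Relationship]:
    """Stage 4: weight each relationship by how much its source host is trusted."""
    return [
        replace(r, score=TRUST.get(urlparse(r.source_url).netloc, DEFAULT_TRUST))
        for r in rels
    ]


def store(rels: List[Relationship], sink: List[Relationship]) -> List[Relationship]:
    """Stage 5: write results back (here: append to a list)."""
    sink.extend(rels)
    return sink


def run_pipeline(db: List[Tuple[str, str]], sink: List[Relationship]) -> List[Relationship]:
    return store(score(classify(extract(get_sentences(db)))), sink)
```

Because every stage takes and returns a list, the optional classification step can be dropped from `run_pipeline` without touching the other stages.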

Henni commented 7 years ago

@MusicConnectionMachine/group-3 if you agree with this approach, I will document it in the wiki and create separate issues for each step.

krishenk commented 7 years ago

@Henni looks good to me. One question: are we going to extract the sentences, or will that be done by @MusicConnectionMachine/group-1? Also, regarding page rank, @vviro mentioned something about that in issue #5. Please have a look.

Henni commented 7 years ago

@krishenk regarding group1 see https://github.com/MusicConnectionMachine/UnstructuredData/issues/40

Regarding page rank: I completely agree with @vviro's comment (https://github.com/MusicConnectionMachine/RelationshipsG3/issues/5#issuecomment-284272220). This is also why I added the term reputability (see also https://github.com/MusicConnectionMachine/RelationshipsG4/issues/9#issuecomment-283922521). We should clear up the terms page rank and reputability at the meeting tomorrow.

vviro commented 7 years ago

@Henni is it already clear what the page rank and reputability will be based on? Is the idea here to extract the URLs from the HTML and use them as links? Is the code for doing this (going from a set of HTML documents to their page rank) already available or easy to implement, and is it clear how to run it on this dataset? (Maybe this is the wrong issue for this question and there is a better place...) I just wonder whether the relationship extraction step will require more attention than we can give it if reputability is also to be addressed. A word of caution here...
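
For the "extract the URLs from the HTML and use them as links" part, a rough sketch using only the Python standard library could look like this. The names `LinkExtractor` and `extract_links` are made up for illustration, and nothing here says how the project's dataset is actually stored.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (e.g. <a name="...">).
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return absolute URLs of all outgoing links in one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Running this over every document gives one list of outgoing links per page, which is the link graph a page rank computation would need as input.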

Henni commented 7 years ago

@vviro Let me come back to this tomorrow. Our team will meet tomorrow morning and this is a topic I will bring up.

kordianbruck commented 7 years ago

About that page rank: I'm just gonna leave these links here for you to further scout out

Mining PageRank at a larger scale is against Google's ToS

RBirkeland commented 7 years ago

It seems Google does not provide its PageRank API anymore. Depending on the number of pages, we might have to implement it ourselves.
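
If it does come to implementing it ourselves, the textbook power-iteration version of PageRank is short. The sketch below assumes the link graph fits in memory as a dict mapping each page to the pages it links to. The function name, the damping factor of 0.85 and the fixed iteration count are illustrative choices, not a project decision.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank over an in-memory link graph."""
    # Every page that appears as a source or as a link target gets a rank.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Each page starts with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

The returned ranks sum to 1, so they can be compared directly across pages or combined with a separate reputability weight.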

Henni commented 7 years ago

In my opinion, page rank (in whatever form) is a topic we should handle in the future. Our next step should be to get the relation extraction going. That should already give some indication of quality, which might suffice.

kordianbruck commented 7 years ago

SEOstats (that ugly PHP script - @sacdallago right?) offers other APIs in addition to the PageRank API. That's why it's in there ;)

kordianbruck commented 7 years ago

@Henni progress? done? needs work?

Henni commented 7 years ago

Let's count this one as done. The architecture itself is an ongoing process, but the decisions described in here seem to be fine with everyone.