Closed Henni closed 7 years ago
@MusicConnectionMachine/group-3 if you agree with this approach, I would persist it in the wiki and create separate issues for each step.
@Henni looks good to me. One question: are we going to extract the sentences, or will that be done by @MusicConnectionMachine/group-1? Also, regarding page rank, @vviro mentioned something about that in issue #5. Please have a look.
@krishenk regarding group1 see https://github.com/MusicConnectionMachine/UnstructuredData/issues/40
Regarding page rank: I completely agree with @vviro's comment https://github.com/MusicConnectionMachine/RelationshipsG3/issues/5#issuecomment-284272220. This is also why I added the term reputability (also see https://github.com/MusicConnectionMachine/RelationshipsG4/issues/9#issuecomment-283922521). We should clear up the terms page rank and reputability at the meeting tomorrow.
@Henni is it already clear what the page rank and reputability will be based on? Is the idea here to extract the URLs from the HTML and use them as links? Is the code for doing this (going from a set of HTML documents to their page rank) already available or easily implementable, and is it clear how to run it on this dataset? (Maybe this is the wrong issue to ask this question and there is a better place...) I just wonder whether the relationship extraction step will require more attention than would be possible if reputability also has to be addressed. A word of caution here...
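To make the "extract the URLs from the HTML and use them as links" idea concrete, here is a minimal sketch using only the Python standard library. It is not code from this project: the names `LinkExtractor` and `extract_links` are hypothetical, and it assumes each document comes with the URL it was crawled from so that relative links can be resolved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # assumed: the URL the page was crawled from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all outgoing link URLs found in one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="b.html">b</a> <a href="http://other.org/">o</a></p>'
links = extract_links("http://example.org/a/", page)
```

Running this over every document would give one list of outgoing links per page, which is the raw material for a link graph. Whether that graph is dense enough on this dataset to say anything about reputability is exactly the open question raised above.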
@vviro Let me come back to this tomorrow. Our team will meet tomorrow morning and this is a topic I will bring up.
About that page rank: I'm just gonna leave these links here for you to further scout out
Mining the PageRank on a larger scale is against Google's ToS.
It seems Google does not provide their PageRank API anymore. Depending on the number of pages, we might have to implement it ourselves.
In my opinion, page rank (in whatever form) should be a topic we handle in the future. Our next step should be to get the relation extraction going. This should already give some kind of quality indication, which might already suffice.
SEOstats (that ugly PHP script - @sacdallago right?) offers other APIs in addition to the PageRank API. That's why it's in there ;)
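For a sense of what "implement it ourselves" could involve, below is a rough sketch of the textbook PageRank power iteration. It is not Google's implementation and not code from this project. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the input format (a dict mapping each page to the list of pages it links to) are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch).

    graph: dict mapping a page to the list of pages it links to.
    Returns a dict mapping every page to its rank; ranks sum to 1.
    """
    # Pages that are only ever linked to must be part of the node set too.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page example: "a" and "b" both link to "c", "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph "c" ends up ranked highest and "b", which nothing links to, lowest. On a real crawl the cost is dominated by building the link graph, so the number of pages matters more than the iteration itself.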
@Henni progress? done? needs work?
Let's count this one as done. The architecture itself is an ongoing process, but the decisions described here seem to be fine with everyone.
Idea: Build our application resembling a pipeline. This would look as follows:
Notes: