MusicConnectionMachine / Relationships

Sample Data from Group-2 #27

Closed ansjin closed 7 years ago

ansjin commented 7 years ago

We parsed the sample WET file provided by you here https://github.com/MusicConnectionMachine/UnstructuredData/issues/58.

The file looks something like this:

Chicago Reader | Articles & Archives | flammable
Switch to the mobile version of this page.
Newsletters
Follow us
Twitter
Facebook
RSS
Mobile
Username
View Profile
Edit Profile
Log Out
Log in
Create Account
The Chicago Reader
News & Politics
Music
Arts & Culture
Film
Food & Drink
Classifieds
Browse News & Politics
News & Politics home page
Anne Ford | Chicagoans
John Greenfield | Transportation
Ben Joravsky | Politics
Michael Miner | Media
Dan Savage | Savage Love
Browse Music
Music home page
Gossip Wolf
........ 
Annie Zaleski
Year
Select a year
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
.................................and so on (complete file attached)!!

This WET file contains a lot of user comments and other unnecessary data rather than actual data about any composer. Most of it appears to come from blogs.

How can we find relationships or timeline events in this file?

If you have extracted any new data, can you share it with us, @MusicConnectionMachine/group-2? Preferably the scraped Wikipedia page of a composer!

@vviro our algorithms cannot detect anything in this data. The output of one algorithm for this text is:

{ start: '1', end: '2014', event: 'Sort by YearTitleType Order AscDesc Displaying 1 - 1 of 1Export 1 results:BibTexTaggedXML 2014 H.' }, { start: '2014', event: '709-736, 2014.' } ]

example-output.txt

vviro commented 7 years ago

@ansjin are all files in the provided WET like that? What percentage is different or better? Or is it just 100% rubbish? The presence of rubbish is not surprising, but it should not prevent us from extracting meaningful relations from acceptable sources.

vviro commented 7 years ago

The point is, even if 95% of the input texts are rubbish, this shouldn't matter: the remaining 5% should deliver enough signal for meaningful results, which can later be separated from the rubbish relations (e.g. using simple counting).
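To make the "simple counting" idea concrete, here is a minimal sketch. It assumes relations come out of the algorithms as (subject, relation, object) tuples, which is only an assumption for illustration; the function name `frequent_relations`, the threshold, and the sample tuples are all made up and are not real output from this project.

```python
from collections import Counter

def frequent_relations(relations, min_count=2):
    """Keep only relations that were extracted at least min_count times.

    The idea: a relation produced by noise (menus, comment threads) tends
    to show up once, while a real one is repeated across many pages.
    """
    counts = Counter(relations)
    return {rel: n for rel, n in counts.items() if n >= min_count}

# Hypothetical extraction results, invented for this example.
extracted = [
    ("Mozart", "born_in", "Salzburg"),
    ("Mozart", "born_in", "Salzburg"),
    ("Sort by", "YearTitleType", "2014"),  # noise from a navigation page
    ("Mozart", "born_in", "Salzburg"),
]

kept = frequent_relations(extracted, min_count=2)
for relation, count in kept.items():
    print(count, relation)
```

The right threshold would depend on how much data is processed; with only a handful of pages, counting cannot separate signal from noise.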

ansjin commented 7 years ago

@vviro Yes, the file is like that. It looks like it comes from blogs with lots of comments and other content that is unrelated to what our algorithms need. Please see the attached WET file input.txt

I will post the complete output of our algorithm here shortly so we can see whether it contains any meaningful data. The algorithm cannot run on our PC, so I am hosting and running it on Azure.

vviro commented 7 years ago

@ansjin So the claim is that this WET file contains only relevant texts? If so, maybe the filter needs to be tweaked to exclude different classes of obviously irrelevant pages (if they are relatively easy to identify).

ansjin commented 7 years ago

@vviro Yes, we need to build or find a filter that can exclude such pages. I think it would be even better if such websites were not scraped and included in the WET files in the first place, so the filter is best applied at scraping time.
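As a rough idea of what such a filter could look like, here is a heuristic sketch. It assumes that navigation-heavy pages consist mostly of short lines without sentence punctuation, while article text has longer lines that end like sentences. The function name `looks_like_prose` and both thresholds are assumptions chosen for illustration, not tuned values, and the article text is invented.

```python
def looks_like_prose(text, min_words=8, min_ratio=0.3):
    """Return True if enough lines of the page look like full sentences.

    A line counts as prose when it has at least min_words words and ends
    with sentence punctuation. Menu entries such as "Follow us" fail both.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    prose = sum(
        1 for ln in lines
        if len(ln.split()) >= min_words and ln[-1] in ".!?"
    )
    return prose / len(lines) >= min_ratio

# Lines taken from the sample WET file quoted in this issue.
menu_page = "Newsletters\nFollow us\nTwitter\nFacebook\nRSS\nMobile"

# Invented article text, only for comparison.
article_page = (
    "Wolfgang Amadeus Mozart was born in Salzburg in 1756 and began composing as a child.\n"
    "He later moved to Vienna, where he wrote many of his best known operas and symphonies."
)

print(looks_like_prose(menu_page))
print(looks_like_prose(article_page))
```

A check like this could run either while scraping or right before the relationship algorithms. It will not catch comment threads written in full sentences, so it would only be a first pass.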

nbasargin commented 7 years ago

765 scraped wiki pages can be found here: https://github.com/MusicConnectionMachine/UnstructuredData/issues/65 I hope it helps ;)

We (group 2) are starting to test the complete pipeline. As soon as the DB is up and running we can provide more data.

ansjin commented 7 years ago

Check the CSV file here https://github.com/MusicConnectionMachine/api/issues/22#issuecomment-292573343

ansjin commented 7 years ago

It was given to us by group 2, so this issue can be closed.