MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.

Sample Data #65

Closed: Henni closed this issue 7 years ago

Henni commented 7 years ago

It would be great if you could give us some example data by "simply" scraping some Wikipedia pages.

This data could be used by @MusicConnectionMachine/group-3 and @MusicConnectionMachine/group-4, and it would help us expand the example data for the visualization groups (see https://github.com/MusicConnectionMachine/RelationshipsG3/issues/21).

Is this doable by Monday, so that we have a first functioning dataset?

nbasargin commented 7 years ago

It should be doable. Could you provide us with a list of URLs for the pages you want? Something like: ["https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "another url", ...]

Henni commented 7 years ago

My idea would be to start at https://en.m.wikipedia.org/wiki/Category:Classical_music and scrape all pages within a depth of 2.

If you need a link list for this I'll be able to generate this later today.
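For reference, generating such a list could look roughly like the sketch below, using the public MediaWiki API. The endpoint and parameters are the standard ones, but the depth-2 traversal, the namespace handling, and the output file name are assumptions, not the script actually used here.

```python
# Sketch: collect article URLs from Category:Classical_music down to depth 2
# via the public MediaWiki API. Illustrative only, not the code used in this issue.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category):
    """Yield (title, namespace) for all members of a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["ns"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def collect(category, depth):
    """Return article URLs reachable from `category` within `depth` levels of subcategories."""
    urls, queue, seen = set(), [(category, depth)], set()
    while queue:
        cat, d = queue.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for title, ns in category_members(cat):
            if ns == 14 and d > 0:          # namespace 14 = subcategory
                queue.append((title, d - 1))
            elif ns == 0:                   # namespace 0 = article
                urls.add("https://en.wikipedia.org/wiki/" + title.replace(" ", "_"))
    return sorted(urls)

if __name__ == "__main__":
    with open("urls.txt", "w") as f:          # output file name is a placeholder
        f.write("\n".join(collect("Category:Classical_music", depth=2)))
```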

nbasargin commented 7 years ago

Sounds good! If you generate the list for us, we can concentrate on actually fetching the websites ;)

Henni commented 7 years ago

Here is a list of roughly filtered URLs. It contains about 4000 entries and still includes a few off-topic pages: urls.txt
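For anyone reprocessing the list, a minimal cleanup sketch is below. It only removes duplicates and non-article pages (Category:, Talk:, and similar), not topically off-topic entries, and the file names are placeholders.

```python
# Sketch: drop non-article entries and duplicates from a plain-text URL list.
# File names are placeholders, not the actual attachment names.
from urllib.parse import unquote, urlparse

NON_ARTICLE_PREFIXES = ("Category:", "Talk:", "File:", "Template:", "Portal:", "Wikipedia:", "Help:")

def is_article_url(url):
    parsed = urlparse(url.strip())
    if "wikipedia.org" not in parsed.netloc or not parsed.path.startswith("/wiki/"):
        return False
    title = unquote(parsed.path[len("/wiki/"):])
    return bool(title) and not title.startswith(NON_ARTICLE_PREFIXES)

with open("urls.txt") as src, open("urls.filtered.txt", "w") as dst:
    seen = set()
    for line in src:
        url = line.strip()
        if url and url not in seen and is_article_url(url):
            seen.add(url)
            dst.write(url + "\n")
```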

nbasargin commented 7 years ago

Status update + Details:

I will run the code on my machine overnight and see what happens in the morning. Right now it would be easier to hire a 10-year-old to copy-paste the data from Wikipedia directly. Everything would be much nicer if each WET file contained more relevant URLs.
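For context, the overall shape of this step looks roughly like the sketch below. It uses Python with the warcio library; the library choice, the example WET path, and the output file names are assumptions rather than the project's actual code.

```python
# Sketch: stream one Common Crawl WET file and keep only records whose
# target URI is in the Wikipedia URL list. Illustrative only.
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path; real WET paths come from the crawl's wet.paths.gz listing.
WET_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2017-09/segments/.../wet/....warc.wet.gz"

with open("urls.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

response = requests.get(WET_URL, stream=True)
matches = 0
for record in ArchiveIterator(response.raw):
    if record.rec_type != "conversion":          # WET text records are 'conversion' records
        continue
    uri = record.rec_headers.get_header("WARC-Target-URI")
    if uri in wanted:
        matches += 1
        text = record.content_stream().read().decode("utf-8", errors="replace")
        with open(f"page-{matches}.txt", "w", encoding="utf-8") as out:  # placeholder output name
            out.write(uri + "\n\n" + text)
print(f"{matches} relevant pages in this WET file")
```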

nbasargin commented 7 years ago

Something bad happened around 5 AM, so I could only download and process 27 WET files and extract 33 pages in total: combined-wiki-data-from-27-WETs.zip

nbasargin commented 7 years ago

Update: after some optimization the processing runs much faster. Here are 176 pages: combined-wiki-data-from-153-WETs.zip
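The thread does not say what the optimization was. One generic speed-up for this kind of matching (purely illustrative, not necessarily what was done here) is to normalize URLs and check membership in a hash set instead of scanning the whole list for every record:

```python
# Illustrative only: constant-time URL matching with a normalized hash set.
from urllib.parse import unquote, urlparse

def normalize(url):
    """Map URL variants (scheme, mobile subdomain, percent-encoding) to one key."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.replace("en.m.wikipedia.org", "en.wikipedia.org")
    return host + unquote(parsed.path)

with open("urls.txt") as f:                      # placeholder file name
    wanted = {normalize(line) for line in f if line.strip()}

def is_relevant(target_uri):
    # Set lookup is O(1), so the check no longer scales with the size of the URL list.
    return normalize(target_uri) in wanted
```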

kordianbruck commented 7 years ago

@Henni does this work for you? Do you need more?

Henni commented 7 years ago

@nyxathid the "~50% success rate" is caused by the Common Crawl API and not by my URL list, right?

> about 1 relevant page per WET file WHY??!

This sounds like bad luck.

The output looks good to me. We'll run that through our algorithms and see what it brings up. I'll also discuss the format with my team and might give you some feedback soon.

But this issue can be closed. Any follow-up will happen in new (or other) issues.

nbasargin commented 7 years ago

@Henni Final update: 765 pages combined-wiki-data-from-639-WETs.zip

Looks like Wikipedia pages are almost evenly distributed among the WET files (33 pages from 27 WETs, 176 from 153, and 765 from 639, i.e. roughly 1.2 relevant pages per WET). Now the issue is really closed :D I will concentrate on Azure now, so please try to work with this data for a while. Once Azure is up, you can get fresh data directly from there.