It should be doable. Could you provide us with a list of URLs for the pages you want? Something like: ["https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "another url", ...]
My idea would be to start at https://en.m.wikipedia.org/wiki/Category:Classical_music and scrape all pages within a depth of 2 (see the sketch below).
If you need a link list for this, I can generate one later today.
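To make the idea concrete, here is a minimal sketch of what I mean, using the MediaWiki categorymembers API; the function names and the depth handling are just my own illustration, not existing code:

```python
# Sketch: collect article URLs from a Wikipedia category up to a given depth.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category):
    """Yield (title, namespace) for all members of a category, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["ns"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def collect_urls(root_category, depth=2):
    """Collect article URLs from root_category and its subcategories up to `depth`."""
    urls, seen, frontier = set(), {root_category}, [(root_category, 0)]
    while frontier:
        category, level = frontier.pop()
        for title, ns in category_members(category):
            if ns == 14 and level < depth and title not in seen:  # namespace 14 = subcategory
                seen.add(title)
                frontier.append((title, level + 1))
            elif ns == 0:                                         # namespace 0 = article
                urls.add("https://en.wikipedia.org/wiki/" + title.replace(" ", "_"))
    return urls

if __name__ == "__main__":
    for url in sorted(collect_urls("Category:Classical_music", depth=2)):
        print(url)
```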
Sounds good! If you do it for us, we can concentrate on actually fetching the websites ;)
Here is a roughly filtered list of URLs. It has about 4000 entries and still contains a few off-topic ones: urls.txt
Status update + Details:
I will run the code on my machine overnight and see what happens in the morning. Right now it would be easier to hire a 10-year-old child to copy-paste the data from the wiki directly. Everything would be much nicer if the WET files contained more relevant URLs.
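For reference, the processing is roughly this (a simplified sketch, not the actual script; the paths, the output file, and the exact URL matching are placeholders):

```python
# Sketch: scan local Common Crawl WET files and keep the records whose target URI
# appears in urls.txt. Exact string matching is a simplification; in practice the
# URLs need some normalization (http vs https, mobile domain, etc.).
import glob
import json
from warcio.archiveiterator import ArchiveIterator

with open("urls.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

pages = []
for path in glob.glob("wet/*.warc.wet.gz"):
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records have type "conversion"
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri in wanted:
                text = record.content_stream().read().decode("utf-8", errors="replace")
                pages.append({"url": uri, "text": text})

with open("combined-wiki-data.json", "w") as out:
    json.dump(pages, out)

print(f"extracted {len(pages)} pages")
```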
Something bad happened around 5 AM, so I could only download and process 27 WET files and extract 33 pages in total: combined-wiki-data-from-27-WETs.zip
Update: after some optimization the processing runs much faster. Here are 176 pages: combined-wiki-data-from-153-WETs.zip
@Henni does this work for you? Do you need more?
@nyxathid the "~50% success rate" is caused by the Common Crawl API and not by my URL list, right?
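One way to check would be to look a sample of the URLs up in the Common Crawl index directly; anything without a capture there cannot show up in any WET file. A rough sketch (the crawl ID is my assumption, swap in the one you are actually processing):

```python
# Sketch: check how many URLs from urls.txt the Common Crawl index actually knows about.
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2017-09-index"  # assumed crawl ID

def in_index(url):
    """True if the Common Crawl CDX index has at least one capture of this URL."""
    resp = requests.get(INDEX, params={"url": url, "output": "json"})
    return resp.status_code == 200  # the index answers 404 when there are no captures

with open("urls.txt") as f:
    sample = [line.strip() for line in f if line.strip()][:100]  # just a sample, it's slow

hits = sum(in_index(u) for u in sample)
print(f"{hits}/{len(sample)} sampled URLs have captures in the index")
```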
About 1 relevant page per WET file. WHY??!
This sounds like bad luck.
The output looks good to me. We'll run that through our algorithms and see what it brings up. I'll also discuss the format with my team and might give you some feedback soon.
But this issue can be closed. Any follow-up will happen in new (or other) issues.
@Henni Final update: 765 pages combined-wiki-data-from-639-WETs.zip
Looks like Wikipedia pages are almost evenly distributed among the WET files. Now the issue is really closed :D I will concentrate on Azure next, so please try to work with this data for a while. Once Azure is up, you can get fresh data directly from there.
It would be great if you could give us some example data by "simply" scraping some Wikipedia pages.
This data could be used by @MusicConnectionMachine/group-3 @MusicConnectionMachine/group-4 and helps us to expand the example data for the visualization groups (see https://github.com/MusicConnectionMachine/RelationshipsG3/issues/21)
Is this doable by Monday, so that we have a first functioning dataset?