Japan-Digital-Archives / Japan-Digital-Archive

Digital Archive of Japan's 2011 Disasters

Content :: Import JapanFocus Data #712

Open ebensing opened 10 years ago

ebensing commented 10 years ago

The API is ready.

Requests need to be sent here:

http://japanfocus.org/site/list_xml

We accept the following parameters:

- 'start' - start number
- 'limit' - request limit
- 'start_id' - id greater than
- 'end_id' - id smaller than
- 'publication_date_start' - start publication date
- 'publication_date_end' - end publication date

The parameters can be sent via POST or GET.

The system response is delivered in XML format.
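For reference, here is a minimal sketch of a request from Python using the requests library. The parameter values are only an illustration, and nothing is assumed about the XML layout; the raw response is just printed:

```python
# Minimal sketch: fetch one page of article metadata from the list_xml endpoint.
# Parameter names come from the list above; the values here are only an example.
import requests

params = {
    "start_id": 1500,   # only ids greater than this
    "end_id": 1530,     # only ids smaller than this
}
resp = requests.get("http://japanfocus.org/site/list_xml", params=params)
resp.encoding = "utf-8"      # the feed contains Japanese text
print(resp.text[:500])       # peek at the start of the XML payload
```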

corinnecurcie commented 9 years ago

Okay, so you'll probably have to write a script in Python to do this. I like to use the Python library scrapy. You can see an example of a scrapy script I wrote here. If you look at the JSON string that it creates, in the context of the Adding Data to the Archive page on our wiki (very last section), it will hopefully make a bit of sense. Google scrapy for the full documentation; it's pretty accessible. The basic loop is: pull data out of a page, then go to the next page, and repeat.

So you use the URL from EJ's original comment, plus parameters, to get a list of specific articles. For example, http://japanfocus.org/site/list_xml?start_id=0&end_id=4070 should give you all the articles with ids 0 to 4070, but unfortunately it doesn't show all of them. If you install scrapy, open a shell and run scrapy shell "http://japanfocus.org/site/list_xml?start_id=0&end_id=4070" (you need quotes around the URL because of the symbols), and then type len(sel.xpath('//element')), you can see that it only returns 30 items at a time. Servers often impose limits like this because they don't want to get stuck returning 10,000 articles just because that's what you asked for. So your script will have to "crawl," since we can't get all the items from one request.
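The same check works outside the scrapy shell too. Here is a rough equivalent in plain Python, assuming (as in the shell example) that each item comes back as an <element> node:

```python
# Rough equivalent of the scrapy shell session above: count how many
# <element> nodes one list_xml request actually returns.
import requests
from scrapy.selector import Selector

url = "http://japanfocus.org/site/list_xml?start_id=0&end_id=4070"
resp = requests.get(url)
sel = Selector(text=resp.text, type="xml")
print(len(sel.xpath("//element")))   # caps out at 30 per response
```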

It looks like the IDs only go up to ~4200, since http://japanfocus.org/site/list_xml?start_id=4300&end_id=500000 returns no articles. And there don't seem to be any articles with IDs below 1500: http://japanfocus.org/site/list_xml?start_id=0&end_id=1500 just returns general info about JapanFocus.

Crawling is what my script does on the very last line of xml_spider.py in the first link I gave you: yield Request(nextFileLink, callback = self.parse). I built my nextFileLink with resumption-token logic specific to that particular API, but for yours, you can probably just build the next link by increasing both start_id and end_id in the URL by 30. So you can start with http://japanfocus.org/site/list_xml?start_id=1500&end_id=1530 (or 1530 and 1560, since it doesn't look like there's anything between 1500 and 1530), increase both by 30 each time, and terminate when you're out of items. A rough spider sketch is below.
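Something like this spider skeleton would do that windowed crawl. Treat it as a sketch: the <element> tag and the field names inside parse() are placeholders, and the starting window and the ~4300 cutoff are just the numbers from the comments above:

```python
# Sketch of a windowed crawl over list_xml: request 30-id slices and slide the
# window forward until it passes the point where ids seem to run out (~4200).
import scrapy
from scrapy import Request


class JapanFocusSpider(scrapy.Spider):
    name = "japanfocus"
    step = 30

    def build_url(self, start_id, end_id):
        return ("http://japanfocus.org/site/list_xml"
                "?start_id=%d&end_id=%d" % (start_id, end_id))

    def start_requests(self):
        # starting window; there doesn't seem to be anything below ~1530
        yield Request(self.build_url(1530, 1560), callback=self.parse,
                      meta={"start_id": 1530, "end_id": 1560})

    def parse(self, response):
        for item in response.xpath("//element"):
            yield {
                # placeholder field names -- swap in the real tag names from the feed
                "id": item.xpath("id/text()").get(),
                "title": item.xpath("title/text()").get(),
            }
        # slide the 30-id window forward; ids top out around ~4200, so stop
        # once the window passes that point
        nxt_start = response.meta["start_id"] + self.step
        nxt_end = response.meta["end_id"] + self.step
        if nxt_start < 4300:
            yield Request(self.build_url(nxt_start, nxt_end),
                          callback=self.parse,
                          meta={"start_id": nxt_start, "end_id": nxt_end})
```

If you save that as a standalone file you can run it with scrapy runspider japanfocus_spider.py -o articles.json to dump whatever it yields into a JSON file.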

When you right-click and inspect element, you will see the HTML tags that mark up each piece of the article, and those tags are what you'll want to target in scrapy, so that the right pieces of data end up in the right places in your final JSON string. Also, make sure your browser is displaying the text in UTF-8 encoding so the Japanese characters show up correctly.
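As a rough illustration of that last step, here is a sketch with a made-up article URL and placeholder selectors; the real tags are whatever inspect-element shows you:

```python
# Sketch of turning one article page into a JSON record. The URL, selectors,
# and field names are placeholders -- replace them with what inspect-element
# shows for a real JapanFocus article.
import json
import requests
from scrapy.selector import Selector

resp = requests.get("http://japanfocus.org/some-article-url")   # placeholder URL
resp.encoding = "utf-8"                                         # keep Japanese text intact
sel = Selector(text=resp.text)

record = {
    "title": sel.xpath("//h1/text()").get(),                                   # placeholder
    "body": " ".join(sel.xpath("//div[@class='article']//text()").getall()),   # placeholder
}

# ensure_ascii=False keeps Japanese characters readable in the output JSON
with open("article.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)
```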