NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

Crawl and prepare NSF ACADIS, NASA AMD and NSIDC Arctic Data Explorer datasets with Apache Tika, Nutch, Solr, and OODT #26

Open chrismattmann opened 9 years ago

chrismattmann commented 9 years ago

Organizational Page: DataCrawl

Crawl ACADIS, AMD and ADE to prepare a dataset for participants to hack on in the upcoming NSF DataViz Hackathon for Polar CyberInfrastructure in NYC in November 2014:

http://nsf.gov/awardsearch/showAward?AWD_ID=1445624&HistoricalAwards=false

Participants would use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/) and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting Polar parameters for visualization experts to then hack on during a two-day NSF visualization hackathon in NYC in November. Be part of doing something real: contribute to Apache projects (earning merit and potentially becoming a committer and PMC member yourself) and to NSF and NASA goals!

chrismattmann commented 9 years ago

hrm, I would add a label, but it won't let me. Sorry about that.

lawinslow commented 9 years ago

I like the general concept of this session: not only producing code, but producing a large, thematic dataset as well. Do you have a feel for whether it would be mainly data crawling and downloading or mainly data formatting and organization? Also, do you have a list of target datasets that need to be incorporated and linked (beyond just the technologies used)?

chrismattmann commented 9 years ago

I think it would be both crawling and formatting/organization, @lawinslow. Thanks for the comments. As for the datasets, yes, we would get some sub-crawl of ACADIS (~3k datasets), AMD (similar order) and ADE (~20k datasets): atmospheric/polar, surface reflectance, ice sheet mass balance, and potentially other parameters that would lend themselves to nice visualizations around the time of the workshop in November 2014:

http://nsf.gov/awardsearch/showAward?AWD_ID=1445624&HistoricalAwards=false

lewismc commented 9 years ago

The overarching aim of this session is to further enable and clarify data access for the above community at large. Polar data is currently hosted and served by a number of agencies and archives, e.g., NASA’s Global Change Master Directory [0], NSF’s ACADIS Gateway [1], data at the NSIDC, etc. Because different entities maintain their own storage hubs, accessing and searching efficiently across or between datasets is difficult. This is due to inconsistency in the structure and quality of the underlying datasets themselves and in the systems that expose the data. Some of the above systems do not support time-series-like queries, which makes it harder to group similar data around certain points in time, places, or events. This workshop should provide an opportunity to

To this end we propose to

[0] http://gcmd.gsfc.nasa.gov/

[1] https://www.aoncadis.org/home.htm

[2] http://nsidc.org/data/

[3] http://tika.apache.org

[4] http://www.iana.org/assignments/media-types/media-types.xhtml

[5] https://svn.apache.org/repos/asf/oodt/trunk/filemgr/src/main/java/org/apache/oodt/cas/filemgr/structs/Product.java
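The kind of time-based grouping across archives described in the comment above can be sketched as follows. This is only a rough illustration in plain Python: the record fields (`source`, `title`, `start`) and the sample records are invented for the example and are not the actual schema or contents of ACADIS, AMD or ADE.

```python
from collections import defaultdict
from datetime import date

# Hypothetical metadata records. Field names, titles and dates are
# invented for illustration and do not come from any real archive.
RECORDS = [
    {"source": "ACADIS", "title": "example ice sheet mass balance", "start": date(2010, 3, 1)},
    {"source": "AMD", "title": "example surface reflectance", "start": date(2010, 7, 15)},
    {"source": "ADE", "title": "example atmospheric profile", "start": date(2012, 1, 10)},
]


def group_by_year(records):
    """Group records from any source by the year of their start date."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["start"].year].append(rec)
    return dict(groups)


def in_window(records, begin, end):
    """Return records whose start date falls inside [begin, end]."""
    return [rec for rec in records if begin <= rec["start"] <= end]


if __name__ == "__main__":
    for year, recs in sorted(group_by_year(RECORDS).items()):
        print(year, [rec["source"] for rec in recs])
```

Once records from the different archives share even one normalized date field, both helpers work across sources, which is the point of preparing a common dataset up front.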

esip-lab commented 9 years ago

Good stuff coming together here! Can't wait to be a virtual participant!

jczaplew commented 9 years ago

This sounds like fun! I don't have experience with the above tools, but I am familiar with tools like BeautifulSoup for Python, as well as numerous tools for spatial data extraction/transformation.

riverma commented 9 years ago

This sounds great folks!

I was thinking of a possible high-level agenda - thoughts?

Proposed Workshop Agenda

chrismattmann commented 9 years ago

yes, @riverma agenda looks awesome! @jczaplew you're welcome to participate and would love to have you hacking around and about!

chrismattmann commented 9 years ago

also thanks @abburgess, would love to have you join the virtual part!

chrismattmann commented 9 years ago

FYI, URLs for crawling:

NASA AMD: http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2

NSF ACADIS: https://www.aoncadis.org/home.htm

NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/

lewismc commented 9 years ago

@chrismattmann will get on this ASAP and begin crawling the above domains, thanks. Would be great to get datasets developed ASAP. Is it worth us storing the datasets on AWS or something? Do you have an idea about creating individual datasets for each 'domain' above, or do you want to have them lumped into one dataset?

esip-lab commented 9 years ago

Lewis, any way we can do this crawl tutorial style over video? I'd love to participate.

Annie



Ann Bryant Burgess, PhD


lewismc commented 9 years ago

Hey @abburgess, I'm looking into it right now

lewismc commented 9 years ago

Folks, there is an etherpad for this track here: https://etherpad.mozilla.org/PolarCyberInfra. Please feel free to hack away on that. I'm currently filling out some structure.

chrismattmann commented 9 years ago

@lewismc let's have them each in different datasets, then I would love to put it up online at e.g., an Amazon instance that we can spin up and down.
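One way to keep the three crawls separate, sketched under some assumptions: Nutch reads its seeds from plain-text files with one URL per line, so each source can get its own seed list and its own crawl. The short labels (`amd`, `acadis`, `ade`) and the `<label>/urls/seed.txt` layout are illustrative choices, not something agreed in this thread; the URLs are the ones posted above.

```python
from urllib.parse import urlparse

# Seed URLs as posted earlier in this thread, keyed by a short source label.
# The labels and the file layout are assumptions made for this sketch.
SEEDS = {
    "amd": [
        "http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do"
        "?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2"
    ],
    "acadis": ["https://www.aoncadis.org/home.htm"],
    "ade": ["http://nsidc.org/acadis/search/"],
}


def seed_files(seeds):
    """Return {relative_path: file_text}, one seed list per source."""
    files = {}
    for label, urls in seeds.items():
        for url in urls:
            parts = urlparse(url)
            # Reject anything that is not an absolute http(s) URL.
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError(f"not a crawlable URL: {url}")
        files[f"{label}/urls/seed.txt"] = "\n".join(urls) + "\n"
    return files


def hosts(seeds):
    """Return the distinct hosts per source, e.g. to scope each crawl."""
    return {
        label: sorted({urlparse(url).netloc for url in urls})
        for label, urls in seeds.items()
    }


if __name__ == "__main__":
    for path, text in seed_files(SEEDS).items():
        print(path, "->", text.strip())
```

Keeping one seed list per source means each crawl can be run, re-run and published independently, which fits the idea of spinning a hosted copy up and down.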

lewismc commented 9 years ago

ACK +1

chrismattmann commented 9 years ago

Crawl of NSIDC Arctic Data Explorer going: http://pastebin.com/BWKwuBV6

chrismattmann commented 9 years ago

Pastebin: http://pastebin.com/36uu3mAp

riverma commented 9 years ago

Link to presentation given at Open Science Codefest: https://docs.google.com/presentation/d/1wLF1crJrFQANGxa27e6ZkpjQ50QdtMgvZgMdyS4pmvM/edit?usp=sharing

chrismattmann commented 9 years ago

Great work @riverma and @lewismc and thanks to everyone for participating in the session!

chrismattmann commented 9 years ago

Some pictures of us having a great time at the codefest: https://www.flickr.com/photos/skolr/15129576171/