Open chrismattmann opened 9 years ago
Hrm, I would add a label, but it won't let me. Sorry about that.
I like the general concept of this session: not only producing code, but producing a large, thematic dataset as well. Do you have a feel for whether it would be mainly data crawling and downloading or mainly data formatting and organization? Also, do you have a list of target datasets that need to be incorporated and linked (beyond just the technologies used)?
I think it would be both crawling and formatting/organization, @lawinslow. Thanks for the comments. As for the datasets, yes, we would get some sub-crawl of ACADIS (~3k datasets), AMD (similar order) and ADE (~20k datasets): atmospheric/polar, surface reflectance, ice sheet mass balance, and potentially other parameters that would lend themselves to nice visualizations around the workshop in November 2014:
http://nsf.gov/awardsearch/showAward?AWD_ID=1445624&HistoricalAwards=false
The overarching aim of this session is to further enable and clarify data access for the above community at large. Polar data is currently hosted and served by a number of agencies and archives, e.g., NASA’s Global Change Master Directory [0], NSF’s ACADIS Gateway [1], data at the NSIDC, etc. Because different entities maintain individual storage hubs, there are barriers to accessing and searching efficiently across or between datasets. This is due to inconsistency in the structure and quality of the underlying datasets themselves and within the systems that expose the data. Some of the above systems do not permit query models that allow time-series-like queries, which increases the difficulty of grouping similar data around certain points in time, places, or events. This workshop should provide an opportunity to
To this end we propose to
[0] http://gcmd.gsfc.nasa.gov/
[1] https://www.aoncadis.org/home.htm
[3] http://tika.apache.org
[4] http://www.iana.org/assignments/media-types/media-types.xhtml
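To make the "time-series-like query" point above concrete, here is a minimal sketch of the kind of temporal-overlap filter that is hard to run across these archives today. The record layout, the `overlaps` helper, and the ids and dates are all invented placeholders, not real catalogue entries or any archive's actual API.

```python
from datetime import date


def overlaps(record, start, end):
    """Return True if the record's coverage intersects [start, end]."""
    return record["start"] <= end and record["end"] >= start


# Placeholder records; ids and dates are invented for illustration.
records = [
    {"id": "dataset-a", "start": date(2001, 1, 1), "end": date(2005, 12, 31)},
    {"id": "dataset-b", "start": date(2010, 6, 1), "end": date(2012, 6, 1)},
    {"id": "dataset-c", "start": date(2004, 3, 1), "end": date(2011, 1, 1)},
]

# Ask for everything whose coverage touches 2005 through 2009.
window_start, window_end = date(2005, 1, 1), date(2009, 12, 31)
matches = [r["id"] for r in records if overlaps(r, window_start, window_end)]
```

Once metadata from the different archives is normalized into one record shape, a filter like this can group datasets around a point in time regardless of where they are hosted.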
Good stuff coming together here! Can't wait to be a virtual participant!
This sounds like fun! I don't have experience with the above tools, but I am familiar with tools like BeautifulSoup for Python, as well as numerous tools for spatial data extraction/transformation.
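For anyone else coming from the Python side, here is a minimal sketch of the link-extraction step of a crawl. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs without extra installs. The sample HTML, the `LinkCollector` class, and the `extract_links` helper are illustrative assumptions and are not part of the Nutch workflow discussed in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content for illustration only.
sample_html = (
    '<a href="/acadis/search/">Search</a> '
    '<a href="http://gcmd.gsfc.nasa.gov/">GCMD</a>'
)
links = extract_links(sample_html, "http://nsidc.org/")
```

A real crawler would also need fetching, politeness delays, and deduplication, which is what Nutch handles for us.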
This sounds great folks!
I was thinking of a possible high-level agenda - thoughts?
Proposed Workshop Agenda
yes, @riverma agenda looks awesome! @jczaplew you're welcome to participate and would love to have you hacking around and about!
Also, thanks @abburgess, would love to have you as a virtual participant!
FYI, URLs for crawling:
NSF ACADIS: https://www.aoncadis.org/home.htm
NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/
@chrismattmann will get on this ASAP and begin crawling the above domains, thanks. Would be great to get datasets developed ASAP. Is it worth us storing the datasets on AWS or something? Do you have an idea about creating individual datasets for each 'domain' above, or do you want to have them lumped into one dataset?
Lewis, any way we can do this crawl tutorial style over video? I'd love to participate.
Annie
Hey @abburgess, I'm looking into it right now
Folks, there is an Etherpad for this track here: https://etherpad.mozilla.org/PolarCyberInfra. Please feel free to hack away on that. I'm currently filling out some structure.
@lewismc let's have them each in different datasets, then I would love to put it up online at e.g., an Amazon instance that we can spin up and down.
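A rough sketch of what keeping each source in its own dataset could look like once crawl output is in hand. The `split_by_domain` helper is hypothetical; the URLs are ones already posted in this thread, and a real Nutch crawl would produce segment data rather than a plain list of URLs.

```python
from collections import defaultdict
from urllib.parse import urlparse


def split_by_domain(urls):
    """Group crawled URLs by host so each source becomes its own dataset."""
    datasets = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        datasets[host].append(url)
    return dict(datasets)


# Seed URLs taken from this thread; a real crawl would yield many more.
crawled = [
    "https://www.aoncadis.org/home.htm",
    "http://nsidc.org/acadis/search/",
    "http://gcmd.gsfc.nasa.gov/",
]
datasets = split_by_domain(crawled)
```

Keying on the host keeps the per-source datasets independent, so each can be uploaded or refreshed on its own.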
ACK +1
Crawl of NSIDC Arctic Data Explorer going: http://pastebin.com/BWKwuBV6
Pastebin: http://pastebin.com/36uu3mAp
Link to presentation given at Open Science Codefest: https://docs.google.com/presentation/d/1wLF1crJrFQANGxa27e6ZkpjQ50QdtMgvZgMdyS4pmvM/edit?usp=sharing
Great work @riverma and @lewismc and thanks to everyone for participating in the session!
Some pictures of us having a great time at the codefest: https://www.flickr.com/photos/skolr/15129576171/
Organizational Page: DataCrawl
Crawl ACADIS, AMD and ADE to prepare a dataset for participants to hack on in the upcoming NSF DataViz Hackathon for Polar CyberInfrastructure in NYC in November 2014:
http://nsf.gov/awardsearch/showAward?AWD_ID=1445624&HistoricalAwards=false
Participants would use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/) and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting polar parameters for visualization experts to then hack on during a two-day NSF visualization hackathon in NYC in November. Be part of doing something real: contributing to Apache projects (earning merit and potentially becoming a committer and PMC member yourself) and also contributing to NSF and NASA goals!