NSF-Polar-Cyberinfrastructure / datavis-hackathon

http://nsf-polar-cyberinfrastructure.github.io/datavis-hackathon

ETL 101 - Bringing Polar data analytics one step closer to the Polar Scientist #14

Open lewismc opened 10 years ago

lewismc commented 10 years ago

Whilst in deep conversation with a fellow ETL guru (@MBoustani), I had a brainwave and I am now fleshing it out here.

I would like this session to focus on

Personally, I would really like to explore whether a Pig or Hive module for Gora would be valuable here. To determine this, however, I need to learn more about the data analysis tasks we gather in 1 above.

curtislisle commented 10 years ago

I notice that Gora supports persistence to MongoDB as an option. This could serve as a bridge between analytics that use Hadoop/Pig/Hive and analytics or visualizations that are not based on the Hadoop et al. architecture.

lewismc commented 10 years ago

Hi @curtislisle, certainly. This is exactly the type of thing we can focus on. I would imagine we could design and implement an ETL process that would map one dataset to many datastores. We can then augment the analytic tools which have access to the data regardless of its location. Thanks for the feedback.
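As a rough illustration of the "one dataset to many datastores" idea, here is a minimal Python sketch. It does not use Gora or any real datastore client: the names `transform`, `run_etl`, `document_store` and `column_store` are hypothetical, the two stores are plain in-memory lists, and the sample records are made up.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def transform(raw: Record) -> Record:
    """Normalize one raw record: lowercase the keys, strip string values."""
    return {
        key.lower(): (value.strip() if isinstance(value, str) else value)
        for key, value in raw.items()
    }


def run_etl(source: Iterable[Record], sinks: Dict[str, Callable[[Record], None]]) -> int:
    """Extract each record once, transform it once, load it into every sink."""
    loaded = 0
    for raw in source:
        record = transform(raw)
        for write in sinks.values():
            write(record)
        loaded += 1
    return loaded


# Assumption: two in-memory lists stand in for real datastores.
document_store: List[Record] = []
column_store: List[Record] = []

# Made-up sample records, for illustration only.
sample_source = [
    {"Title": " Sample record A ", "Region": "Arctic"},
    {"Title": "Sample record B", "Region": "Antarctic"},
]

loaded = run_etl(
    sample_source,
    {"document": document_store.append, "column": column_store.append},
)
```

Each sink is just a callable that accepts a record, so adding another datastore in this sketch means adding another entry to the `sinks` dictionary.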

allenpope commented 10 years ago

As a goal - and I'm really not sure if this is the sort of data/analytics you're thinking about (I know Landsat, not these acronyms), but it could be to have a map & regional wordcloud of available polar data and/or publications...

lewismc commented 10 years ago

Yes we could certainly work towards this no doubt. Out of curiosity, what value do you think map/regional word clouds would bring? Personally, I think they would be very cool to present at the end of the workshop as a point of reflection. Any ideas?

On Fri, Oct 17, 2014 at 2:55 PM, Allen Pope notifications@github.com wrote:

As a goal - and I'm really not sure if this is the sort of data/analytics you're thinking about (I know Landsat, not these acronyms), but it could be to have a map & regional wordcloud of available polar data and/or publications...


Lewis

allenpope commented 10 years ago

A map / word clouds could be a good way to investigate what sorts of science have been done in a region - for example finding out that I'm interested in a glacier in a region that only has limnologists working there. Or to find what other datasets/published research are available in the region. So for data and research discovery this sort of tool could be useful, I think. Also great if you're getting into a new study region and want to get the lay of the land.

chrismattmann commented 10 years ago

Word cloud sounds like a great idea, @allenpope. It's also something that Apache Tika can really excel at. See: http://baron.pagemewhen.com/item/84/

allenpope commented 10 years ago

@chrismattmann - yes! But is there something to build word clouds interactively (e.g. trace a polygon on the map) rather than dumping into Wordle, etc.?

chrismattmann commented 10 years ago

Hey @allenpope not directly in Tika, but we could develop something at the workshop that combines Tika and datavis interactively - that would be awesome! I smell another session (with @allenpope as the proposer)? +1

smskiles commented 10 years ago

"A map / word clouds could be a good way to investigate what sorts of science have been done in a region" This is such a great and useful idea, not only for getting into a new study region, or for seeing whether science has been done or data is available in your current study region that you aren't aware of. If done in an accessible way, it would also be great for casual/citizen scientists and students who are interested in the information, but either don't know how/where to conduct literature searches or find them daunting.

chrismattmann commented 10 years ago

OK this sounds like we are centering on some concrete goals for this session (or sub-goals at least if @lewismc agrees as the lead proposer):

  1. get a source of regional polar data
  2. extract with Tika
     2a. Develop Wordie service that leverages Tika (JAX-RS)
  3. dump into Gora/MongoDB
  4. throw Tangelo and GISCube at this, create a map
  5. win

Sound good? Thanks @smskiles and @allenpope and @curtislisle
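To make the word-counting part of step 2a concrete, here is a minimal Python sketch. It assumes the text has already been extracted (for example by Tika) and arrives as plain strings; no Tika, Gora or MongoDB calls are made. The helper names `tokenize` and `word_counts`, the tiny stopword list and the sample strings are all hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: a very small illustrative stopword list, not a complete one.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def tokenize(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def word_counts(texts: Iterable[str]) -> Counter:
    """Aggregate token counts over all extracted documents."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Made-up strings standing in for already-extracted document text.
extracted = [
    "Mass balance of the glacier",
    "Glacier velocity and the ice shelf",
]

counts = word_counts(extracted)
top_word, top_count = counts.most_common(1)[0]
```

The resulting `Counter` is the kind of word-to-weight mapping a word cloud or ranked list could be drawn from, and it serializes naturally as a document for whatever store is used in step 3.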

allenpope commented 10 years ago

@chrismattmann - I think that sounds great, as long as some of it can be done on-the-fly by the user (e.g. select region/map area and a word cloud appears), as opposed to having a static product? What do you think would be most useful @smskiles?
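One way the "select a region and get the words for it" interaction could work is sketched below in Python. This is only an assumption about the approach: it uses a planar ray-casting point-in-polygon test, which ignores map projection, the antimeridian and the poles themselves, so real polar data would likely need a proper geospatial library. The record layout (`lon`, `lat`, `keywords`), the function names and the sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(lon: float, lat: float, polygon: Sequence[Point]) -> bool:
    """Planar ray casting: count polygon edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def keywords_in_region(records: List[Dict], polygon: Sequence[Point]) -> Counter:
    """Count keywords of all records whose location falls inside the polygon."""
    counts: Counter = Counter()
    for rec in records:
        if point_in_polygon(rec["lon"], rec["lat"], polygon):
            counts.update(rec["keywords"])
    return counts


# Made-up records and a made-up square region, for illustration only.
records = [
    {"lon": 5.0, "lat": 5.0, "keywords": ["glacier", "albedo"]},
    {"lon": 6.0, "lat": 4.0, "keywords": ["glacier"]},
    {"lon": 15.0, "lat": 5.0, "keywords": ["limnology"]},
]
region = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

region_counts = keywords_in_region(records, region)
```

In an interactive tool, the polygon would come from whatever the user traces on the map, and the counts would be recomputed on each selection.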

chrismattmann commented 10 years ago

Yep, agree @allenpope. We'll get there; it may start out as static though. I am going to do some pre-hacking this week.

allenpope commented 10 years ago

Makes sense - and awesome!

allenpope commented 10 years ago

@chrismattmann @smskiles One of the developers here at NSIDC made the good point that Wordles / word clouds often aren't the most useful visualization because they don't really let you read the small things and they distort the relative importance of things. Might be good to think about using something more quantitative instead / in addition, to display the relative importance of keywords, etc.
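As one possible "more quantitative" display, keyword counts can be turned into shares of the total and shown as a ranked table, so small items stay readable and relative weights are explicit. A minimal Python sketch follows; the function name `keyword_shares` and the counts are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def keyword_shares(counts: Counter) -> List[Tuple[str, float]]:
    """Return (keyword, share of all keyword occurrences), largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, n / total) for word, n in counts.most_common()]


# Made-up counts, for illustration only.
counts = Counter({"glacier": 6, "albedo": 3, "limnology": 1})
shares = keyword_shares(counts)

# Print a simple ranked table: keyword, then percentage of the total.
for word, share in shares:
    print(f"{word:<12}{share:6.1%}")
```

The same shares could drive a bar chart instead of font sizes, which avoids the distortion that comes from comparing word areas by eye.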

chrismattmann commented 10 years ago

Thanks @allenpope good points. We can start simple with Wordies/clouds, and then move to something more quantitative. I'll do some research on this.

smskiles commented 10 years ago

@allenpope I agree, a word cloud would be interesting to see, but might not be the most useful. It's not as exciting, but a ranked list might be more useful? (e.g. for a literature search, ranking by most recent or most cited). I think scale might be an interesting issue with this, i.e., what is included/excluded based upon the size of your region of interest.
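The ranked-list idea could look roughly like the Python sketch below, which sorts publications either by year or by citation count. The `Publication` fields, the `rank` helper and the three entries are hypothetical placeholders, not real papers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Publication:
    title: str
    year: int
    citations: int


def rank(pubs: List[Publication], by: str = "citations") -> List[Publication]:
    """Return publications sorted in descending order of the chosen field."""
    if by not in ("citations", "year"):
        raise ValueError(f"unsupported ranking key: {by!r}")
    return sorted(pubs, key=lambda p: getattr(p, by), reverse=True)


# Placeholder entries, for illustration only.
pubs = [
    Publication("Paper A", 2009, 120),
    Publication("Paper B", 2014, 8),
    Publication("Paper C", 2012, 45),
]

most_cited = [p.title for p in rank(pubs, by="citations")]
most_recent = [p.title for p in rank(pubs, by="year")]
```

The scale question raised above would apply before ranking: which publications end up in `pubs` depends on how large the selected region is.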