OCHA-DAP / Data-Team

A place for tracking data team issues

New data partnership: Ubicity #37

Closed JavierTeran closed 10 years ago

JavierTeran commented 10 years ago

Note from David

Javier and Godfrey:

I just finished a call with Sigmund ("Sigi") Kluckner and Hermann Huber at the Ubicity project in Austria. Their tool is a set of extensions around the Elasticsearch search engine, a Solr alternative, for real-time big-data indexing. I don't think it makes sense for us to switch search engines at this point (especially when ReliefWeb and HR.info are both using Solr, and we'll want to have combined search eventually).

However, we did find a potentially interesting collaboration for 2014. The Ubicity project is already analyzing geocoded Tweets (as are many other projects, of course). They could generate data about those tweets, re-geocoded into humanitarian p-codes, that we could bring into CKAN and (eventually) the analytical web site, to combine with other data.

I thought one good possibility would be for Javier and Godfrey to come up with a list of a few hundred humanitarian keywords for them to watch for, like "water / agua", "food / alimento(s) / alimentaria", "roadblock / barricada", "violence / violencia", "(land)mine(s) / mina(s)", etc. (with Arabic equivalents for Yemen, of course), and then have the Ubicity team upload, say, a weekly dataset to our CKAN for each country, with occurrences of each keyword group at each administrative level. They could also backfill the datasets for at least the past few months, since they have saved past Tweets for analysis.
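As a rough sketch only (the field names, p-codes, keyword groups, and sample tweets below are all made up for illustration, not an agreed schema), the weekly per-country dataset could be produced along these lines:

```python
# Illustrative sketch: count keyword-group occurrences per p-code for one week.
# All names, p-codes, and sample tweets are hypothetical.
import csv
from collections import Counter

KEYWORD_GROUPS = {
    "water": ["water", "agua"],
    "food": ["food", "alimento", "alimentos", "alimentaria"],
    "roadblock": ["roadblock", "barricada"],
}

# Hypothetical input: tweets already re-geocoded to admin-level p-codes.
tweets = [
    {"pcode": "YE11", "text": "No water in the camp since Monday"},
    {"pcode": "YE11", "text": "Food distribution delayed again"},
    {"pcode": "CO05001", "text": "Barricada en la carretera principal"},
]

counts = Counter()
for tweet in tweets:
    text = tweet["text"].lower()
    for group, terms in KEYWORD_GROUPS.items():
        if any(term in text for term in terms):
            counts[(tweet["pcode"], group)] += 1

# One row per (p-code, keyword group) -- roughly the weekly CSV described above.
with open("keyword_counts_week.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pcode", "keyword_group", "tweet_count"])
    for (pcode, group), n in sorted(counts.items()):
        writer.writerow([pcode, group, n])
```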

Since only about 2% of Tweets are geocoded, we might find that there's not enough information to be useful, but a negative result is still an important result in any kind of research, and then we'll know not to invest in this kind of data gathering in 2015. In the meantime, this seems like a relatively low-overhead collaboration for us: they would require no integration with the dev team, and only occasional back-and-forth with the data team. The hardest part would be developing that list of humanitarian keywords — perhaps that already exists, or perhaps they could even help with that by analyzing the existing tweets for patterns. That list would be reusable for other types of analysis.

Do you agree with an approach like this, Javier and Godfrey?

Notes from Luis:

A few notes here:

ReliefWeb runs on Solr (I think), but their newly released API runs on Elasticsearch. My impression is that they like Elasticsearch so much that they are considering switching over -- when appropriate, of course.

On creating a list of terms, I would suggest using CrisisLex. CrisisLex is a lexicon developed by the bright researchers at QCRI for collecting and analyzing 'microblogging data', that is, Twitter data. It currently covers only English, but reportedly they are also doing research on Arabic and Spanish.

QCRI has been doing excellent research on using Twitter data during crises and humanitarian emergencies. In fact, I think they are among the best researchers in that field. ChaTo is one of their researchers and has written some very interesting pieces. They've also developed the Artificial Intelligence for Disaster Response (AIDR) open-source application, which does exactly that: it collects Twitter data from crises around the world. They've assembled some very interesting data already and are working on an API for exporting that data to the public.

If I may, I would suggest that Ubicity work with QCRI in that area -- or simply piggyback on their research (using CrisisLex or AIDR). It would be great to have some Twitter data in the HDX Repo!

takavarasha commented 10 years ago

The humanitarian keywords could be retrieved automatically from the tag list exposed by the HDX Repository CKAN instance's API.
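A minimal sketch of what that could look like against CKAN's action API (tag_list); the instance URL here is just a placeholder, not the real HDX endpoint:

```python
# Rough sketch: pull the existing tag list from a CKAN instance via the action API.
import json
import urllib.request

CKAN_URL = "https://example-ckan-instance.org"  # placeholder for the HDX Repo CKAN URL

with urllib.request.urlopen(CKAN_URL + "/api/3/action/tag_list") as resp:
    payload = json.load(resp)

if payload.get("success"):
    tags = payload["result"]  # list of tag name strings
    print(len(tags), "tags, e.g.:", tags[:10])
```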

Regards, Godfrey Takavarasha

JavierTeran commented 10 years ago

Guys, I'll take it out of this list for now.