NSF-Polar-Cyberinfrastructure / datavis-hackathon

http://nsf-polar-cyberinfrastructure.github.io/datavis-hackathon

Crawl and prepare NSF ACADIS, NASA AMD and NSIDC Arctic Data Explorer datasets Part 2 #1

Open chrismattmann opened 10 years ago

chrismattmann commented 10 years ago

Building off of https://github.com/NCEAS/open-science-codefest/issues/26, continue the data prep and crawl of AMD, ACADIS and ADE, with the goal of preparing some of the data for visualization (GeoViz, science-focused viz, etc.)

Participants will use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/) and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting polar parameters for visualization experts to hack on during a 2-day NSF visualization hackathon in NYC in November. Be part of doing something real: contribute to Apache projects (earning merit, and potentially becoming a committer and PMC member yourself) while also contributing to NSF and NASA goals!

snowangelwmy commented 10 years ago

Recent progress of Angela:

(1) [Done] Use Apache Nutch and Solr to crawl and index local data files.
(2) [Done] Index content metadata and parse metadata from Apache Nutch into Solr.
(3) [Done] Integrate the Apache OODT File Manager with Apache Solr using RADiX.
(4) [Doing] Crawl the ACADIS website using Apache Nutch and Solr.

chrismattmann commented 10 years ago

Thanks @snowangelwmy! Please contact @pzimdars to get your ACADIS Nutch crawler deployed on AWS, OK?

snowangelwmy commented 10 years ago

Ok, when I am done, I will contact @pzimdars. Thanks.

hemantku commented 10 years ago

Progress of Vineet:

  1. Developed a GRIB parser; progress on the feature is tracked at https://issues.apache.org/jira/browse/TIKA-1423
  2. An initial patch has been published at https://reviews.apache.org/r/27414/. I am working on the suggestions raised by the reviewers.
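
Once that patch lands, extracting GRIB metadata should work the same way as the HDF example further down in this thread (the file name here is hypothetical):

tika -m somefile.grb   # print GRIB metadata via the tika alias defined below
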
chrismattmann commented 10 years ago

Beginning by downloading Nutch.
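
A minimal sketch of that step, assuming the 1.9 binary release that was current at the time (adjust the version as needed):

# download and unpack the Nutch binary release
curl -O https://archive.apache.org/dist/nutch/1.9/apache-nutch-1.9-bin.tar.gz
tar -xzf apache-nutch-1.9-bin.tar.gz
cd apache-nutch-1.9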

chrismattmann commented 10 years ago

NASA AMD: http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2

NSF ACADIS: https://www.aoncadis.org/home.htm

NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/

lewismc commented 10 years ago

Hi @chrismattmann, the regex-urlfilter.txt can be found here:

https://www.dropbox.com/s/hl6wlvwbr4xrv81/regex-urlfilter.txt?dl=0
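
For anyone who cannot reach the Dropbox link: the stock Nutch regex-urlfilter.txt follows this shape, and the hackathon copy presumably adjusts it (e.g. to admit the ftp:// NSIDC URLs used later in this thread):

# skip file: and mailto: urls (relax this if you need ftp://)
-^(file|mailto):
# skip image and other binary suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$
# accept anything else
+.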

chrismattmann commented 10 years ago

Update these properties in conf/nutch-default.xml:

http.agent.name = NSF DataViz Hackathon Crawler
http.agent.email = mattmann@usc.edu
http.agent.host = localhost
http.content.limit = -1
plugin.includes: remove indexer-solr for now (we will add it back later in this thread)
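
In XML, each of those properties takes the usual Nutch form; a sketch of two of them (values as above):

<property>
  <name>http.agent.name</name>
  <value>NSF DataViz Hackathon Crawler</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- -1 disables truncation of fetched content -->
  <value>-1</value>
</property>
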
chrismattmann commented 10 years ago
./bin/crawl urls/ crawl http://localhost 3
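
For reference, the Nutch 1.x crawl script takes, in order: the seed directory, the crawl directory, the Solr URL, and the number of rounds. A commented sketch of the same command:

# bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
./bin/crawl urls/ crawl http://localhost 3   # 3 generate/fetch/parse/update rounds
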
chrismattmann commented 10 years ago

Please make sure your JAVA_HOME environment variable is set.

chrismattmann commented 10 years ago

export JAVA_HOME=/usr

chrismattmann commented 10 years ago

echo $JAVA_HOME
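
On OS X, a more robust alternative to hard-coding /usr is:

export JAVA_HOME=$(/usr/libexec/java_home)   # resolves the active JDK on OS X
java -version                                # sanity-check the JVM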

chrismattmann commented 10 years ago

Download Solr:

http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.2
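
A sketch of fetching and unpacking it from the Apache archive (file name per the 4.10.2 release):

curl -O https://archive.apache.org/dist/lucene/solr/4.10.2/solr-4.10.2.tgz
mkdir -p $HOME/tmp
tar -xzf solr-4.10.2.tgz -C $HOME/tmp
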
chrismattmann commented 10 years ago

Download the tika app from:

curl -k -O https://repository.apache.org/service/local/repo_groups/snapshots-group/content/org/apache/tika/tika-app/1.7-SNAPSHOT/tika-app-1.7-20141103.165816-465.jar
chrismattmann commented 10 years ago
mkdir -p $HOME/tmp/tika
mv tika-app-1.7-20141103.165816-465.jar $HOME/tmp/tika
alias tika="java -jar $HOME/tmp/tika/tika-app-1.7-20141103.165816-465.jar"
chrismattmann commented 10 years ago
tika -m ftp://sidads.colorado.edu/pub/DATASETS/AMSRE/TOOLS/land_mask/Sea_Ice_V003/amsr_gsfc_12n.hdf
chrismattmann commented 10 years ago
Content-Length: 547605
Content-Type: application/x-hdf
HDF4_Version: 4.1.3 (NCSA HDF Version 4.1 Release 3, May 1999)
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.hdf.HDFParser
_History: Direct read of HDF4 file through CDM library
resourceName: amsr_gsfc_12n.hdf
chrismattmann commented 10 years ago

Try:

tika --help
chrismattmann commented 10 years ago

Try this:

tika -m ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02202_v2/north/daily/2013/seaice_conc_daily_nh_f17_20130102_v02r00.nc
snowangelwmy commented 10 years ago

Hi Prof. @chrismattmann, why do I need to install tika-app? Nutch already has the parse-tika component.

lewismc commented 10 years ago

@snowangelwmy if you look at the URL @chrismattmann defined, you will see that he's referenced a SNAPSHOT. This is so we can use some of the newer features of Tika. Try it out :) Also we are hacking Tika at this hackathon so we are using the development versions for parsing .grb files.

snowangelwmy commented 10 years ago

Got it! I have crawled some ACADIS web pages ("numFound": 572). However, all files that have been indexed into my Solr are of type "application/xhtml+xml". I am wondering how to crawl files of other types, e.g., PDF, JPG? Thank you!

chrismattmann commented 10 years ago

For Solr, please find the Nutch schema here:

curl -O http://svn.apache.org/repos/asf/nutch/trunk/conf/schema.xml
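
Then drop it into the example core's conf directory (path assumes the stock Solr 4.10.2 example layout) and restart Solr so the schema is picked up:

cp schema.xml $HOME/tmp/solr-4.10.2/example/solr/collection1/conf/
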
liwwchina commented 10 years ago

Command for checking Nutch-crawled data:

./bin/nutch readseg -dump ./crawl/segments/20141103100202/ output
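
readseg -dump writes a plain-text file named dump into the output directory, which you can then inspect:

less output/dump               # one record per fetched URL
grep -c "URL::" output/dump    # rough record count (assumes the dump's URL:: prefix)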

chrismattmann commented 10 years ago

Check out this wiki page: http://wiki.apache.org/solr/SolrJetty

chrismattmann commented 10 years ago

OK, ignore that wiki page. In your $HOME/tmp/solr-4.10.2/example directory, type java -jar start.jar
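
Jetty listens on port 8983 by default, so once start.jar is running you can verify Solr is up with:

curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"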

chrismattmann commented 10 years ago

This page suggests how to fix the schema.xml issue: http://stackoverflow.com/questions/15945927/apache-nutch-and-solr-integration

chrismattmann commented 10 years ago

Please comment it out in schema.xml like so:

<!-- <filter class="solr.SnowballPorterFilterFactory"
             language="English"
             protected="protwords.txt"/> -->
chrismattmann commented 10 years ago

Also ignore it if Solr complains about an undefined field "text" (a schema fix for this appears later in this thread).

Access: http://localhost:8983/solr/

chrismattmann commented 10 years ago

First, try running ./bin/nutch solrindex with no arguments. You should get back:

Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
No IndexWriters activated - check your configuration
chrismattmann commented 10 years ago

Now we have to add indexer-solr back by including it in the plugin.includes property in conf/nutch-default.xml:

..|indexer-solr|...
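
Spelled out, the full property with indexer-solr present looks roughly like this (the value shown is the stock 1.9 default; trim to taste):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
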
chrismattmann commented 10 years ago

Now with it enabled, you get:

Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
chrismattmann commented 10 years ago

Now run this:

./bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/collection1/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20141103120137
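
Once that finishes, a quick sanity check on the indexed document count (look at response.numFound in the JSON):

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"
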
chrismattmann commented 10 years ago

Source for the Nutch index-* plugins:

http://svn.apache.org/repos/asf/nutch/trunk/src/plugin

chrismattmann commented 10 years ago

Check out this wiki page: http://webcache.googleusercontent.com/search?q=cache:m4_gv1UqiQYJ:https://wiki.apache.org/nutch/IndexMetatags+&cd=1&hl=en&ct=clnk&gl=us

chrismattmann commented 10 years ago

Solr DIH: http://wiki.apache.org/solr/DataImportHandler

chrismattmann commented 10 years ago

Hi @snowangelwmy, we installed tika-app to show how Tika works out of the box (without Nutch). We could have used Nutch's ParserChecker as well.

lewismc commented 10 years ago

ACK


snowangelwmy commented 10 years ago

Add a new field named "text" to your schema.xml file to solve the "undefined field text" error.
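
A minimal sketch of such a field definition (the type name is an assumption; use a text type your schema.xml actually defines):

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>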

anyayunli commented 9 years ago

Hi guys. I am looking into your S3 data and see that all of it has been ingested into Solr/Nutch. I am doing topic identification on the metadata; how can I access it?

chrismattmann commented 9 years ago

Hi @AranyaLi, I think what you want to do is download and install Solr, take the index data that Angela made available (the Solr dir), and point Solr at it.
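
A sketch of that last step, assuming the shared index unpacks to a standard Solr home directory (the path below is hypothetical):

cd $HOME/tmp/solr-4.10.2/example
java -Dsolr.solr.home=/path/to/angelas/solr -jar start.jar   # point Solr at the downloaded index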

narendrakadari commented 8 years ago

Hi Guys

As said above, I have configured nutch-site.xml with this:

http.agent.name = crawl
plugin.includes = protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)
fetcher.server.delay = 0.5
http.timeout = 10000
http.content.limit = 131027

Where did I make a mistake? I am using Hadoop 2.7.2, Solr 5.4.1 and Nutch 1.12. Could anyone help me out with this query?

Running command: bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/#/collections crawl/crawldb -linkdb crawl/linkdb crawl/segments/20160604193022

Indexer: starting at 2016-06-04 20:12:24
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration

Indexer: number of documents indexed, deleted, or skipped:
Indexer: 111 indexed (add/update)
Indexer: finished at 2016-06-04 20:12:28, elapsed: 00:00:04

Thanks Narendra k

chrismattmann commented 8 years ago

Hi @narendrakadari, are you crawling polar data?

narendrakadari commented 8 years ago

Hi @chrismattmann, thanks for the immediate reply.

No, I am not using polar data.

My urls/seed.txt file contains:

http://www.flipkart.com/
http://www.amazon.com/
http://www.shopalike.in/
http://www.infibeam.com/
http://www.iabse.org/

I have been facing this issue for the past 14 days; kindly help.

Thanks Narendra k