chrismattmann opened this issue 10 years ago
Recent progress of Angela:
(1) [Done] Use Apache Nutch and Solr to crawl and index local data files.
(2) [Done] Index content metadata and parse metadata from Apache Nutch into Solr.
(3) [Done] Integrate the Apache OODT File Manager with Apache Solr using RADiX.
(4) [Doing] Crawl the ACADIS website using Apache Nutch and Solr.
Thanks @snowangelwmy please contact @pzimdars to get your ACADIS Nutch crawler deployed on AWS, ok?
Ok, when I am done, I will contact @pzimdars. Thanks.
Progress of Vineet: beginning by downloading Nutch.
NSF ACADIS: https://www.aoncadis.org/home.htm
NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/
Hi @chrismattmann, the regex-urlfilter.txt can be found here:
https://www.dropbox.com/s/hl6wlvwbr4xrv81/regex-urlfilter.txt?dl=0
Update properties in conf/nutch-default.xml:
http.agent.name=NSF DataViz Hackathon Crawler
http.agent.email=mattmann@usc.edu
http.agent.host=localhost
http.content.limit=-1
plugin.includes: remove indexer-solr for now (we will re-enable it later when indexing to Solr)
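For reference, a sketch of how the agent settings above look as XML properties inside the file's <configuration> element (property names are real; values are the ones given above):
<property>
  <name>http.agent.name</name>
  <value>NSF DataViz Hackathon Crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>mattmann@usc.edu</value>
</property>
<property>
  <name>http.agent.host</name>
  <value>localhost</value>
</property>
<!-- -1 lifts the default 64 KB truncation so large files are fetched whole -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>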
# Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
./bin/crawl urls/ crawl http://localhost 3
Please make sure your JAVA_HOME environment variable is set.
export JAVA_HOME=/usr
echo $JAVA_HOME
Download Solr:
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.2
Download the Tika app:
curl -k -O https://repository.apache.org/service/local/repo_groups/snapshots-group/content/org/apache/tika/tika-app/1.7-SNAPSHOT/tika-app-1.7-20141103.165816-465.jar
mkdir $HOME/tmp/tika
mv tika-app-1.7-20141103.165816-465.jar $HOME/tmp/tika
alias tika="java -jar $HOME/tmp/tika/tika-app-1.7-20141103.165816-465.jar"
tika -m ftp://sidads.colorado.edu/pub/DATASETS/AMSRE/TOOLS/land_mask/Sea_Ice_V003/amsr_gsfc_12n.hdf
Content-Length: 547605
Content-Type: application/x-hdf
HDF4_Version: 4.1.3 (NCSA HDF Version 4.1 Release 3, May 1999)
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.hdf.HDFParser
_History: Direct read of HDF4 file through CDM library
resourceName: amsr_gsfc_12n.hdf
Try:
tika --help
Try this:
tika -m ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02202_v2/north/daily/2013/seaice_conc_daily_nh_f17_20130102_v02r00.nc
Hi Prof @chrismattmann, why do I need to install tika-app? Nutch already has the parse-tika component.
@snowangelwmy if you look at the URL @chrismattmann defined, you will see that he's referenced a SNAPSHOT. This is so we can use some of the newer features of Tika. Try it out :) Also we are hacking Tika at this hackathon so we are using the development versions for parsing .grb files.
Got it! I have crawled some ACADIS web pages ("numFound": 572). However, all the files indexed into my Solr are of type "application/xhtml+xml". I am wondering how to crawl files of other types, e.g., PDF, JPG? Thank you!
For Solr, please find the Nutch schema here:
curl -O http://svn.apache.org/repos/asf/nutch/trunk/conf/schema.xml
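Once downloaded, the usual step (a sketch, assuming the Solr 4.10.2 example layout from above and the default collection1 core) is to drop Nutch's schema into the example core before starting Solr:
cp schema.xml $HOME/tmp/solr-4.10.2/example/solr/collection1/conf/schema.xml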
Command for checking Nutch-crawled data:
./bin/nutch readseg -dump ./crawl/segments/20141103100202/ output
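The dump comes out as plain text under the output directory you named, so you can eyeball what was fetched (file name assumed from readseg's default behavior):
less output/dump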
Check out this wiki page: http://wiki.apache.org/solr/SolrJetty
OK, ignore that wiki page. In your $HOME/tmp/solr-4.10.2/example directory, type java -jar start.jar
This page suggests how to fix the schema.xml issue: http://stackoverflow.com/questions/15945927/apache-nutch-and-solr-integration
Please comment out like so in schema.xml:
<!-- <filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>-->
Also, ignore the error if it says "undefined field text" (a fix for that comes later in this thread).
Access: http://localhost:8983/solr/
First, try running ./bin/nutch solrindex. You should get back:
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
No IndexWriters activated - check your configuration
Now we have to add indexer-solr back by including it in the plugin.includes property in conf/nutch-default.xml:
..|indexer-solr|...
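For reference, a sketch of the edited property; the exact plugin list varies by Nutch version, so keep your existing value and just splice indexer-solr in:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>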
Now with it enabled, you get:
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Now run this:
./bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/collection1/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20141103120137
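To sanity-check that documents actually landed, you can query Solr directly (core name collection1 assumed from the URL above); numFound in the response should be non-zero:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"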
Nutch index-* plugins:
Hi @snowangelwmy, we installed tika-app to show how Tika works out of the box (without Nutch). We could have used Nutch's ParserChecker tool as well.
ACK
Lewis
Add a new field named "text" to your schema.xml file to solve the "undefined field text" error.
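A minimal sketch of such a field definition, assuming the stock text_general field type from the Solr example schema:
<field name="text" type="text_general" stored="true" indexed="true" multiValued="true"/>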
Hi guys. I am looking into your S3 data and see that all of it has been ingested into Solr/Nutch. I am doing topic identification on the metadata; how can I access it?
Hi @AranyaLi, I think what you want to do is download and install Solr, then take the index data that Angela made available (the Solr dir) and point Solr at that.
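One way to do that, as a sketch (the source path is hypothetical, and this assumes the shared artifact is a Solr data/ directory for the example collection1 core): stop Solr, swap in the data, and restart:
# hypothetical path to the shared index data
cp -r /path/to/shared-solr-data $HOME/tmp/solr-4.10.2/example/solr/collection1/data
cd $HOME/tmp/solr-4.10.2/example
java -jar start.jar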
Hi guys,
As said above, I have configured nutch-site.xml with this.
Where did I make a mistake? I am using Hadoop 2.7.2, Solr 5.4.1 and Nutch 1.12. Could anyone help me out with this query?
Running command: bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/#/collections crawl/crawldb -linkdb crawl/linkdb crawl/segments/20160604193022
Indexer: starting at 2016-06-04 20:12:24
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration

Indexer: number of documents indexed, deleted, or skipped:
Indexer: 111 indexed (add/update)
Indexer: finished at 2016-06-04 20:12:28, elapsed: 00:00:04
Thanks Narendra k
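As with the same error earlier in this thread, "No IndexWriters activated" usually means indexer-solr is missing from the effective plugin.includes; a quick check on a stock 1.x layout (paths assumed):
grep -n "indexer-solr" conf/nutch-site.xml conf/nutch-default.xml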
Hi @narendrakadari, are you crawling polar data?
Hi @chrismattmann, thanks for the immediate reply.
No, I am not using polar data.
My urls/seed.txt file contains:
http://www.flipkart.com/
http://www.amazon.com/
http://www.shopalike.in/
http://www.infibeam.com/
http://www.iabse.org/
I have been facing this issue for the past 14 days; kindly help.
Thanks Narendra k
Building off of https://github.com/NCEAS/open-science-codefest/issues/26, continue data prep and crawling of AMD, ACADIS, and ADE, with the goal of preparing some of the data for visualization (GeoViz, science-focused viz, etc.).
Participants would use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/) and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting polar parameters for visualization experts to hack on during a two-day NSF visualization hackathon in NYC in November. Be part of doing something real: contribute to Apache projects (earning merit, and potentially becoming a committer and PMC member yourself) while also advancing NSF and NASA goals!