ICT4SD / Science_Technology_Search

Build a searchable collection of science and technology knowledge useful to implement the Sustainable Development Goals.
https://ict4sd.github.io/2016/09/21/Project_ST_SEARCH/
GNU General Public License v3.0

questions: get plain text from common crawl #1

Open lli130 opened 7 years ago

lli130 commented 7 years ago

Dear Mr. Sebastian Nagel @sebastian-nagel, I am a member of the Fordham University S & T team. Would you help me get plain-text content from Common Crawl? I have collected some useful URLs using the Common Crawl index API; is it possible to use these URLs together with the WET files to retrieve the web text content? Is it necessary to use the WARC files at the same time? Thank you so much.

Regards, Liyi Li.

unite-analytics commented 7 years ago

Hello @lli130, while we wait for help from Sebastian, please see the bottom of this README file. I just added some new instructions there on how to get full text from Common Crawl. The instructions are not quite perfect, but they might help you:

https://github.com/ICT4SD/Science_Technology_Search/blob/ICT4SD_tools/README.md

lli130 commented 7 years ago

@unite-analytics Thank you so much!

sebastian-nagel commented 7 years ago

Hi @lli130, unfortunately the index API provides only offsets into the WARC files. WET files have the same name (except for the directory path and extension), but they are smaller, so reusing the WARC offsets can only be an approximation (see also [1]). If in doubt, it may be cheaper to parse the HTML fetched from the WARC file than to search for the corresponding record in the WET file.
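
For illustration, a rough Python sketch of that WARC route: look the URL up in the index API, then fetch just that record with a ranged request and strip the WARC and HTTP headers to get the HTML. The crawl ID and the data host below are assumptions; adjust them to the crawl you actually queried.

  # Sketch: fetch one page's HTML from Common Crawl via the index API offsets.
  import gzip
  import json
  import requests

  CRAWL = "CC-MAIN-2017-30"  # assumption: use the crawl you actually queried

  def fetch_html(url):
      # 1. Look the URL up in the index API; each result line is a JSON record.
      resp = requests.get(
          "https://index.commoncrawl.org/%s-index" % CRAWL,
          params={"url": url, "output": "json"},
      )
      record = json.loads(resp.text.splitlines()[0])

      # 2. Ranged request for exactly one gzipped WARC record.
      offset, length = int(record["offset"]), int(record["length"])
      warc = requests.get(
          "https://data.commoncrawl.org/" + record["filename"],
          headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
      )
      payload = gzip.decompress(warc.content)

      # 3. Drop the WARC headers and HTTP headers; what remains is the HTML.
      return payload.split(b"\r\n\r\n", 2)[2]

  print(fetch_html("https://www.un.org/")[:200])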

sebastian-nagel commented 7 years ago

@lli130: it would be nice to share those "useful" URLs as a resource to "steer" the crawler.

@unite-analytics: where can I find url_fetch_commoncrawl.py?

lli130 commented 7 years ago

@sebastian-nagel Thank you for the response! The URLs, the URL_FETCH results, and the corresponding Python code are all included in https://github.com/ICT4SD/Science_Technology_Search/tree/Fordham-BEETHOVEN

unite-analytics commented 7 years ago

Hello Sebastian @sebastian-nagel and Sylvain @sylvinus: what would be your recommended approach if we wanted to index the full text of the webpages in this list of 15 domains into an Elasticsearch server?

https://github.com/ICT4SD/Science_Technology_Search/blob/Fordham-BEETHOVEN/government%20domains%20of%2015%20UN%20security%20council

Any suggestion would be appreciated!

sylvinus commented 7 years ago

Hi @unite-analytics !

I'd say the main question is: how exhaustive do you want the index to be? If the coverage of Common Crawl is enough (Sebastian may be able to give more stats on it), great! If not, you will have to start a new crawl first.

For the indexing, if indexing the full text and sending simple queries is your only need, I recommend piping the .wet files straight to Elasticsearch with a simple script.
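
As a starting point, here is a minimal sketch of such a script, assuming the warcio and elasticsearch Python packages, an ES 7+ node on localhost:9200 and a placeholder index name (older ES versions also need a _type in each bulk action):

  # Sketch: stream plain-text records from a WET file into Elasticsearch.
  from elasticsearch import Elasticsearch, helpers
  from warcio.archiveiterator import ArchiveIterator

  def wet_records(path):
      with open(path, "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type != "conversion":  # WET plain-text records
                  continue
              yield {
                  "_index": "fulltext",  # placeholder index name
                  "_source": {
                      "url": record.rec_headers.get_header("WARC-Target-URI"),
                      "body": record.content_stream().read().decode("utf-8", "replace"),
                  },
              }

  es = Elasticsearch(["http://localhost:9200"])
  helpers.bulk(es, wet_records("path/to/file.warc.wet.gz"))  # placeholder path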

If your end goal is to build a search engine with more features, I definitely recommend investigating the Common Search pipeline, starting here: https://about.commonsearch.org/developer/tutorials/analyzing-the-web-with-spark-on-ec2

I'm happy to provide support on how to adapt it to your needs.

unite-analytics commented 7 years ago

Hey Sylvain @sylvinus, thanks for your prompt response! The content of Common Crawl is fine for now. We want to avoid having to do our own crawl if possible.

Our end goal is to make a search engine for the websites of about 190 country domains (.gov, .gouv.fr, etc.) and a couple of thousand university domains (mit.edu, paristech.fr, etc.).

After getting the full text of the pages, we want to apply a classification step, where each document will be tagged with one or more of 17 categories, and some documents will be eliminated (e.g. "contact us" pages or documents which do not fall into any of the 17 categories).

Then, after classification, everything should go into Elasticsearch.
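
For reference, a very rough sketch of what we have in mind for that step (purely illustrative; the category names and keyword lists are placeholders, not the real classifier):

  # Sketch: tag each document with zero or more categories, drop the rest.
  CATEGORIES = {
      "health": ["health", "hospital", "disease"],
      "education": ["education", "school", "university"],
      # ... up to 17 categories
  }

  def classify(doc):
      text = doc["body"].lower()
      return [name for name, words in CATEGORIES.items()
              if any(word in text for word in words)]

  def prepare_for_indexing(docs):
      for doc in docs:
          tags = classify(doc)
          if not tags:  # e.g. "contact us" pages are dropped here
              continue
          doc["categories"] = tags
          yield doc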

The tutorial on the Common Search pipeline seems excellent, we'll give it a try, but would appreciate your help and guidance! Thanks, Jorge

unite-analytics commented 7 years ago

Hello Sylvain @sylvinus. I managed to follow the tutorial locally; it's very well written, thanks! Would you have any plugin from a past project that takes a domain as input (e.g. .gouv.fr) and saves, or ingests into Elasticsearch, the metadata and content of the relevant pages found?

I suppose one idea is to have a plugin which can handle even just one domain, and then the process can be run multiple times (on multiple machines) for other domains to split the workload.

Any suggestions? Jorge

sylvinus commented 7 years ago

Hi Jorge!

Actually, the plugins run inside Spark so it's not your job to split the workload, Spark will do it for you automatically :)

Based on the code of the filter plugin used in the tutorial, it already does suffix filtering: https://github.com/commonsearch/cosr-back/blob/master/plugins/filter.py#L46

So if you run this kind of command (first blacklisting all domains, then selectively whitelisting them back for indexing):

  spark-submit --verbose /cosr/back/spark/jobs/pipeline.py \
    --source commoncrawl:limit=8,maxdocs=1000 \
    --plugin plugins.filter.All:skip=1 \
    --plugin "plugins.filter.Domains:index_body=1,domains=.gouv.fr .gov"

You will end up with the right ones in Elasticsearch :)

From there it would be easy to write a new plugin that reads your file instead of the command-line options.
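
In the meantime, a small workaround (a sketch; the file name and crawl limits below are placeholders) is to generate the spark-submit call from your domain list:

  # Sketch: build the spark-submit command from a file with one domain per line.
  import subprocess

  with open("government_domains.txt") as f:  # placeholder file name
      domains = " ".join(line.strip() for line in f if line.strip())

  cmd = [
      "spark-submit", "--verbose", "/cosr/back/spark/jobs/pipeline.py",
      "--source", "commoncrawl:limit=8,maxdocs=1000",
      "--plugin", "plugins.filter.All:skip=1",
      "--plugin", "plugins.filter.Domains:index_body=1,domains=%s" % domains,
  ]
  subprocess.run(cmd, check=True)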

unite-analytics commented 7 years ago

Thanks @sylvinus !

I followed your tips and managed to get results from the domains I want =) and also managed to save them in Apache Parquet format. This is the command I ran:

  spark-submit --verbose \
    /cosr/back/spark/jobs/pipeline.py \
    --source commoncrawl:limit=2,maxdocs=500 \
    --plugin plugins.filter.All:skip=1 \
    --plugin "plugins.filter.Domains:index_body=1,domains=.gouv.fr .gob.mx .gov.af .gov" \
    --plugin plugins.dump.DocumentMetadata:output=out/save_results/,abort=1

And this is the kind of result I get in the Parquet file:

  {"id": 6491477653774423921, "url": "http://angelscamp.gov/index.php/museum-top", "rank": 0.0095379538834095}
  {"id": 1275145380204344402, "url": "http://aps.anl.gov/bcda/synApps/motor/R5-4/motor_release.html", "rank": 0.0009488448849879205}
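
In case it helps others, a quick way to peek at that dump (a sketch assuming pandas with pyarrow installed; the path is the output= value from the command above):

  # Sketch: inspect the DocumentMetadata dump written by the pipeline run above.
  import pandas as pd

  df = pd.read_parquet("out/save_results/")
  print(df.head())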

Question: how can I get the actual text of each URL? Or has the text already been ingested into Elasticsearch?

Thanks again for any hints! J

sylvinus commented 7 years ago

Hi!

Glad you're making it work!

As its name suggests, the DocumentMetadata plugin doesn't currently dump the full text, just the metadata. However, since you used index_body=1, the full text should indeed already be in the Elasticsearch instance (the one that was started by Docker, though you can provide any external ES URL via the config).
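
A quick way to confirm that is to hit the local instance directly (a sketch, assuming the Docker-started ES is reachable on localhost:9200 and the full text lives in the text index):

  # Sketch: check that documents with an indexed body landed in Elasticsearch.
  import requests

  print(requests.get("http://localhost:9200/text/_count").json())
  print(requests.get("http://localhost:9200/text/_search",
                     params={"q": "body:health", "size": 3}).json())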

unite-analytics commented 7 years ago

Hello @sylvinus I managed to insert documents into Elasticsearch, so I'm making steady progress =). For example, to insert URLs from the .edu domain I used:

  spark-submit --verbose /cosr/back/spark/jobs/pipeline.py \
    --source commoncrawl:limit=2,maxdocs=500 \
    --plugin plugins.filter.All:skip=1 \
    --plugin "plugins.filter.Domains:index_body=1,domains=.edu"

thanks again for the help above!

My next question is about Elasticsearch. I can see the collection and search for documents using queries like these:

  curl 'http://localhost:9200/_search?q=body:health&pretty=true'
  curl 'http://localhost:9200/_search?q=title:health&pretty=true'

and the responses look like this:

  "hits" : {
    "total" : 23,
    "max_score" : 5.4524527,
    "hits" : [ {
      "_index" : "text",
      "_type" : "page",
      "_id" : "-7523425410665115244",
      "_score" : 5.4524527
    }, {
      "_index" : "text",
      "_type" : "page",
      "_id" : "4790237554859872660",
      "_score" : 5.2839737
    },...etc...

Or if I request a specific document with:

  curl -XGET 'localhost:9200/text/page/8751299815594780343?stored_fields=*&pretty'

the response is:

{
  "_index" : "text",
  "_type" : "page",
  "_id" : "8751299815594780343",
  "_version" : 1,
  "found" : true
}

Finally, to get the titles of documents and snippets, you can execute a command like the one below. This will return documents containing the search query "education":

$ curl -XGET '172.19.02:9200/docs/_search?q=education&pretty'

"hits" : { "total" : 33, "max_score" : 1.6600497, "hits" : [ { "_index" : "docs", "_type" : "page", "_id" : "-7638175531656596612", "_score" : 1.6600497, "_source" : { "url" : "http://marywood.edu/education/graduate-programs/grad-requirements.html", "summary" : "Education Department: General Graduate Requirements", "title" : "Education Department: General Graduate Requirements | Marywood..." } }, { "_index" : "docs", "_type" : "page", "_id" : "8972333847422052425", "_score" : 1.4228997, "_source" : { "url" : "http://ric.edu/generaleducation/honors.php", "summary" : "To complete General Education Honors, students must take a minimum of five General Education courses in specially designed honors sections. Courses chosen...", "title" : "General Education" } ...............

Note: replace the IP address with the address of your container named cosrbackshell_elasticsearch_1. You can obtain that address by first listing your running containers:

  $ docker ps

Then get the IP address:

  $ docker inspect cosrbackshell_elasticsearch_1

sylvinus commented 7 years ago

Hi Jorge!

For optimization, we "index" all the fields in the text index, but we don't "store" them. So the only thing you get back from that index will be the _id of the document.

To get the metadata for the pages that matched, you should query the other index, which is named docs. You have an example of this in Python here: https://github.com/commonsearch/cosr-back/blob/master/cosrlib/searcher.py#L133
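
If you just want a quick test outside of cosrlib, something like this does the same lookup (a sketch using the requests package against the same local ES and the docs index):

  # Sketch: fetch url/title/summary for matching pages from the docs index.
  import requests

  resp = requests.get("http://localhost:9200/docs/_search",
                      params={"q": "education", "size": 5})
  for hit in resp.json()["hits"]["hits"]:
      doc = hit["_source"]
      print(hit["_id"], doc["url"], doc["title"])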

I don't know what your constraints are on the frontend side, but you could either use a forked version of our frontend (https://github.com/commonsearch/cosr-front) or at least review its code to start your own!

By the way, I'm guessing you will want to customize the way you index documents, so I recommend you have a look at the index.py file to see how you'd do it.

lli130 commented 7 years ago

Hello @sylvinus ! The following command helps us find documents containing a list of words in their indexable text. The input is a set of target domains; the output is the list of documents that contain the filter words.

  spark-submit --verbose spark/jobs/pipeline.py \
    --source commoncrawl:limit=8,maxdocs=500 \
    --plugin "plugins.filter.Domains:index_body=1,domains=.un.org .undp.org" \
    --plugin "plugins.grep.Words:words=science technology,output=out/words_st"

Would you know how I can open "localhost:4040"? Since I cannot open Spark on localhost, I cannot get the Spark master IP. Thanks for any hints.

unite-analytics commented 7 years ago

@lli130 this is not quite what you are asking, but it might help:

1) To see the list of Docker containers you have created on your machine, try:

  $ docker ps -a

2) Take the name of the container that you want to connect to.

3) Then, to see the details (including the IP) of that container, try:

  $ docker inspect your_container_name

However, I noticed the containers created by the commonsearch scripts don't have port 4040 or other ports open by default, so I connected to the specific containers and worked from within them. For example, to open a terminal in a running container, try:

  $ docker exec -it your_running_container_name bash

If for any reason the container isn't running, first start it with:

  $ docker start your_container_name

From within the container you can access services using localhost:port.

unite-analytics commented 7 years ago

@sylvinus thanks for the clarifications above. I was busy for a few weeks, but now I'm getting back into this.