evaristoc / fCC_R3_DataAnalysis


Checking the use of commoncrawl datasets; crawling references #24

Open evaristoc opened 6 years ago

evaristoc commented 6 years ago

http://commoncrawl.org/the-data/
http://commoncrawl.org/the-data/examples/
https://groups.google.com/forum/#!forum/common-crawl
https://nlp.stanford.edu/pubs/cluster-wsdm09.pdf
http://resources.mpi-inf.mpg.de/d5/teaching/ws09_10/socialnetworks/talks/Thomas_v_Bomhard.pdf

The best access seems to be through Elastic MapReduce (EMR) clusters on AWS. Part of that service is paid. There is Python example code for it, built on mrjob: https://github.com/commoncrawl/cc-mrjob

There is also an API (:tada:!!) https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference
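A minimal sketch of how a query against that CDX Server API could look from Python. The `url`, `output` and `limit` parameters come from the API reference linked above. The endpoint and crawl ID in `INDEX_ENDPOINT` are assumptions for illustration and should be checked against the current list of Common Crawl indexes; `build_cdx_query` and `parse_cdx_lines` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

# Assumption: illustrative index endpoint and crawl ID, not verified here.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2017-26-index"


def build_cdx_query(url_pattern, limit=5, endpoint=INDEX_ENDPOINT):
    """Build a CDX query URL asking for JSON output, capped at `limit` rows."""
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return endpoint + "?" + urlencode(params)


def parse_cdx_lines(body):
    """Parse a CDX response with output=json: one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

The query URL can then be fetched with `urllib.request.urlopen` or `requests.get`, and the decoded response body passed to `parse_cdx_lines`.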

evaristoc commented 6 years ago

wget python https://pypi.python.org/pypi/wget

webbrowser https://docs.python.org/3/library/webbrowser.html

text browsing python 3.x

execute bash python 3.x https://ubuntuforums.org/showthread.php?t=2264907
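For running a bash command from Python 3, one option is the standard-library `subprocess.run` (the `capture_output` argument needs Python 3.7+). This is only a sketch; `run_bash` is a hypothetical helper name, and it assumes `bash` is on the `PATH`.

```python
import subprocess


def run_bash(command):
    """Run `command` through bash and return its standard output as text."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_bash("ls | wc -l")` returns the line count as a string.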

HTTP Error 502: Bad Gateway http://www.checkupdown.com/status/E502.html https://stackoverflow.com/questions/37506648/i-am-getting-urllib2-httperror-http-error-502-bad-gateway https://stackoverflow.com/questions/23036169/getting-502-bad-gateway-error-and-sending-a-email-with-django-nginx-gunicorn https://github.com/iexg/IEX-API/issues/4
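A 502 comes from a gateway or proxy in front of the server and is often temporary, so one way to cope while crawling is to retry with a growing pause. The sketch below does that with `urllib`; `fetch_with_retry` and its parameters are hypothetical names, and the retry count and delays are arbitrary choices.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch `url`, retrying only on HTTP 502 with exponential backoff."""
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 502, or a 502 on the last attempt.
            if err.code != 502 or attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)
```

The `opener` argument exists only so the function can be tried without network access.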

urllib https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

requests, urllib python 3.x see headers https://stackoverflow.com/questions/14949644/python-get-header-information-from-url http://docs.python-requests.org/en/master/user/quickstart/
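To look at the response headers with plain `urllib`, a `HEAD` request avoids downloading the body. A sketch, with hypothetical helper names and an arbitrary `User-Agent` value (some servers reject the default one):

```python
import urllib.request


def build_head_request(url, user_agent="Mozilla/5.0"):
    """Create a HEAD request carrying a custom User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def fetch_headers(url):
    """Send the HEAD request and return the response headers as a dict."""
    with urllib.request.urlopen(build_head_request(url)) as response:
        return dict(response.headers.items())
```

With `requests` the equivalent is `requests.head(url).headers`.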

get all text of a website python https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text
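The linked answer uses BeautifulSoup. As a dependency-free alternative, the same idea (keep text nodes, skip `script`, `style` and similar tags) can be sketched with the standard-library `html.parser`. `VisibleTextParser` and `visible_text` are hypothetical names, and the list of skipped tags is an assumption about what counts as "not visible".

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes that are not inside non-rendered tags."""

    SKIPPED = {"script", "style", "head", "title", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

HTML comments never reach `handle_data`, so they are dropped without extra code.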

evaristoc commented 6 years ago

multiprocessing python http://www.thegeekstuff.com/2012/03/linux-threads-intro/?utm_source=tuicool

subprocess open and close python 3 linux https://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true
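With `shell=True`, the `Popen` object refers to the shell, so `terminate()` may leave the actual command running. One approach on Linux is to start the child in its own session and signal the whole process group. A POSIX-only sketch with hypothetical helper names:

```python
import os
import signal
import subprocess


def start_in_new_group(command):
    """Start a shell command in a new session, and so a new process group."""
    return subprocess.Popen(command, shell=True, start_new_session=True)


def terminate_group(proc, timeout=5):
    """Send SIGTERM to the child's whole process group and wait for it to exit."""
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    return proc.wait(timeout=timeout)
```

`terminate_group` returns the exit status, which is negative when the process was ended by a signal.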

assigning to a subprocess the same pid https://stackoverflow.com/questions/17856928/how-to-terminate-process-from-python-using-pid
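A PID is assigned by the operating system, so rather than choosing one, the usual route is to record `proc.pid` when the subprocess starts and signal that PID later. A POSIX-oriented sketch; `terminate_pid` is a hypothetical helper name.

```python
import os
import signal


def terminate_pid(pid, sig=signal.SIGTERM):
    """Send `sig` to `pid`; return False if no such process exists."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False
    return True
```

If the process was started with `subprocess.Popen` in the same program, calling `proc.wait()` afterwards still matters so that the child is reaped.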