commoncrawl Search Results

870 results
for commoncrawl

commoncrawl/news-crawl #8

Bootstrap topology to add feeds and sitemaps from news sites

[DMOZ](http://www.dmoz.org/) is a good source to get a large list of news sites for various languages and countries: 1. define a set of English categories matching the topic 2. extract the translation…

sebastian-nagel updated 6 years ago
3
ScaleUnlimited/flink-crawler #111

Fix termination issues when job is cancelled

Currently when the flink-crawler job is cancelled, we get a bunch of errors (and other interesting output) in the log (see below). Some comments about this: 1. I think we need to make CommonCrawlFe…

kkrugler updated 6 years ago
1
commoncrawl/cc-mrjob #22

Not working anymore on EMR? "subprocess failed with code 1"

Hi, I have some problems getting the MR jobs to run on EMR. The scripts work locally, and they work with 1 or 10 WARC files on EMR, but with 100 I always get some failures of the type "PipeMapRed.waitOutpu…

joergrech updated 6 years ago
17
helgeho/ArchiveSpark #12

saveAsWarc with same warcPaths as the source

This code produces a .cdx and .warc file. ``` ArchiveSpark.load(sc, WarcCdxHdfsSpec(cdxPath = "/data/example.cdx.gz", warcPath = "/data")) .filter(r => r.surtUrl.startsWith("com,example")) .saveAs…

dportabella updated 6 years ago
3
helgeho/ArchiveSpark #9

load a warc archive without a cdx file

This code effectively loads the CDX index, and gets the WARC pages if needed. `val rdd: RDD[WarcRecord] = ArchiveSpark.load(sc, WarcCdxHdfsSpec("/data/example.cdx.gz", "/data"))` I see that defini…

dportabella updated 6 years ago
4
salesforce/cove #11

word embeddings should be fixed during training?

In your paper Learned in Translation: Contextualized Word Vectors (McCann et al. 2017), it says > We used the CommonCrawl-840B GloVe model for English word vectors, which were completely fixed d…

wirehack updated 6 years ago
1
allinurl/goaccess #1032

Support for FastMail CSV access log format?

My apologies if this is obvious, but I don't know that much about website logs. I'm using [FastMail](https://www.fastmail.com/) to [serve files](https://www.fastmail.com/help/files/website.html). …

mignon-p updated 6 years ago
2
helgeho/ArchiveSpark #4

cdx format, includes json

Dear helgeho, The Common Crawl CDX files have a different structure than your library expects. `au,com,canberratimes)/business/mining-and-resources/bhp-on-the-hunt-for-new-chief-ft-20121106-28x…

borissmidt updated 6 years ago
26
archivesunleashed/aut #157

remove angle brackets from ArchiveRecord.getUrl

According to https://github.com/iipc/warc-specifications/issues/23, the standard says that WARC-Target-URI should be surrounded by angle brackets (`<` and `>`), such as in: `WARC-Target-URI: http://www.archive.org/images/logoc.…

dportabella updated 6 years ago
6
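
A minimal sketch of the normalization the issue title above describes, assuming the goal is to drop one enclosing pair of angle brackets from a `WARC-Target-URI` value. The function name `strip_angle_brackets` is hypothetical and is not part of aut's API; Python is used here only for illustration.

```python
def strip_angle_brackets(uri: str) -> str:
    """Return the URI without one enclosing pair of angle brackets, if present.

    Hypothetical helper for illustration; not taken from any WARC library.
    """
    value = uri.strip()
    # Only strip when both the opening and the closing bracket are present.
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value
```

Values without enclosing brackets pass through unchanged apart from whitespace trimming, so the helper can be applied to records from either convention.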
commoncrawl/cc-mrjob #18

Upgrade to use boto3

[cc-pyspark](/commoncrawl/cc-pyspark) already uses boto3 to download data from s3://commoncrawl/: faster multi-part downloads and fewer errors (timeouts, "503 slow down"). The upgrade should improve th…

sebastian-nagel updated 6 years ago
1
