commoncrawl Search Results

869 results
for commoncrawl


MusicConnectionMachine/UnstructuredData #50

Discuss interfaces to Group 1 and 3

felixschorer updated 7 years ago
10
modernmt/DataCollection #3

Check whether we are using ranged requests to access subsect…

https://groups.google.com/forum/?hl=en#!msg/common-crawl/0fYTJtFD6Fs/qblUucr9BgAJ;context-place=forum/common-crawl "You should be using ranged requests to avoid downloading the whole archive. Try this…

achimr updated 7 years ago
1
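
The quoted advice about ranged requests can be illustrated with a small offline sketch. It assumes the archive is a concatenation of independently gzipped records (as Common Crawl WARC files are) and that the byte offset and length of a record are already known, for example from an index. The helper names `range_header` and `slice_record` are hypothetical, and an in-memory byte string stands in for the remote file; this is not code from the linked repository.

```python
import gzip


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the `- 1`.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def slice_record(archive: bytes, offset: int, length: int) -> bytes:
    """Simulate what a server would return for that range, then decompress it.

    Each record is its own gzip member, so the slice decompresses on its own
    without reading the rest of the archive.
    """
    return gzip.decompress(archive[offset:offset + length])


if __name__ == "__main__":
    # Hypothetical two-record archive built in memory for illustration.
    members = [gzip.compress(b"first record"), gzip.compress(b"second record")]
    archive = b"".join(members)
    offset, length = len(members[0]), len(members[1])
    print(range_header(offset, length))
    print(slice_record(archive, offset, length))
```

With a real HTTP client, the dictionary returned by `range_header` would be passed as request headers, and the response body would take the place of the in-memory slice.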
modernmt/DataCollection #5

error installing pyrocksdb

I get an error when trying to install the requirements with pip install -r requirements.txt. I'm highlighting interesting bits in bold. I'm using Ubuntu 14.04. Thanks in advance for your help. R…

menpente updated 7 years ago
5
DigitalPebble/sc-warc #5

add sha digest field

see [https://github.com/commoncrawl/nutch/blob/3551eb6dbb7f7152a13d2e4eb0f8eb6014dc8252/src/java/org/commoncrawl/tools/WarcExport.java#L140]

jnioche updated 7 years ago
7
piskvorky/smart_open #12

Unable to read gz files on s3

It reads them, but the data remains compressed, thus defeating line iteration.

coreyhuinker updated 7 years ago
15
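
The symptom described in this issue (line iteration over bytes that are still compressed) can be worked around in general by wrapping the raw stream in a gzip reader before iterating. The following is a minimal standard-library sketch, not smart_open's own API: `iter_gzip_lines` is a hypothetical helper, and an `io.BytesIO` object stands in for the body of an S3 object.

```python
import gzip
import io


def iter_gzip_lines(raw_stream, encoding="utf-8"):
    """Yield decoded text lines from a file-like object holding gzip data."""
    # GzipFile decompresses lazily as the text wrapper asks for more bytes.
    with gzip.GzipFile(fileobj=raw_stream, mode="rb") as gz:
        for line in io.TextIOWrapper(gz, encoding=encoding):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Stand-in for a compressed object fetched from remote storage.
    compressed = gzip.compress(b"alpha\nbeta\ngamma\n")
    for line in iter_gzip_lines(io.BytesIO(compressed)):
        print(line)
```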
crate/crate-commoncrawl #6

Could not able to load the data.

I am trying to work with Common Crawl file and came across Crate.io. I am following the instructions as specified in the instruction part. I am having trouble with this command: `COPY commoncrawl F…

JafferWilson updated 7 years ago
13
commoncrawl/webarchive-indexing #2

dosample fails with encoding error when writing sequence fil…

When writing the sequence file (at the very end of the job) dosample.py sometimes fails with an encoding error: ``` + python dosample.py --verbose --shards=300 --splitfile=s3a://cc-cdx-index/2014-49…

sebastian-nagel updated 7 years ago
1
webrecorder/pywb #199

cdx-server json output

This [spec](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#output-format-json) describes the JSON output of a CDX server as an array of arrays. pywb cdx-server ret…

atomotic updated 7 years ago
3
DigitalPebble/sc-warc #11

Missing trailing CRLF after warcinfo record

The warcinfo record returned by [WARCRecordFormat](../blob/master/src/main/java/com/digitalpebble/stormcrawler/warc/WARCRecordFormat.java) lacks a trailing CRLF, which causes some WARC libraries to fail t…

sebastian-nagel updated 7 years ago
2
iipc/webarchive-commons #66

support WET files

CommonCrawl has the [WET files](http://commoncrawl.org/the-data/get-started/), which are WARC files where the HTML response has been converted to plain text (and non-HTML pages have been removed). Is it p…

dportabella updated 7 years ago
3
