-
I noticed that for `boto.s3.key.Key` you use `io.BufferedReader`.
boto3's `botocore.response.StreamingBody` has the attribute `_raw_stream`, which it looks like you can just pass to `io.BufferedRea…
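A minimal sketch of the wrapping idea, assuming the raw stream behaves like a readable binary file object. The helper name `buffered_lines` is hypothetical, and an in-memory `io.BytesIO` stands in for the stream. Whether the private `_raw_stream` attribute can be wrapped this way depends on the botocore and urllib3 versions in use.

```python
import io


def buffered_lines(raw_stream, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Wrap a raw binary stream in io.BufferedReader and yield its lines.

    Hypothetical helper: `raw_stream` must provide readable() and readinto().
    """
    reader = io.BufferedReader(raw_stream, buffer_size=buffer_size)
    for line in reader:
        yield line


# Stand-in for the raw stream. With boto3 this would be something like
# `response["Body"]._raw_stream` (a private attribute, so an assumption).
fake_raw = io.BytesIO(b"WARC/1.0\r\nWARC-Type: response\r\n")
lines = list(buffered_lines(fake_raw))
```

Buffering matters here because line-oriented parsing of an unbuffered network stream would otherwise issue many small reads.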
-
StatusUpdaterBolt, if configured with routing `byDomain`, should use the routing key from metadata (if provided in the field defined by `es.status.routing.fieldname`). Updates of the public suffix list …
-
Hi all!
This repo is really fast at reading the archives. I downloaded the first 100 WARC segments and was running your file
cc-warc-examples/src/org/commoncrawl/examples/WARCReader…
-
Something about the gzip encoder used to create the CommonCrawl archives doesn't play well with `libflate`. It only seems to decompress the first few hundred bytes.
[Example file.](https://commoncr…
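The excerpt does not say why decompression stops early. One possible explanation, stated here as an assumption, is that Common Crawl WARC files are multi-member gzip files (each record is compressed as its own gzip member), so a decoder that stops after the first member returns only the first record. The sketch below shows the multi-member behaviour in Python; `decompress_all_members` is a hypothetical helper, not part of any library.

```python
import gzip
import zlib


def decompress_all_members(data):
    """Decompress every gzip member in `data`, returning one bytes object per member."""
    members = []
    while data:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        members.append(decoder.decompress(data) + decoder.flush())
        # Anything after the end of this member belongs to the next one.
        data = decoder.unused_data
    return members


# Two independently compressed records, concatenated like records in a .warc.gz file.
payload = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
members = decompress_all_members(payload)
```

A decoder that reads only the first member would stop after `record one`. Python's `gzip.decompress` continues through all members.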
-
#### Problem description
I am trying to fine-tune a pretrained FastText model using gensim. I use the weights from the official Facebook implementation. Partial loading works fine, but full model loa…
-
Hi all,
I'd like to use GloVe vectors as pretrained embeddings when training a text classifier. I downloaded the [glove.840B.300d.zip vectors](http://nlp.stanford.edu/data/glove.840B.300d.zip), un…
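A minimal sketch of reading GloVe's plain-text format (one token followed by its vector components per line) into a dictionary. The helper `read_glove` and the sample data are hypothetical. Taking the last `dim` fields as the vector is a defensive choice for tokens that themselves contain spaces, which the 840B file reportedly has. That is an assumption and not something the excerpt states.

```python
import io


def read_glove(lines, dim):
    """Parse GloVe text lines into {token: [float, ...]} with vectors of length `dim`."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Everything before the last `dim` fields is the token, spaces included.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    return vectors


# Tiny in-memory sample with dim=3; the real file would be opened with
# open(path, encoding="utf-8") and dim=300.
sample = io.StringIO("the 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n")
vectors = read_glove(sample, dim=3)
```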
-
### Description
Hi guys,
Has anyone tried the Universal Transformer on machine translation tasks?
My experiments with the default settings do not surpass the Transformer on a zh-en MT task.
-
Hi there!
I am interested in creating something like an equivalent of `phf_codegen`, but for AcAutomaton.
Use case: quickly searching CommonCrawl's dataset for emojis!
To construc…
-
I'm using the following configuration:
```
runners:
  emr:
    # Recommended region: us-east-1 (where Common Crawl data lives)
    region: us-east-1
    # Temp directory for mrjob
    cloud…
-
I've tried using the "--csv" option of CCIndexWarcExport to extract some information based on a CSV file produced from Athena queries of the index. Unfortunately, it seems as though the ".warc.gz" fil…