-
Hi,
I notice that the download URL for the [`CommonCrawlDataset`](https://github.com/EleutherAI/the-pile/blob/master/the_pile/datasets.py#L756) is `http://eaidata.bmk.sh/data/pile_cc_filtered_dedup…
-
I am referring to the process-commoncrawl-with-emr writeup.
I was able to create the clusters successfully, but when I tried to SSH into the master node to run hdfs and hadoop commands, it just did not …
-
EDIT: this helped with `Wrong FS`, more tickets incoming ;)
```sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")```
Hey folks, I'm trying to read some common crawl data from S3...…
-
This error prevents me from executing GAU correctly:
WARN[0000] error reading config: open /home/kali/.gau.toml: no such file or directory
-
With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", that are absent from the original index, e.g.
```json
{
"url": "http://commoncrawl.org/",
"mime": "text/html",
"m…
-
Is it possible to use an S3 file as the FileStream `filename` in
```python
watcher = tw.Watcher(filename=r'c:\temp\test.log', port=None)
cli = tw.WatcherClient(r'c:\temp\sum.log')
```
I would l…
-
With the proliferation of models and model variants it becomes more important to track assessment dates and model versions.
So far we've been able to treat model families as one, because it rarely …
-
Hello,
I'm using the "Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)" pre-trained vectors to replicate a study.
I ran ``demo.sh`` without any problems, and I want to repro…
-
The instructions in [README.md](https://github.com/google-research/google-research/blob/master/seq2act/data_generation/README.md) list all the steps for generating the ```AndroidHowTo``…