commoncrawl Search Results

869 results for commoncrawl

EleutherAI/the-pile #96

Scripts for dedup and filter Common Crawl?

Hi, I notice that the download URL for the [`CommonCrawlDataset`](https://github.com/EleutherAI/the-pile/blob/master/the_pile/datasets.py#L756) is `http://eaidata.bmk.sh/data/pile_cc_filtered_dedup…

shangw-nvidia updated 2 years ago
1
amazon-archives/aws-training-demo #4

unable to ssh on to master node

I am referring to the process-commoncrawl-with-emr writeup. I was able to successfully create the clusters, but when I tried to ssh on to the master node to execute hdfs and hadoop commands, it just did not …

skprasadu updated 6 years ago
1
internetarchive/Sparkling #3

`s3a` URLs don't work in `WarcLoader` (`Wrong FS: s3a://...`…

EDIT: this helped with `Wrong FS`, more tickets incoming ;) ```sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")``` Hey folks, I'm trying to read some common crawl data from S3...…

acruise updated 6 months ago
1
hlp-ai/mt-data #2

From CommonCrawl WET files, for each web page count the len…

hlp-ai updated 1 year ago
1
lc/gau #117

GAU .toml not working

This error prevents me from executing GAU correctly: WARN[0000] error reading config: open /home/kali/.gau.toml: no such file or directory

Disarm3 updated 3 months ago
6
commoncrawl/cc-index-server #7

[PyWB2] Remove "source" and "source-coll" fields from result…

With PyWB 2.x every result record contains two extra fields "source" and "source-coll" absent in the original index, e.g. ```json { "url": "http://commoncrawl.org/", "mime": "text/html", "m…

sebastian-nagel updated 3 years ago
1
microsoft/tensorwatch #8

Support for S3 Stream

Is it possible to use as FileStream a S3 file as `filename` in ```python watcher = tw.Watcher(filename=r'c:\temp\test.log', port=None) cli = tw.WatcherClient(r'c:\temp\sum.log') ``` I would l…

loretoparisi updated 5 years ago
1
opening-up-chatgpt/opening-up-chatgpt.github.io #88

Improve YAML format by including assessment date & model ver…

With the proliferation of models and model variants it becomes more important to track assessment dates and model versions. So far we've been able to treat model families as one, because it rarely …

mdingemanse updated 4 months ago
2
stanfordnlp/GloVe #133

Which common crawl does the "glove.840B.300d.zip" use?

Hello, I'm using the "Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)" pre-trained vectors to replicate a study. I ran the ``demo.sh`` smoothly, and I want to repro…

peoplecure updated 5 years ago
1
google-research/google-research #644

help for data(AndroidHowTo/crawled_instructions.json) of seq…

The instructions in [README.md](https://github.com/google-research/google-research/blob/master/seq2act/data_generation/README.md) have listed all the steps about how to generate the ```AndroidHowTo``…

lynneChan updated 2 years ago
3
