commoncrawl Search Results

898 results
for commoncrawl


ipfs-inactive/archives #162

Common Crawl

https://commoncrawl.org/ > We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone. I'm not sure how much data it is, but certainly a few TB.

ghost updated 7 years ago
1
filecoin-project/filecoin-plus-large-datasets #2319

[DataCap Application] Mongo2Stor

Data Owner Name: Mongo2Stor. Role related to the dataset: Data Preparer. Data Owner Country/Region: United States. Data Owner Industry: Not-for-Profit. Website: [https://da…

amughal updated 7 months ago
9
lena-voita/the-story-of-heads #6

Dataset

Hi lena-voita and RachitBansal, I am trying to reproduce the experiment using the *WMT2018 (the Yandex corpus, EN-RU)*. However, the result I got wasn't very satisfying. I guess I might have …

NilesJiang updated 9 months ago
2
commonsearch/cosr-back #63

Integrate the new Common Crawl News dataset

http://commoncrawl.org/2016/10/news-dataset-available/ We should make sure it works with the current [common crawl source](https://github.com/commonsearch/cosr-back/blob/master/cosrlib/sources/common…

sylvinus updated 8 years ago
1
hplt-project/ia-download #1

Check whether all IPs returned for data.commoncrawl.org are …

I'm not sure whether Python does a DNS resolve for every `urlopen` call or not. I noticed that `data.commoncrawl.org` returns multiple IPs, so we could spread the load over multiple cloudfront servers…

jelmervdl updated 1 year ago
1
commoncrawl/cc-index-server #11

403 when locally hosted cc-index-server tries to connect to …

Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the sessi…

davetbo-amzn updated 1 year ago
5
tensorflow/tensor2tensor #1793

Wikisum generation fails with Python 3.7

Description: When trying to execute the CommonCrawl generator script (`tensor2tensor/data_generators/wikisum/get_references_commoncrawl.py`), I run into several issues that point towards compatibi…

dennlinger updated 4 years ago
2
s0md3v/Photon #118

Add more datasets

Love that you pull from wayback. There are more datasets that you can use too. Complementing wayback with resources such as http://commoncrawl.org may yield more results.

nullenc0de updated 5 years ago
1
archivesunleashed/aut #556

`s3a` URLs don't work as in documentation

EDIT: this helped, the doc may need to be updated: `sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")` **Describe the bug** According to the docs, `aut` should be able to r…

acruise updated 9 months ago
1
spencermountain/dumpster-dive #84

Other feeds

Great work fixing this in 5.3.0, mate. Importing EN now; do you know of other feeds people use with it? Have you ever thought about doing something like this with CommonCrawl?

SimonBurfield updated 4 years ago
1
