commoncrawl Search Results

869 results for commoncrawl


amocsub/easm #7

Scan HackerOne

# EASM Create a comment with any of the following templates for the tools and Github Actions would take it and trigger the corresponding application and return the results from the tool in a new comm…

amocsub updated 3 months ago
2
Kaspect/polar #1

logistics

For sunset website ![image](https://cloud.githubusercontent.com/assets/4623063/12933984/2d129e94-cf40-11e5-8695-56283a1a8c91.png)

briancohn updated 8 years ago
3
mlfoundations/dclm #59

Any plans to release pools after refinedweb heuristic filter…

Thank you for the great work! The repo is great for reproducing the entire data processing pipeline, but a lot of people (including me) seem particularly interested in studying the final quality f…

CodeCreator updated 22 hours ago
9
tensorflow/datasets #2187

Connection Error while trying download c4/en.realnewslike

I am using the default script to download c4 with config en.realnewslike, and am getting an error: RuntimeError: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='commoncrawl.s3.amazonaws.com', po…

nikregrado updated 3 years ago
5
filplus-bookkeeping/DAYOU #3

[DataCap Application] Commoncrawl

### Version 1 ### DataCap Applicant FileTech ### Project ID FileTech-02 ### Data Owner Name CommonCrawl ### Data Owner Country/Region United States ### Data Owner Industry Life Science / He…

nike-mp updated 3 weeks ago
34
commoncrawl/cc-mrjob #30

AWS EMR issues

I had some problems running on AWS EMR with the default mrjob.conf. In case anyone else is running into similar issues, I found that I needed to make two minor changes to mrjob.conf: change python2.7 …

DallanQ updated 3 years ago
2
webrecorder/cdxj-indexer #7

Feature Requests / questions on use --> Pipe, Readme

Few Feature requests and/or requests for help using cdxj-indexer! --> Also, my timing is good based on the reply by @ikreymer in another issue, seems we're both coming back to our respective projects…

jwest75674 updated 4 years ago
2
PetrochukM/PyTorch-NLP #61

Support loading fasttext model from custom file

What if I want to use own pretrained fasttext model (or even commoncrawl model instead of standard wiki one)? E.g. look what they publish now: https://fasttext.cc/docs/en/crawl-vectors.html. Current …

keanpantraw updated 3 years ago
5
CLD2Owners/cld2 #58

Which languages are supported

* Readme: "These 83 languages are detected" * python (import cld2; help(cld2)): 161 languages (175 language-script combinations), 240 total language-script combinations * python (import cld2; print(…

MartinThoma updated 11 months ago
1
meta-llama/llama #296

Paper questions: Common Crawl processing questions

There are a few details missing from the paper that are required to really understand what data was actually used for training LLAMA. The paper notes: > We preprocess five CommonCrawl dumps, ran…

joshalbrecht updated 1 year ago
1
