commoncrawl Search Results

869 results for commoncrawl


apache/incubator-stormcrawler #438

Improve charset detection

Charset detection works fine in most cases, but looking at the cases where it fails makes me think about how it can be improved. I found this page useful for general guidance: https://www.w3.org/Intern…

lukb updated 7 years ago
9
commoncrawl/news-crawl #11

News WARC file processing issue

Basically, I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using the Python warc package. Snippet to read the WARC file: import warc f = wa…

sandeepsingh updated 6 years ago
15
iipc/webarchive-commons #67

Add attribute "property" of HTML meta elements to WAT HTML-M…

(cf. commoncrawl/ia-web-commons#3 and commoncrawl/ia-web-commons@3763ccbb) For HTML elements only the attributes `name`, `rel`, `content` and `http-equiv` are extracted. The attribute `property` is …

sebastian-nagel updated 7 years ago
1
privacore/open-source-search-engine #34

Link to repo on the homepage?

I have been using findx for a while now. So far I like it a lot better than DuckDuckGo and Gigablast. However, I had a little trouble finding this repo, which required some searching. …

ROCKNROLLKID updated 6 years ago
7
tensorflow/tensor2tensor #97

AttributeError: 'NoneType' object has no attribute 'vocab_si…

Hardware: CPU:Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz Ram: 8 GB GPU: GeForce GT 740M Software: Ubuntu 16 Tensorflow GPU Version: 1.2.1 I am trying to follow the walk-through tutorial howe…

agemagician updated 7 years ago
4
crawler-commons/crawler-commons #168

<sitemapindex> not being processed

I recently ended up attempting to process the sitemap located at https://www.autotrader.com/sitemap.xml As you can see, the XML represents a sitemapindex as follows... ``` https://www.aut…

lewismc updated 7 years ago
5
EFForg/https-everywhere #7338

When to disable rulesets

This is a follow-up to the discussion in #7329 @jeremyn @terrorist96

Hainish updated 6 years ago
17
MusicConnectionMachine/UnstructuredData #67

Automatically retrieve cc index info on supplied URLs

Implement a function (probably within an extra module) that takes a list of URLs as a parameter and returns a list of Common Crawl index output for each URL. See: [Common crawl index example](http://…

lukasstreit updated 7 years ago
3
harvardnlp/seq2seq-attn #92

Preprocess failed for the WMT'14

This problem puzzled me for a day. I wanted to retrain the pre-trained model after pruning it with prune.py. I downloaded the English-German parallel data set from http://www.statmt.org/wmt14/tra…

IdiosyncraticDragon updated 7 years ago
4
ufal/conll2017 #4

Additional Resources?

Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resource…

slavpetrov updated 7 years ago
28
