-
Charset detection works fine in most cases, but looking at the cases where it fails makes me think about how it can be improved.
I found this page useful for general guidance:
https://www.w3.org/Intern…
-
Basically, I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using the Python warc package.
Snippet to read the WARC file:
import warc
f = wa…
-
(cf. commoncrawl/ia-web-commons#3 and commoncrawl/ia-web-commons@3763ccbb)
For HTML elements only the attributes `name`, `rel`, `content` and `http-equiv` are extracted. The attribute `property` is …
-
I have been using findx for a while now. So far I like it a lot better than DuckDuckGo and Gigablast. However, I had a little trouble finding this repo, and it required me to do some searching. …
-
Hardware:
CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
RAM: 8 GB
GPU: GeForce GT 740M
Software:
Ubuntu 16
TensorFlow GPU Version: 1.2.1
I am trying to follow the walk-through tutorial howe…
-
I recently ended up attempting to process the sitemap located at https://www.autotrader.com/sitemap.xml
As you can see, the XML represents a sitemapindex as follows...
```
https://www.aut…
-
This is a follow-up to the discussion in #7329
@jeremyn @terrorist96
-
Implement a function (probably within an extra module) that takes a list of URLs as a parameter and returns a list of Common Crawl index output for each URL.
See: [Common crawl index example](http://…
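
A minimal sketch of what such a function could look like, assuming the public Common Crawl index server and its JSON-lines output. The collection name in `INDEX_ENDPOINT`, the helper names (`build_query`, `parse_index_lines`, `fetch_text`, `lookup_urls`) and the handling of a 404 response as "no captures" are assumptions made for illustration, not part of this project.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: one specific crawl collection; replace with the index you need.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2017-30-index"


def build_query(url, endpoint=INDEX_ENDPOINT):
    """Build the index query URL for a single target URL."""
    return endpoint + "?" + urlencode({"url": url, "output": "json"})


def parse_index_lines(body):
    """Parse JSON-lines output (one JSON object per non-empty line)."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_text(query_url):
    """Fetch the response body; a 404 is assumed here to mean 'no captures'."""
    try:
        with urlopen(query_url) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 404:
            return ""
        raise


def lookup_urls(urls, fetch=fetch_text):
    """Return one list of index records per input URL, in input order."""
    return [parse_index_lines(fetch(build_query(url))) for url in urls]
```

The `fetch` parameter is injectable so the function can be exercised without network access, for example `lookup_urls(["example.com"], fetch=lambda q: "")` yields `[[]]`.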
-
This problem puzzled me for a day.
I wanted to retrain the pre-trained model after pruning it with prune.py.
I downloaded the English-German parallel data set from http://www.statmt.org/wmt14/tra…
-
Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resource…