-
langstat2candidates.py, particularly when used with the `-candidates` parameter, uses large amounts of RAM (32-64 GB for large language pairs). This is because it reads the entire can…
-
Is there a Java or a Kotlin binding for this library?
-
This bug (https://github.com/internetarchive/surt/issues/20) reported against the Python `surt` module is also present in the Java implementation.
This URL
`http://example.com/script?type=a+b+%26…
-
I get the error
```
Traceback (most recent call last):
  File "cdx-index-client.py", line 382, in <module>
    main()
  File "cdx-index-client.py", line 379, in main
    read_index(r, info['cdx-api'], i…
-
In addition to CDXJ, the [ZipNum format](https://github.com/ikreymer/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx) uses a secondary index, which also includes a sortable url key but contains other da…
-
There are quite a few now:
https://github.com/commoncrawl/ia-web-commons/security/dependabot
Also I am confused by the relationship between this repo and https://github.com/iipc/webarchive-commo…
-
For sunset website
![image](https://cloud.githubusercontent.com/assets/4623063/12933984/2d129e94-cf40-11e5-8695-56283a1a8c91.png)
-
I had some problems running on AWS EMR with the default mrjob.conf. In case anyone else is running into similar issues, I found that I needed to make two minor changes to mrjob.conf: change python2.7 …
-
A few feature requests and/or requests for help using cdxj-indexer!
Also, my timing is good: based on the reply by @ikreymer in another issue, it seems we're both coming back to our respective projects…
-
If a WARC request record contains an overlong and truncated HTTP request header line (`GET /path HTTP/1.1`), HttpRequestMessageParser throws an exception, so that the request record is not tr…