-
(Using code from https://github.com/webrecorder/warcio/issues/57)
Calling record.content_stream().read() before writing the record causes the record to be changed in such a way that the file it write…
-
## Expected behavior
Pictures contained in the attached warc.gz should be served and displayed.
[republik1_0.warc.gz](https://github.com/webrecorder/pywb/files/10497691/republik1_0.warc.gz)
[republ…
-
We are testing OpenWayback using a .warc file generated by Heritrix.
We run OpenWayback on CentOS 7 + Tomcat 7. OWB seems capable of indexing the URLs in the .warc file. However, when we click the version (da…
-
This may be inherent in the WARC format, but we have a site that responds to a URL with either JSON or HTML depending on the request type (XmlHTTP or HTTP). In the WARC file after a crawl that retriev…
atiro updated
7 years ago
-
Attempts to process this segment:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz
sta…
-
When full text (`content`) is indexed, words that are separated only by line breaks are joined without a space in between.
A random example of value from the `content` attribute for the front page…
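
A minimal sketch of the kind of whitespace normalization that would avoid this, assuming the extracted text is available as a plain string before indexing. The function name `join_lines_with_spaces` is hypothetical and is not taken from the indexer's code:

```python
import re


def join_lines_with_spaces(text):
    """Collapse every run of whitespace (including line breaks) to one space.

    Hypothetical helper for illustration only: it keeps words that were
    separated just by a newline from being glued together.
    """
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Hello\nWorld\r\nagain"
    print(join_lines_with_spaces(sample))
```

Replacing the whitespace run with a space rather than deleting it is the whole point here: the word boundaries survive, so tokenization of `content` still sees separate terms.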
-
### Are you submitting a **bug report** or a **feature request**?
Bug report
### What is the current behavior?
I get an error when I try to start a crawl. This is the error:
```node
Run…
-
Hi guys,
after downloading and extracting the Turkish part of the OSCAR 21.09 release, I've found some sentences with encoding errors:
![image](https://user-images.githubusercontent.com/20651387…
-
Can we use, e.g., a counting stream reader to work out how long each WARC record is (compressed)?
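
One possible way to do this counting, sketched with only the Python standard library. It assumes the `.warc.gz` file stores each record as its own gzip member, and it does not parse WARC headers at all. The function name `compressed_member_lengths` is made up for this illustration:

```python
import gzip
import io
import zlib


def compressed_member_lengths(stream, chunk_size=65536):
    """Yield (offset, compressed_length, uncompressed_length) per gzip member.

    Assumption: one gzip member corresponds to one WARC record.
    """
    offset = 0
    buf = b""
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                return  # clean end of input
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        consumed = 0
        uncompressed = 0
        while not decomp.eof:
            if not buf:
                buf = stream.read(chunk_size)
                if not buf:
                    raise EOFError("truncated gzip member")
            uncompressed += len(decomp.decompress(buf))
            # Bytes past the end of this member are left in unused_data.
            consumed += len(buf) - len(decomp.unused_data)
            buf = decomp.unused_data
        yield offset, consumed, uncompressed
        offset += consumed


if __name__ == "__main__":
    # Two stand-in "records", each compressed as a separate gzip member.
    data = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
    for off, clen, ulen in compressed_member_lengths(io.BytesIO(data)):
        print(off, clen, ulen)
```

The byte count comes from `unused_data`, which holds whatever input zlib did not need once a member ended, so the compressed length is exact even when a read chunk spans two members.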
-
The `some`-branch adds support for direct indexing of JSON-Lines from the Twitter API. Retrieval is handled by storing the raw JSON in the Solr field `tw_json`, but that inflates the index.
Another…
tokee updated
3 years ago