-
When parsing part of the CommonCrawl corpus (which consists of ~1G WARC files where each _record_ is individually compressed), flate2 will return EOF after the first chunk has been decompressed rather…
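A file in which each record is compressed individually is a concatenation of independent gzip members, and a reader that handles only one member stops at the first end-of-stream marker. The sketch below reproduces that effect using Python's standard `zlib` module rather than flate2. The function names and the two sample records are hypothetical, chosen only for illustration.

```python
import gzip
import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def decompress_first_member(data: bytes) -> bytes:
    """Decode only the first gzip member, as a single-member reader would."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data)


def decompress_all_members(data: bytes) -> bytes:
    """Decode every concatenated gzip member found in `data`."""
    out = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        out.append(d.decompress(data))
        out.append(d.flush())
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Bytes after the end of this member belong to the next one.
        data = d.unused_data
    return b"".join(out)


# Hypothetical stand-in for a WARC file: two records, compressed separately.
records = [b"WARC/1.0 record one\n", b"WARC/1.0 record two\n"]
blob = b"".join(gzip.compress(r) for r in records)

first_only = decompress_first_member(blob)
everything = decompress_all_members(blob)
```

Here `first_only` holds just the first record, while `everything` holds both, because the loop restarts decompression on the bytes left over after each member.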
-
Currently, the development of machine learning tools is spread across several different packages with little coordination. Consequently, some efforts are repetitive, while some important aspects remain lacki…
-
The createSegment() method should create a new segment (file) rather than overwrite the existing one; however, this is not the case on S3.
This should be updated:
```
FSDataOutputStream fsStream = (progress =…
-
See https://support.pivotal.io/hc/en-us/articles/202810986-Mapper-output-key-value-NullWritable-can-cause-reducer-phase-to-move-slowly
-
Strangely, `zlib-bindings-0.1.1.5` returns a truncated result while decompressing one of the Common Crawl WARC archives.
Consider `ZTest.hs`,
``` haskell
import Control.Monad (unless)
import qualifie…
-
Currently, there is a [phrasecount](https://github.com/keith-turner/phrasecount) example for Accismus. While phrasecount is helpful in learning Accismus, it would be great to have an example app base…
-
@ugermann @davidecaroselli @nicolabertoldi @mfederico
I have many indicators that there is something going wrong somewhere.
I was not able to spot the problem, but I report my findings and hope in y…
-
Tokenizer's Java process uses 15GB of RAM and keeps growing...
Data: 1B words en fr WMT task, (News+Europarl+UN+CommonCrawl)
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CO…
-
I'm attempting to run a job on EMR. The job fails because a JSONDecodeError is raised in the underlying code.
There are several places where this exception is raised. For example, it happens while the…
-
Hi,
regarding global index harvesting, is it possible to add extra tuning options and functionality for p2p and settings for all yacy peers (voluntarily) to contribute indexes for archiving p2p node…