DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
I thought I had already reported this, but apparently not. Currently the character set detection hands ICU all of the bytes it has read from the input stream. When ICU is given a stream, it limits itself to the first 8K bytes, since that should be enough to determine the character encoding; when it is handed a buffer instead, it examines the entire thing. For very large documents this is inefficient without improving accuracy.
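A minimal sketch of the kind of fix, assuming the detection goes through ICU4J's `CharsetDetector`; the `MAX_DETECTION_BYTES` constant, the 8,000-byte cap, and the `detectCharset` helper are illustrative assumptions, not the project's actual code:

```java
import java.util.Arrays;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetDetection
{
    // ICU only needs a prefix of the document to guess the encoding;
    // 8,000 bytes mirrors the limit ICU applies to streams (assumed value).
    private static final int MAX_DETECTION_BYTES = 8000;

    public static String detectCharset(byte[] bytes)
    {
        // Truncate large buffers so ICU does not scan the whole document.
        byte[] sample = bytes.length > MAX_DETECTION_BYTES
                ? Arrays.copyOf(bytes, MAX_DETECTION_BYTES)
                : bytes;

        CharsetDetector detector = new CharsetDetector();
        detector.setText(sample);
        CharsetMatch match = detector.detect();
        return match != null ? match.getName() : null;
    }
}
```

Truncating the buffer before calling `setText(byte[])` gives the same behavior ICU already has for streams, so accuracy should be unchanged while the work per document is bounded.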