jelmervdl closed this pull request 9 months ago
There might be something wrong with the byte offsets. I've not yet been able to use something like `tail -c +${OFFSET} | zless` to jump to a particular record, and I've also not yet figured out why my offsets would be wrong.

Edit: I'm bad at reading. `tail -c +N` starts at byte N counting from 1, so for a 0-based offset I need to jump with `tail -c +$((OFFSET + 1)) | zless`, and tada, it works.
Also, I'm not storing the compressed record size, which would be helpful when `dd`-ing parts of warcs to create a selection. Technically you can look at the offset of the next record, but we're also skipping records that aren't interesting, so those differences are not always just the size of one record.
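Not part of this PR, but for illustration: since each record in a `.warc.gz` is typically its own gzip member, a single record can already be pulled out with just its byte offset by decompressing exactly one member starting at that point. A minimal Python sketch, assuming that layout (the file name and offset below are made up):

```python
# Sketch: extract one record from a multi-member .warc.gz given only its offset.
import zlib

def read_record(warc_path, offset, chunk_size=65536):
    """Return the uncompressed bytes of the gzip member starting at `offset`."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect a gzip header
    out = []
    with open(warc_path, "rb") as f:
        f.seek(offset)
        while not d.eof:  # stop at the end of this gzip member
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(d.decompress(chunk))
    return b"".join(out)

# Usage (hypothetical file name and offset):
# record = read_record("WIDE-20180405202949-00696.warc.gz", 123456)
# print(record[:200].decode("utf-8", errors="replace"))
```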
Also also I would like to have some content hashing so we can detect (near?) duplicates from the metadata. Google used to use simhash back in the day to remove duplicate search results. Not sure whether they have a better method these days. Definitely anything multilingual would be too expensive to run anyway.
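To make the simhash idea concrete, here is a rough sketch of how such a fingerprint could be computed from the extracted text; nothing like this exists in warc2text yet, and the whitespace tokenisation and 64-bit hash are arbitrary choices:

```python
# Hypothetical simhash sketch, not implemented in warc2text.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Weighted bitwise vote over hashes of whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest(), "big"
        )
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A set bit means more tokens voted 1 than 0 at that position.
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Documents whose fingerprints differ in only a few bits (e.g. <= 3 of 64)
# are likely near-duplicates.
```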
Previously warc2text saved texts and urls in parallel files `text.gz` and `url.gz`, in directories with language codes as names. To save timestamps for documents we need to run warc2text with the `--jsonl` flag. This results in all texts and meta information just being written to stdout. This breaks the current pipelines and requires modifying further steps (probably writing additional scripts doing exactly what warc2text does without `--jsonl`, i.e. duplicating logic already implemented in warc2text?). An alternative may be running warc2text twice, with and without the `--jsonl` flag, but this requires 2x more time and disk space. At least for the purposes of filtering by robots.txt, would it be possible to have an option of just saving timestamps in a file parallel to `text.gz` and `url.gz`?
> To save timestamps for documents we need to run warc2text with the `--jsonl` flag.
You can use `-f text,url,date` to also save the timestamps to a `date.gz` file.
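For the robots.txt filtering use case, reading such a `date.gz` alongside the existing files could look roughly like this (a sketch, assuming the files stay line-aligned per language directory and that `text.gz` lines are base64-encoded documents, as in the current layout):

```python
# Sketch only: assumes line-aligned text.gz / url.gz / date.gz per language dir.
import base64
import gzip

def read_documents(lang_dir: str):
    with gzip.open(f"{lang_dir}/text.gz", "rt") as texts, \
         gzip.open(f"{lang_dir}/url.gz", "rt") as urls, \
         gzip.open(f"{lang_dir}/date.gz", "rt") as dates:
        for text, url, date in zip(texts, urls, dates):
            yield (
                base64.b64decode(text.strip()).decode("utf-8", errors="replace"),
                url.strip(),
                date.strip(),
            )

# for doc, url, date in read_documents("text/en"):
#     ...  # e.g. filter documents by their crawl date
```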
This bit isn't entirely clear from the pull request, but the updated readme shows that I've also added options to produce all the new metadata in the old format.

Be sure to still run with `ulimit -n unlimited`, or something really high, when using the bitextor output format.
I'm having some reservations about whether the JSON output is valid UTF-8 (as I'm processing some of the output with Python and noticing issues). None of the code should produce invalid utf-8 as far as I can tell, but … keep an eye out for this when reviewing. I'll also look a bit more into that.
Hi,
Will this PR be merged?
Trying this branch, I've seen a remarkable regression in speed. If it is meant to be like this because of some feature, it is still a speed that we can afford, I guess. But I wanted to point this out in case there is something badly optimized.
Batch: WIDE-20121227150417-crawl413
| WARC ID / Method | 150417 | 153314 12260 | 154314 12261 | 155838 | 161939 | 165509 | 171541 | 172947 |
|---|---|---|---|---|---|---|---|---|
| warc2text master | 60s | 48s | 42s | 30s | 48s | 34s | 11s | 31s |
| warc2text metadata-only | 130s | 105s | 88s | 65s | 99s | 74s | 11s | 66s |
| warc2text metadata-only --jsonl | 121s | 99s | 80s | 60s | 93s | 69s | 11s | 51s |
The full command:

```bash
./warc2text_json/build/bin/warc2text \
    --classifier fasttext --fasttext-model lid218e.bin \
    --url-filters warc2text-runner/url-filter-list.optimised \
    --tag-filters warc2text-runner/mt-filter-list.annotated \
    --paragraph-identification -o text/ \
    --silent --jsonl $i >/dev/null
```
Did you compare fasttext to fasttext, and the non-jsonl command without `--jsonl`? `--jsonl` takes precedence over `--output`/`-o`.
I ran it locally, with a fresh checkout of master and this branch (with the last changes to master merged in, so same fastertext) and all speeds are pretty comparable for me:
```
branch: master, bitext
real    6m34.636s
user    6m30.540s
sys     0m3.178s

branch: metadata-only, bitext
real    6m37.463s
user    6m32.968s
sys     0m3.247s

branch: metadata-only, jsonl
real    6m11.547s
user    6m19.867s
sys     0m3.391s
```
The benchmark I ran (a single run on my battery-powered laptop, but the laptop is not throttling or anything, so I trust it):
```bash
#!/bin/bash
set -euo pipefail

ulimit -f unlimited

profile() {
    local prog=$1
    local output=$2
    shift 2
    $prog \
        --classifier fasttext \
        --fasttext-model lid201-model.ftz \
        --url-filters ../warc2text-runner/url-filter-list.optimised \
        --tag-filters ../warc2text-runner/mt-filter-list.annotated \
        --paragraph-identification \
        --output $output \
        --silent WIDE-20180405202949-00696.warc.gz \
        "$@"
}

echo "branch: master, bitext"
rm -rf out-main
time profile ../warc2text/build-master/bin/warc2text out-main/

echo "branch: metadata-only, bitext"
rm -rf out-json
time profile ../warc2text/build/bin/warc2text out-json/

echo "branch: metadata-only, jsonl"
time profile ../warc2text/build/bin/warc2text out-json --jsonl | gzip -9c > out-json.jsonl.gz
```
Edit: for fun, output sizes!

```
du -sh out-*
104M    out-main
104M    out-json
52M     out-json.jsonl.gz
```
However, this issue still exists:

```bash
gzip -cd out-json.jsonl.gz | python3 -c "import json,sys
for line in sys.stdin:
    json.loads(line)
"
```
Gives:

```
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5022-5023: invalid continuation byte
```
Edit: this seems to be the case because the JSON output contains the entire payload under the `p` key, which doesn't need to be UTF-8, because the logic about how and when to convert which bit of data is pretty messy:

https://github.com/bitextor/warc2text/blob/8be93933d2913fac67706066bd999b60ad9fa590/src/record.cc#L222-L243

(Note that `extracted` is a temp var in this snippet, the `payload` is the `p` key in JSON and `plaintext` is the `t` key.)
(Also mentioning #48 here, but that's not a solution since valid JSON always has to be valid UTF-8 and apparently the Boost library I'm using does not guarantee that bit, i.e. uses escape sequences to encode the invalid byte sequence.)
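As a quick sanity check on that point, one could count how many JSONL lines are not valid UTF-8 at the byte level, which is exactly what makes them invalid JSON in the strict sense. A sketch, reusing the `out-json.jsonl.gz` file from the benchmark above:

```python
# Count JSONL lines whose raw bytes do not decode as strict UTF-8.
import gzip

bad = total = 0
with gzip.open("out-json.jsonl.gz", "rb") as f:
    for raw in f:
        total += 1
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad += 1
print(f"{bad}/{total} lines contain invalid UTF-8")
```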
The speed regression seems to be solved now. Syncing with master brought back fast execution times.
Regarding the invalid UTF-8, I think JSON does not change things much compared to what we had before. The error that you are getting with Python is not exactly a JSON parsing error. The `json.loads` will always receive valid UTF-8 in that loop, because iterating over `sys.stdin` already calls an implicit `.decode('utf8')`. That exception is therefore thrown by `sys.stdin`, probably because your environment has `errors="strict"` as the default (?).
If I do the following, forcing `errors="strict"` (because in the env I use, `errors="surrogateescape"` is the default; I don't know if this is a default difference between Mac and Linux, or depends on other things):

```bash
zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    json.loads(i.strip())'
```

it gives:
```
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 1870: invalid continuation byte
```
But without `reconfigure` (because `surrogateescape` is the default in my env), or explicitly using `errors="replace"` or `errors="surrogateescape"`, I don't get any error.
Also, if I just read the input without any JSON parsing, or read from the base64 input, I get the same decoding errors:

```bash
zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'

zcat text/*/text.gz | base64 -d | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'
```
```
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 4432: invalid continuation byte
```
So, in the end, I think it depends on what the downstream tool decides to do. For example, `jq` is able to parse all the JSONs I've generated because it replaces the invalid characters with the surrogate. Probably the invalid UTF-8 handling discussion can be moved somewhere other than this PR.
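For what it's worth, a downstream consumer in Python can sidestep the issue the same way, by decoding the raw bytes itself before parsing. A sketch, not something this PR ships:

```python
# Read JSONL from stdin as raw bytes and decode leniently, so invalid byte
# sequences in the payload cannot abort the whole run.
import json
import sys

for raw in sys.stdin.buffer:
    doc = json.loads(raw.decode("utf-8", errors="replace"))
    # ... use doc["u"], doc["p"], etc.
```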
Although we may not use the output as it is designed here, the PR seems to be stable enough and it doesn't interfere with the Bitextor format, so I'm merging it.
This is mostly to do some metadata analysis of the warcs, but could be a starting point for #34 as well.
For metadata I'm considering trying out writing to parquet directly. But since warc2text is run in parallel, we'd still need to merge parquet files together before doing any analysis. So maybe jsonl is sufficient for this stage, and then we ingest all of those together into a massive parquet file for queries later.
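That ingestion step could look roughly like this (a sketch assuming a pandas/pyarrow route, made-up shard file names, and that the JSONL is valid UTF-8 by then):

```python
# Hypothetical merge of per-shard JSONL outputs into a single parquet file.
import glob
import pandas as pd

frames = [
    pd.read_json(path, lines=True, compression="gzip")
    for path in glob.glob("out-*.jsonl.gz")  # made-up shard naming
]
pd.concat(frames, ignore_index=True).to_parquet("metadata.parquet")
```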
Current output: each line contains a JSON object that consists of:

- `f`: filename of warc file
- `o`: byte offset of record in warc file
- `s`: warc file record size
- `rs`: byte size of record payload (uncompressed)
- `ps`: byte size of text-only payload (so compare this against `rs` and you should get the amount of HTML removed)
- `l`: identified language by classifier
- `u`: url
- `c`: content type as reported by the HTTP response header (or warc record header if that isn't present)
- `p`: plain text

Todo:

- `ts`: crawl date as found in the record header (no date normalisation or anything)
- `pt`: per paragraph/line in `p`, the most nested tag it was found in. Should this be an array of strings? Or a string separated by newlines to match `p`?
- ~~`pi`: paragraph identifiers as normally produced by `get_paragraph_id()`. Same question as for `pt`, or even just keep this function as-is and add the paragraph identifiers inside `p`, which is a real mess but might be easiest for compatibility?~~ Moving these things to #46.

I also want to make these new columns available to the original bitext output as possible arguments for `-f`.

`--multilang` is also supported for the CLD2 classifier. In that case you'd get multiple json lines per record, one for each identified language. The attributes that relate to the record itself will be duplicated; only `p`, `ps` and `l` differ.

Usage:
So 2 GB of warc yields about 2 MB of jsonlines.
Getting actual metadata from it: