bitextor / warc2text

Extracts plain text, language identification, and other metadata from WARC records
MIT License

Add `--jsonl` option #35

Closed: jelmervdl closed this 9 months ago

jelmervdl commented 1 year ago

This is mostly for doing some metadata analysis of the WARCs, but it could be a starting point for #34 as well.

For metadata I'm considering writing to Parquet directly, but since warc2text is run in parallel we'd still need to merge the Parquet files before doing any analysis. So maybe JSONL is sufficient for this stage, and we then ingest all of those files together into one large Parquet file for queries later.
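For that last ingest step, something like this would do (a rough sketch, not part of this PR; it assumes pandas with a Parquet engine installed, that the merged metadata fits in memory, and uses placeholder file names):

import glob
import pandas as pd

# merge per-shard JSONL metadata into one Parquet file for later queries
frames = [pd.read_json(path, lines=True) for path in sorted(glob.glob("metadata-*.jsonl.gz"))]
pd.concat(frames, ignore_index=True).to_parquet("metadata.parquet")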

Current output: each line contains a JSON object that consists of:

Todo:

I also want to make these new columns available to the original bitext output as possible arguments for -f.

--multilang is also supported for the CLD2 classifier. In that case you get multiple JSON lines per record, one for each identified language. The attributes that relate to the record itself are duplicated; only p, ps and l differ.

Usage:

> ll *.warc.gz
 Size Name
1.1Gi CC-MAIN-20221126080725-20221126110725-00000.warc.gz
1.1Gi WIDE-20171021194807-00260.warc.gz

> bin/warc2text --jsonl *.warc.gz | pigz -9c > metadata.jsonl.gz
[2023-02-17 14:41:02.338945] [info] Processing CC-MAIN-20221126080725-20221126110725-00000.warc.gz
[2023-02-17 14:42:02.002268] [info] Processing WIDE-20171021194807-00260.warc.gz
[2023-02-17 14:42:13.112524] [info] total records: 46660
[2023-02-17 14:42:13.112559] [info] text records: 44405
[2023-02-17 14:42:13.112567] [info] lang records: 40914
[2023-02-17 14:42:13.112574] [info] total bytes: 1456844861
[2023-02-17 14:42:13.112580] [info] text bytes: 328455828
[2023-02-17 14:42:13.112587] [info] lang bytes: 285338976
[2023-02-17 14:42:13.112593] [info] elapsed: 0h1m10s

> ll metadata.jsonl.gz
 Size Name
2.1Mi metadata.jsonl.gz

So roughly 2 GB of WARC yields about 2 MB of JSON lines.

Getting actual metadata from it:

> pigz -cd metadata.jsonl.gz | jq --raw-output .u | head
http://0337.maymay520.com/V4/?AID=164332&FID=1782326&WEBID=AVSHOW
http://064.ehiroba.jp/shopdetail/000000000660/ct91/page2/order/
http://095160170158.vectranet.pl/wiadomosci/item/12047-obchody-czerwca-76-z-rekomendacja-komisji-kultury
http://1118.cctv.com/2019/12/30/VIDErVoqOYK0GveK5J2BaDvq191230.shtml
http://114hzw.com/zhanhuipaiqi/industry/jiajujiaji/
http://120rcw.com/about/jinjia.html
http://123nu.dk/lystfiskeri/forum/registration_rules.asp?FID=0&SID=3cae74fcfz339af2f3f86321e46511e3
http://123stopfire.com/Fra/Fr_p1_01.html
http://1368.info/soi-cau-3-cang/
http://1801202223.djtom.cz/%D9%8A%D9%85%D9%83%D9%86-%D8%A3-%D9%8A%D9%83%D9%88%D9%86-%D9%86%D8%B8%D8%B1%D8%A7.html/
jelmervdl commented 1 year ago

There might be something wrong with the byte offsets. I've not yet been able to use something like tail -c +${OFFSET} | zless to jump to a particular record, and I've also not yet figured out why my offsets would be wrong.

Edit: I'm bad at reading. tail -c +N starts output at byte N, counted from 1, so I need to jump with tail -c +$((OFFSET + 1)) | zless, and tada, it works.
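The same jump can also be done programmatically; a minimal sketch, assuming each record is its own gzip member (which is standard for .warc.gz) and using a placeholder file name and offset:

import gzip

OFFSET = 123456  # placeholder: byte offset of a record, taken from the metadata

with open("CC-MAIN-20221126080725-20221126110725-00000.warc.gz", "rb") as f:
    f.seek(OFFSET)  # position at the start of the gzip member for this record
    with gzip.open(f, "rt", encoding="utf-8", errors="replace") as record:
        for _ in range(20):  # WARC headers plus the first few body lines
            print(record.readline().rstrip())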

Also, I'm not storing the compressed record size, which would be helpful when using dd to cut parts out of WARCs to create a selection. Technically you can look at the offset of the next record, but we also skip records that aren't interesting, so those differences are not always just the size of one record.

Also also, I would like to have some content hashing so we can detect (near-)duplicates from the metadata. Google used to use simhash back in the day to remove duplicate search results; not sure whether they have a better method these days. Anything multilingual would definitely be too expensive to run anyway.
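For illustration only, the textbook simhash construction is small enough to sketch here (not a proposal for the actual implementation): hash each token to 64 bits, let every bit cast a +1/-1 vote, and keep the sign, so near-duplicate texts end up with fingerprints at a small Hamming distance.

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # classic simhash: per-bit votes over hashed token features
    votes = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    # near-duplicate documents typically differ in only a few bits
    return bin(a ^ b).count("1")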

nvanva commented 11 months ago

Previously warc2text saved texts and URLs in parallel files, text.gz and url.gz, in directories with language codes as names. To save timestamps for documents we need to run warc2text with the --jsonl flag. This results in all texts and metadata being written to stdout, which breaks the current pipelines and requires modifications to further steps (probably writing additional scripts that do exactly what warc2text does without --jsonl, i.e. duplicating logic already implemented in warc2text?). An alternative may be running warc2text twice, with and without the --jsonl flag, but this requires 2x more time and disk space. At least for the purposes of filtering by robots.txt, would it be possible to have an option to just save timestamps in a file parallel to text.gz and url.gz?

jelmervdl commented 11 months ago

To save timestamps for documents we need to run warc2text with the --jsonl flag.

You can use -f text,url,date to also save the timestamps to a date.gz file, alongside text.gz and url.gz.

This bit isn't entirely clear from the pull request itself, but the updated README shows that I've also added options to produce all the new metadata in the old output format.

Be sure to still run with ulimit -n unlimited (or something really high) when using the bitextor output format.

jelmervdl commented 11 months ago

I'm having some reservations about whether the JSON output is valid UTF-8 (I'm processing some of the output with Python and noticing issues). None of the code should produce invalid UTF-8 as far as I can tell, but … keep an eye out for this when reviewing. I'll also look a bit more into it.
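A quick throwaway check for this while reviewing (a sketch; it reads the stream as raw bytes so the check itself can't trip over the decoding):

import sys

# usage: gzip -cd metadata.jsonl.gz | python3 thisscript.py
# reports every line that is not valid UTF-8
for lineno, raw in enumerate(sys.stdin.buffer, start=1):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print(f"line {lineno}: {err}", file=sys.stderr)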

akutuzov commented 10 months ago

Hi,

Will this PR be merged?

ZJaume commented 10 months ago

Trying this branch I've seen a noticeable speed regression. If it is meant to be like this because of some feature, it is still a speed that we can afford, I guess, but I wanted to point this out in case there is something badly optimized.

Batch: WIDE-20121227150417-crawl413

WARC ID / Method                | 150417 | 153314 12260 | 154314 12261 | 155838 | 161939 | 165509 | 171541 | 172947
warc2text master                | 60s    | 48s          | 42s          | 30s    | 48s    | 34s    | 11s    | 31s
warc2text metadata-only         | 130s   | 105s         | 88s          | 65s    | 99s    | 74s    | 11s    | 66s
warc2text metadata-only --jsonl | 121s   | 99s          | 80s          | 60s    | 93s    | 69s    | 11s    | 51s

The full command:

./warc2text_json/build/bin/warc2text \
--classifier fasttext --fasttext-model lid218e.bin \
--url-filters warc2text-runner/url-filter-list.optimised \
--tag-filters warc2text-runner/mt-filter-list.annotated \
--paragraph-identification -o text/ \
--silent --jsonl $i >/dev/null
jelmervdl commented 10 months ago

Did you compare fasttext to fasttext, and the non-jsonl command without --jsonl? --jsonl takes precedence over --output/-o.

I ran it locally, with a fresh checkout of master and this branch (with the last changes to master merged in, so the same fasttext), and all speeds are pretty comparable for me:

branch: master, bitext
real    6m34.636s
user    6m30.540s
sys 0m3.178s

branch: metadata-only, bitext
real    6m37.463s
user    6m32.968s
sys 0m3.247s

branch: metadata-only, jsonl
real    6m11.547s
user    6m19.867s
sys 0m3.391s

Benchmark I ran (a single run on my battery-powered laptop, but the laptop is not throttling or anything, so I trust it):

#!/bin/bash
set -euo pipefail
ulimit -f unlimited

profile() {
  # run warc2text from the given build with a shared set of flags;
  # any extra arguments are appended to the command line
  local prog=$1
  local output=$2
  shift 2
  $prog \
    --classifier fasttext \
    --fasttext-model lid201-model.ftz \
    --url-filters ../warc2text-runner/url-filter-list.optimised \
    --tag-filters ../warc2text-runner/mt-filter-list.annotated \
    --paragraph-identification \
    --output $output \
    --silent WIDE-20180405202949-00696.warc.gz \
    "$@"
}

echo "branch: master, bitext"
rm -rf out-main
time profile ../warc2text/build-master/bin/warc2text out-main/

echo "branch: metadata-only, bitext"
rm -rf out-json
time profile ../warc2text/build/bin/warc2text out-json/

echo "branch: metadata-only, jsonl"
time profile ../warc2text/build/bin/warc2text out-json --jsonl | gzip -9c > out-json.jsonl.gz

Edit: for fun, output sizes!

du -sh out-*
104M    out-main
104M    out-json
 52M    out-json.jsonl.gz
jelmervdl commented 10 months ago

However, this issue still exists:

gzip -cd out-json.jsonl.gz | python3 -c "import json,sys
for line in sys.stdin:
    json.loads(line)
"

Gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5022-5023: invalid continuation byte

Edit: this seems to be the case because the JSON output contains the entire payload under the p key, which isn't necessarily UTF-8 because the logic about how and when to convert which bit of data is pretty messy: https://github.com/bitextor/warc2text/blob/8be93933d2913fac67706066bd999b60ad9fa590/src/record.cc#L222-L243 (Note that extracted is a temporary variable in this snippet; the payload is the p key in the JSON and the plaintext is the t key.)

(Also mentioning #48 here, but that's not a solution since valid JSON always has to be valid UTF-8 and apparently the Boost library I'm using does not guarantee that bit, i.e. uses escape sequences to encode the invalid byte sequence.)

ZJaume commented 10 months ago

The speed regression seems to be solved now; syncing with master brought back the fast execution times.

Regarding the invalid UTF-8, I think JSON does not change things much compared to what we had before. The error you are getting with Python is not exactly a JSON parsing error: json.loads will always receive valid UTF-8 in that loop, because iterating over sys.stdin already does an implicit .decode('utf8'). The exception is therefore thrown by sys.stdin, probably because your environment has errors="strict" as the default (?).

If I do:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    json.loads(i.strip())'

because in the env I use (I don't know if this is a default difference between Mac and Linux, or if it depends on other things), errors="surrogateescape" is the default. It gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 1870: invalid continuation byte

But without the reconfigure (because surrogateescape is the default in my env), or when explicitly using errors="replace" or errors="surrogateescape", I don't get any error.

Also, if I just read the input without any JSON parsing, or read from the base64 input, I get the same decoding errors:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'

zcat text/*/text.gz | base64 -d | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 4432: invalid continuation byte

So, in the end, I think it depends on what the downstream tool decides to do. For example, jq is able to parse all the JSON I've generated because it replaces the invalid characters with the surrogate. Probably the invalid-UTF-8 handling discussion can be moved somewhere other than this PR.
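For downstream code that shouldn't abort on a stray byte, decoding explicitly before parsing works; a sketch, where errors="replace" turns invalid bytes into U+FFFD:

import json
import sys

for raw in sys.stdin.buffer:
    # decode leniently first, then parse; invalid bytes become U+FFFD
    doc = json.loads(raw.decode("utf-8", errors="replace"))
    # ... use doc ...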

ZJaume commented 9 months ago

Although we may not use the output exactly as it is designed here, the PR seems stable enough and it doesn't interfere with the Bitextor format, so I'm merging it.