warc2text

Extracts plain text, language identification and more metadata from WARC records

Download

Clone this repo along with submodules:

git clone --recurse-submodules https://github.com/bitextor/warc2text.git

Or:

git clone https://github.com/bitextor/warc2text.git
git submodule update --init --recursive

Install dependencies

On Debian/Ubuntu/Mint:

apt-get install build-essential cmake libuchardet-dev libzip-dev libboost-thread-dev libboost-regex-dev libboost-filesystem-dev libboost-log-dev libboost-iostreams-dev libboost-locale-dev libboost-program-options-dev

On Mac:

brew install uchardet libzip

Compile

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..
# cmake .. -DCMAKE_BUILD_TYPE=Debug # for debug
# cmake .. -DICU_ROOT_DIR=$(brew --prefix icu4c)/lib # for macOS
make -j
make install

Alternative installation with EasyBuild

On a node with EasyBuild installed, you can install warc2text as a module:

eb --robot easyconfigs/uchardet-0.0.7-foss-2021a.eb 
eb --robot easyconfigs/nlpl-warc2text-1.2.0-foss-2021a.eb
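
Once built, the resulting module can be loaded as usual; the module name below is an assumption derived from the easyconfig file name and may differ on your system:

module load nlpl-warc2text/1.2.0-foss-2021a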

Usage

Note: for WARCs containing many languages you might hit the open file limit quite quickly, so it is advisable to raise it, e.g. ulimit -n 8192.

warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
          [ --paragraph-identification ] [ --tag-filters <filters_file> ] <warc_file>...
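
For example, to extract two WARC files into per-language folders under extracted/ while passing PDF records through to a separate WARC (all paths here are illustrative):

ulimit -n 8192    # raise the open file limit first, as noted above
warc2text -o extracted/ --pdfpass pdfs.warc.gz crawl-part-00000.warc.gz crawl-part-00001.warc.gz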

Output

When used with --output/-o (optionally combined with --files/-f), warc2text produces the following directory structure at the path specified by --output:
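
The exact set of files is controlled by --files/-f; a rough sketch, showing only the files referenced elsewhere in this README, looks like this:

<output_folder>/
  {lang}/
    text.gz   # plain text, base64-encoded, one record per line
    url.gz    # URL of each record, one per line
    ...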

Lines in these files correspond to each other across files: e.g. the fifth line of text.gz and the fifth line of url.gz give you the text and URL of the same record.

The {lang} part of the path is determined by the classifier (see --classifier) and may be a two-letter or three-letter code, depending on the classifier used; see this list for CLD2. When language identification is skipped with --classifier skip, all files are written directly to the output folder, without language-specific subfolders.

When using --compression zstd, the file suffix will be .zst instead of .gz.
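
For instance, to skip language identification and write zstd-compressed files directly into the output folder (paths illustrative):

warc2text -o extracted/ --classifier skip --compression zstd crawl-part-00000.warc.gz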

JSONL

When using --jsonl, the output files that were previously base64-encoded are instead written with one JSON record per line, using the key "h" for the HTML file and "p" for the text file.
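
For illustration, a line of the text output might then look roughly like this (the content is invented):

{"p": "Plain text extracted from the record"}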

stdout

Instead of the classic Bitextor directory structure and files, the --jsonl option can be combined with --stdout to write all the output to stdout, with the following keys (always in this order):

{
  f:  string, # filename of the WARC file (same as the `{filename}` part in `file.gz`)
  o:  number, # byte offset of the record in the WARC file (same as `{offset}` in `file.gz`)
  s:  number, # WARC record size (same as `{size}` in `file.gz`)
  rs: number, # byte size of the record payload (uncompressed)
  ps: number, # byte size of the text-only payload (compare this against `rs` to get the amount of HTML removed)
  l:  string, # language identified by the classifier; omitted when language identification is skipped
  u:  string, # URL
  c:  string, # content type as reported by the HTTP response header (or the WARC record header if that is not present)
  ts: string, # crawl date/time as reported by the crawler
  p:  string, # plain text
}

More keys might be added in the future (e.g. the raw HTML is currently not included in the JSONL written to stdout), and you should not expect the order of the keys to stay the same across versions of warc2text.
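
For example, assuming jq is available, the detected language and URL of each record can be pulled out of the stream like this (paths illustrative; depending on the version, other options may still be required):

warc2text --jsonl --stdout crawl-part-00000.warc.gz | jq -r '[.l, .u] | @tsv'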

Included dependencies

HTML Tokenizer by c-smile

HTML entities decoder by Christoph Gärtner

Charset detection using uchardet

Zip support for the Open Document Format using libzip


Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.