bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

All URLs are filtered out when the file specified in --url-filters doesn't exist #53

Closed: nvanva closed this issue 7 months ago

nvanva commented 8 months ago

warc2text -f url,html --skip-text-extraction --classifier skip --url-filters non-existing-file -o tmp ~/hplt/one/warc/cc/CC-MAIN-2022-40/CC-MAIN-20220924151538-20220924181538-00000.warc.gz

results in all URLs being filtered out. It would probably be better to exit with an error if the specified file doesn't exist.
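
To illustrate the suggested behaviour, here is a minimal C++ sketch of a fail-fast loader. It is not warc2text's actual code: the function name `load_url_filters` is hypothetical, and the one-pattern-per-line file format is an assumption made for the example.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not taken from warc2text: reads one filter pattern
// per line (assumed format) and throws if the file cannot be opened,
// instead of silently continuing with an unusable filter set.
std::vector<std::string> load_url_filters(const std::string& path) {
    std::ifstream in(path);
    if (!in.is_open()) {
        throw std::runtime_error("cannot open URL filters file: " + path);
    }

    std::vector<std::string> filters;
    std::string line;
    while (std::getline(in, line)) {
        // Skip blank lines so they never become accidental patterns.
        if (!line.empty()) {
            filters.push_back(line);
        }
    }
    return filters;
}
```

A caller could catch the exception at startup, print the message to stderr and return a non-zero exit code, so a mistyped --url-filters path is reported before any records are processed.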