commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

WARC files unreadable #31

Closed AmrSheta22 closed 1 year ago

AmrSheta22 commented 1 year ago

I cannot read the file and I get an error like this:

ArchiveLoadFailed: Unknown archive format, first line: ['\\\x03$YgfÍZèÞç\x9cïËè\x1f1&Y\x995\x97÷\x9dç}\x9eç\x0bÝRÝ"Ñ\x0bz\x01\x1f¿\x15\x10\xa0Gñ\x0ey_£³!w©³6\x1aï\'\x81\x8dFÿ-\x84\x05

sebastian-nagel commented 1 year ago

Hi @AmrSheta22, could you add the code you are executing, a short description of what it should do, the complete path to the WARC file being read, and the full (Java) stack trace?

AmrSheta22 commented 1 year ago

Hi @sebastian-nagel, it turns out the WARC file I was reading is corrupted. Sorry for the hassle.
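
For anyone hitting the same error, a quick way to rule out a damaged download is to check that the file decompresses cleanly before handing it to a WARC parser. The sketch below is not from this thread: it assumes the file is a gzip-compressed WARC (`.warc.gz`), uses only the Python standard library, and the function name `gzip_is_readable` is hypothetical. It checks the gzip layer only, not the WARC record structure.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Return True if every gzip member in `path` decompresses cleanly.

    Hypothetical helper: it validates the gzip layer only and does not
    parse WARC records.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so truncation and checksum errors surface.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError; EOFError signals a
        # truncated stream; zlib.error signals corrupt compressed data.
        return False
    return True
```

If this returns `False`, downloading the file again is probably the first thing to try. If it returns `True` and the parser still fails, the problem is more likely in the records themselves or in how the file is being opened.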

sebastian-nagel commented 1 year ago

Thanks for the update, @AmrSheta22!