
TableBank: A Benchmark Dataset for Table Detection and Recognition
Apache License 2.0

How to extract the dataset? #34

Closed by GTimothee 2 years ago

GTimothee commented 2 years ago

I downloaded the dataset parts but I cannot manage to extract the files correctly.

I tried several of the commands suggested here: https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux

The only method that worked was this one:

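# Concatenate the downloaded parts into a single file
cat test.zip.* > test.zip
# Rebuild the archive structure (-FF scans the whole file for entries)
zip -FF test.zip --out test-full.zip
# Extract the repaired archive
unzip test-full.zip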

However, after extraction, one of the annotation JSON files is broken: it was not extracted correctly.
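
To tell a damaged archive from a damaged extraction, one option is to test the repaired archive and then try to parse each extracted JSON file. This is only a minimal sketch: it assumes `python3` is installed, and the helper name `check_json` is made up for illustration.

```shell
# Hypothetical helper: report whether each given file parses as JSON.
# Returns non-zero if at least one file fails to parse.
check_json() {
    status=0
    for f in "$@"; do
        # json.tool exits non-zero when the input is not valid JSON
        if python3 -m json.tool "$f" > /dev/null 2>&1; then
            echo "OK      $f"
        else
            echo "BROKEN  $f"
            status=1
        fi
    done
    return "$status"
}
```

Running `unzip -t test-full.zip` first checks the archive's stored CRCs without extracting anything. After that, something like `check_json path/to/annotations/*.json` (placeholder path) shows which files are truncated or malformed.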

Could someone share how they extracted the dataset?

GTimothee commented 2 years ago

Some of the files had not been fully downloaded. By the way, it would be useful to publish a hash for each file so that users can verify their downloads are not corrupted.
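
The hashing idea could be sketched roughly as follows. It assumes GNU `sha256sum` is available. Nothing in this thread mentions published checksums for the dataset, so the helper names and the manifest name `parts.sha256` are hypothetical. Someone with a known-good copy would write the manifest, and everyone else would verify against it.

```shell
# Hypothetical helper: record a SHA-256 hash for every file listed
# after the manifest name.
write_manifest() {
    manifest=$1
    shift
    sha256sum "$@" > "$manifest"
}

# Hypothetical helper: re-hash the files named in the manifest and
# fail if any file is missing or its contents differ.
verify_manifest() {
    sha256sum -c "$1"
}
```

For example, `write_manifest parts.sha256 test.zip.*` on the good copy, then `verify_manifest parts.sha256` in the download directory. The manifest stores file names as they were given, so verification has to run from the directory that holds the parts.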

MaveriQ commented 1 year ago

Hi. I am getting the same errors despite downloading the dataset multiple times. Were you able to fix the errors with the zip files? Thanks.