pdf files not included in the dataset

doc-analysis / DocBank

DocBank: A Benchmark Dataset for Document Layout Analysis

Apache License 2.0

pdf files not included in the dataset #31

Open Apurv3377 opened 3 years ago

Apurv3377 commented 3 years ago

I have been working on DocBank_samples for a month now. Today I downloaded the main dataset from OneDrive and could not find any PDF files. Would it be possible to provide the PDF files too?

I appreciate the help!

iiLaurens commented 3 years ago

I also wonder why no PDFs are provided. Providing only images + annotations restricts the kinds of models that can be built. What if I want a model that takes PDFs as input rather than images?

Apurv3377 commented 3 years ago

Actually, I am also fine if only the non-colored PDFs are provided. :) But yes, the annotated ones would be more helpful.

liminghao1630 commented 3 years ago

In fact, the PDFs are derived from arXiv papers from 2014-2018, but we currently have no plans to provide the corresponding PDFs. Sorry about this.

andreagemelli commented 3 years ago

Hello @liminghao1630, thank you for your project. About the PDFs, are you planning to publish at least the code for preprocessing them? If I take a quick look at an original paper corresponding to your images, I see some mismatches between their parts, e.g. images on different pages (and so bounding boxes and annotations in general).

mattiasstahre commented 2 years ago

Does anyone know if the names of the PDFs are available? In that case I guess one could build a pipeline to download them.

jfreyberg commented 1 year ago

@liminghao1630 It's unfortunate that you do not publish the original files, as it prevents PDF-based models from using this dataset.

@mattiasstahre I think that from looking at the names of the .txt and .jpg files in the dataset one can identify the arXiv URL:

1.tar_1401.0098.gz_TachyonPotentialsV12-PRD-enviado_8.txt => https://arxiv.org/abs/1401.0098 (Title: Tachyon potentials from a supersymmetric FRW model)

1.tar_1501.00050.gz_Godoy-Diana_etal_2014_Enzo_Levi_Workshop_4_ori.jpg => https://arxiv.org/abs/1501.00050 (Title: Four-winged flapping flyer in forward flight)

The page number is also available from the filename (0 indexed, so 4 is the 5th page).
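The filename-to-URL mapping described here could be sketched roughly as follows. This is only a guess based on the two example names in this thread: the regular expression, the `DocBankPage` tuple, and the `parse_docbank_name` helper are all hypothetical, and the assumed layout (`<n>.tar_<arXiv id>.gz_<source name>_<page>[_ori].<txt|jpg>`) may not hold for every file in the dataset.

```python
import re
from typing import NamedTuple, Optional

# Assumed filename layout, inferred only from the two examples in this thread:
#   <n>.tar_<arXiv id>.gz_<source file name>_<page index>[_ori].<txt|jpg>
_NAME_PATTERN = re.compile(
    r"^\d+\.tar_"                      # archive prefix, e.g. "1.tar_"
    r"(?P<arxiv_id>\d{4}\.\d{4,5})"    # new-style arXiv id, e.g. 1401.0098
    r"\.gz_.+_"                        # ".gz_" plus the original source name
    r"(?P<page>\d+)"                   # 0-indexed page number
    r"(?:_ori)?\.(?:txt|jpg)$"         # optional "_ori" suffix and extension
)


class DocBankPage(NamedTuple):
    """Hypothetical container for what the filename seems to encode."""
    arxiv_id: str
    page_index: int  # 0-indexed, so 4 means the 5th page
    abs_url: str


def parse_docbank_name(filename: str) -> Optional[DocBankPage]:
    """Return the arXiv id, page index and abstract URL, or None if no match."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    arxiv_id = match.group("arxiv_id")
    return DocBankPage(
        arxiv_id=arxiv_id,
        page_index=int(match.group("page")),
        abs_url=f"https://arxiv.org/abs/{arxiv_id}",
    )


if __name__ == "__main__":
    for name in (
        "1.tar_1401.0098.gz_TachyonPotentialsV12-PRD-enviado_8.txt",
        "1.tar_1501.00050.gz_Godoy-Diana_etal_2014_Enzo_Levi_Workshop_4_ori.jpg",
    ):
        print(parse_docbank_name(name))
```

Names that do not fit the assumed layout return `None` instead of raising, so a crawler could log and skip them. Given the page mismatches mentioned earlier in this thread, the page index may not always line up with the PDF currently served by arXiv.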

I might build a crawler for this if I get the time.