applicaai / CCpdf

Index of URLs to pdf files all over the internet and scripts
MIT License

PDFs download #2

Open malteos opened 1 year ago

malteos commented 1 year ago

Hi,

thanks for sharing this project. Will the actual PDF dataset be made available as well? Or is there any other way to avoid rerunning the whole pipeline again?

Best, Malte

SushantDaga commented 1 year ago

I think the pipeline mentioned in the paper is not provided (yet; 🤞 maybe the authors will release it at some point).

For now, what is provided is the final index and a script to download the URLs that made the cut after running the pipeline on the May 2022 CC dataset.
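
To illustrate what such a download step can look like, here is a minimal sketch. It is not the repository's actual script. It assumes the index has been exported to plain text with one URL per line, which may not match the real index format. The names `iter_urls`, `target_path`, `fetch`, and `download_all` are hypothetical helpers made up for this example.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield URLs from an index, skipping blank lines and '#' comments.

    Assumption: one URL per line; the real index may be structured differently.
    """
    for line in lines:
        url = line.strip()
        if url and not url.startswith("#"):
            yield url


def target_path(url: str, out_dir: Path) -> Path:
    """Derive a stable file name from the URL so reruns can skip finished files."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return out_dir / f"{digest}.pdf"


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw bytes behind a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    out_dir: Path,
    fetcher: Callable[[str], bytes] = fetch,
) -> List[Path]:
    """Download every URL into out_dir and return the paths written in this run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for url in urls:
        path = target_path(url, out_dir)
        if path.exists():
            continue  # already downloaded in an earlier run
        try:
            data = fetcher(url)
        except (OSError, ValueError):
            continue  # dead link, timeout, or malformed URL: skip it
        if not data.startswith(b"%PDF"):
            continue  # the server returned something other than a PDF
        path.write_bytes(data)
        saved.append(path)
    return saved
```

Usage would be along the lines of `download_all(iter_urls(open("urls.txt")), Path("pdfs"))`, where `urls.txt` is a placeholder file name. Many links from a 2022 crawl are likely dead by now, which is why failures are skipped rather than raised. The error handling here is not exhaustive.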

tballison commented 1 year ago

May be of interest? https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/

SushantDaga commented 1 year ago

Thanks @tballison! This is definitely of interest.

@malteos @tballison want to join forces and build an open-source replication of the CCpdf pipeline?

tballison commented 1 year ago

Always happy to collaborate!

MichalTurski commented 1 year ago

We shared all the data and code we could while complying with our company's data policy. Personally, I am keeping my fingers crossed for your open-source replication of the pipeline (I hope the paper will be useful to you)!

I will keep this thread open for future discussions on pipeline replication and access to PDFs from other crawls.

SushantDaga commented 1 year ago

Apologies for the delay. I have started work on replicating the CC-PDF pipeline and results in this repo: https://github.com/SushantDaga/ThePDFCorpus

Any contribution will be greatly appreciated :)