centic9 / CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or MIME types. This is used for mass-testing of frameworks like Apache POI and Apache Tika.
BSD 2-Clause "Simplified" License
61 stars, 20 forks

Maximum file size #13

Closed. ankitmodi10 closed this issue 6 years ago.

ankitmodi10 commented 6 years ago

Hello, this is an absolutely great project. Thanks! Is there any way to download files larger than 2.0M? We need it for our internal project.

Warm regards!

centic9 commented 6 years ago

I don't think there is any such limitation in my project. If you only see files smaller than that, it is likely a limit imposed by CommonCrawl itself, in which case this tool cannot do anything about it as far as I can see.

Please reopen with more details if this is not the case.
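
One way to gather such details is to measure the largest file that was actually downloaded. The following is only a minimal sketch, not part of this tool: the class name `LargestDownload` and the default directory name `download` are hypothetical placeholders, so pass the real output directory as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of CommonCrawlDocumentDownload.
public class LargestDownload {

    /** Returns the size in bytes of the largest regular file under dir, or 0 if there is none. */
    public static long largestFileSize(Path dir) throws IOException {
        // Files.walk visits dir recursively; the stream must be closed after use.
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .max()
                        .orElse(0L);
        }
    }

    public static void main(String[] args) throws IOException {
        // "download" is an assumed default; adjust it to wherever the files were stored.
        Path dir = Paths.get(args.length > 0 ? args[0] : "download");
        System.out.println("Largest file: " + largestFileSize(dir) + " bytes");
    }
}
```

If the reported maximum sits at or just below the same value across many files, that would suggest a cap applied upstream rather than a coincidence in the data.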