bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset bloom_library #198

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
cccntu commented 2 years ago

I don't see a direct link to download all of the data at once. Should we write code to crawl it, or contact the site owner first? Similar situations: #230, #241.

albertvillanova commented 2 years ago

Hi @cccntu,

Yes, I agree: there doesn't seem to be a direct download link.

Moreover, crawling their site is forbidden by their terms of use: https://bloomlibrary.org/page/termsOfUse

Acts Against the Site/Services 

...
(f) using manual or automated software, devices, scripts, robots, or other means or processes to access, "scrape," "crawl," or "spider" any pages contained in the Site;

I guess we should contact the site owners first to ask for permission.

CC: @yjernite

apergo-ai commented 2 years ago

self-assign

apergo-ai commented 2 years ago

Request for permission sent.

albertvillanova commented 2 years ago

Thanks, @apergo-ai. Any feedback on this? Did you contact Daniel Whitenack from SIL?

apergo-ai commented 2 years ago

They are working on it. I'll check.