Open plexish opened 1 year ago
@plexish Do we have access to epub, pdf, markdown or some other "open" versions of these books? We'll need some source to scrape.
I don't have them at hand. The initial task for whoever picks this up is to search around and see whether they can be found, maybe with https://z-lib.is/
Yeah, looks like that has some of them at least
How does that look from a legal perspective? Could that be a potential copyright issue?
Wait! "The Alignment Problem" is already somewhere in the dataset, because I've seen it cited in several generated responses. Unless we've lost something in our new scrape, these were books from the 1.0 scrape from last year.
That includes all of @plexish's suggestions except for "Artificial Intelligence Safety and Security" and "Homo Deus".
Looking at the notes again, they pulled several books from http://z-lib.org/, which looks suspiciously like your link https://z-lib.is/. FYI, this is what the site currently looks like, so it looks like we definitely have some copyright issues here.
Maybe reach out to some of the authors for permission to use?
On an issue separate from copyright, those epub files are in the data/raw/book_text directory and also in gdocs_ebooks.jsonl. Not sure why they don't show up in the HF dataset viewer though, but they're definitely in the dataset when I download and inspect it.
Good books which seem to be missing from https://huggingface.co/datasets/StampyAI/alignment-research-dataset/viewer/gdrive_ebooks/train:
The AI Does Not Hate You
The Alignment Problem
Smarter Than Us
Artificial Intelligence Safety and Security
Homo Deus
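For anyone who wants to double-check which of these titles are actually present in a local copy of gdocs_ebooks.jsonl, here is a minimal sketch. It assumes each line of the file is a JSON object with a `title` field, which the thread does not confirm; the field name and the helper names `find_titles` and `check_file` are hypothetical, so adjust them to whatever the records really contain.

```python
import json
import sys
from pathlib import Path

# Titles from the list above.
WANTED = [
    "The AI Does Not Hate You",
    "The Alignment Problem",
    "Smarter Than Us",
    "Artificial Intelligence Safety and Security",
    "Homo Deus",
]


def find_titles(lines, wanted, field="title"):
    """Return {title: bool}: True if some record's `field` contains the title.

    `field` is an assumed key name, not a confirmed part of the dataset schema.
    Matching is a case-insensitive substring test, so subtitles are tolerated.
    """
    found = {title: False for title in wanted}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        value = str(record.get(field) or "").casefold()
        for title in wanted:
            if title.casefold() in value:
                found[title] = True
    return found


def check_file(path, wanted=WANTED, field="title"):
    """Run find_titles over a JSON Lines file on disk."""
    with Path(path).open(encoding="utf-8") as handle:
        return find_titles(handle, wanted, field)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to your local JSONL file as the first argument.
    for title, present in check_file(sys.argv[1]).items():
        print(f"{'found  ' if present else 'MISSING'}  {title}")
```

Substring matching on the title is deliberately loose; if the records only carry file names or URLs, the same loop could be pointed at a different field via the `field` argument.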