StampyAI / alignment-research-dataset

Stampy's copy of Alignment Research Dataset scraper
https://huggingface.co/datasets/StampyAI/alignment-research-dataset
MIT License

Books to add #39

Open plexish opened 1 year ago

plexish commented 1 year ago

Good books which seem to be missing from https://huggingface.co/datasets/StampyAI/alignment-research-dataset/viewer/gdrive_ebooks/train

- The AI Does Not Hate You
- The Alignment Problem
- Smarter Than Us
- Artificial Intelligence Safety and Security
- Homo Deus

ccstan99 commented 1 year ago

@plexish Do we have access to epub, pdf, markdown or some other "open" versions of these books? We'll need some source to scrape.

plexish commented 1 year ago

I don't have them at hand. The initial task for whoever picks this up is to do a bit of searching around to see if they can be found, maybe with https://z-lib.is/

plexish commented 1 year ago

Yeah, looks like that has some of them at least

mruwnik commented 1 year ago

How does that look from a legal perspective? Could that be a potential copyright issue?

ccstan99 commented 1 year ago

Wait! "The Alignment Problem" is already somewhere in the dataset, because I've seen it cited in several generated responses. Unless we've lost something in our new scrape, these were books from the 1.0 scrape from last year.

That includes all of @plexish's suggestions except for "Artificial Intelligence Safety and Security" and "Homo Deus".

ccstan99 commented 1 year ago

FYI, looking at the notes again, they pulled several books from http://z-lib.org/, which looks suspiciously like your link https://z-lib.is/. This is what the site currently looks like, so it looks like we definitely have some copyright issues here.

[Screenshot, 2023-06-18, 12:48:08 PM]

Maybe reach out to some of the authors for permission to use?

ccstan99 commented 1 year ago

On an issue separate from copyright, those epub files are in the data/raw/book_text directory and also in gdocs_ebooks.jsonl. Not sure why they don't show up in the HF dataset viewer, though; they're definitely in the dataset when I download and inspect it.
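
For anyone who wants to repeat that kind of check on a downloaded copy, here is a minimal sketch. It assumes each line of the JSONL file is a JSON object with a `title` field; the field name and the helper `find_titles` are hypothetical, not taken from the repo, so adjust them to whatever the file actually contains.

```python
import json
from pathlib import Path


def find_titles(jsonl_path, wanted, title_field="title"):
    """Return the subset of `wanted` titles found in a JSONL file.

    Assumption: every non-empty line is a JSON object and the book title
    lives under `title_field` (a hypothetical field name).
    Matching is a case-insensitive substring test on the title.
    """
    wanted_lower = {w.lower(): w for w in wanted}
    found = set()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                continue  # skip anything that is not a JSON object
            title = str(record.get(title_field, "")).lower()
            for key, original in wanted_lower.items():
                if key in title:
                    found.add(original)
    return found
```

Calling `find_titles("gdocs_ebooks.jsonl", ["The Alignment Problem", "Smarter Than Us"])` on a local copy would return whichever of those titles appear. A substring match is used so that subtitles in the stored title do not cause a miss.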