PeacefulScience / peacefulscience.org

hugo website

Add in posts for all referenced books to "books/" #7

Closed: swamidass closed this issue 2 years ago

swamidass commented 2 years ago

I can provide a list of HTML links to the books, but getting this done right will be an ongoing project. Long term, a script that scrapes the required information would be ideal. The hard part will be scraping Amazon, because it requires some special request headers before it will serve pages to a script.

swamidass commented 2 years ago

@madroxdupe42 you have the list now. Let me know if you need anything more from me regarding it, and be sure to let me know about any ambiguous cases.

swamidass commented 2 years ago

Also, there were 150, so it may not be a bad idea to build a script to scrape some or all of that info. Getting around Amazon's anti-scraping measures takes some finagling. See here for some info: https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/.
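To make the "special headers" point concrete, here is a minimal sketch using only the standard library. The header values and the function names `build_request` and `fetch_html` are assumptions for illustration, not something taken from the linked tutorial, and a site may still respond with a CAPTCHA or an error page even when these headers are present.

```python
import urllib.request

# Assumed header values: they imitate an ordinary desktop browser.
# Any realistic browser profile could be substituted here.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page as text; the server may still refuse or throttle the request."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Keeping `build_request` separate from `fetch_html` means the headers can be inspected or adjusted without touching the network.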

Alternatively, Amazon does have an API you could register for, and hopefully it has all the right info.

Another possibility is this site, which claims to convert ASIN to ISBN and back: https://www.synccentric.com/features/isbn-to-asin/
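For many print books the ASIN is the same string as the ISBN-10, so part of the conversion can be done offline. The sketch below assumes that is the case for the book in question. It is not how the linked service works, and it deliberately rejects identifiers that are not valid ISBN-10s, such as Kindle-style ASINs. The function name `isbn10_to_isbn13` is made up for this example.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a valid ISBN-10 to its 978-prefixed ISBN-13 form."""
    digits = isbn10.replace("-", "").replace(" ", "").upper()
    last_ok = len(digits) == 10 and (digits[9].isdigit() or digits[9] == "X")
    if not last_ok or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")

    # ISBN-10 checksum: weights 10 down to 1, total divisible by 11 ("X" counts as 10).
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(digits)
    )
    if total % 11 != 0:
        raise ValueError(f"bad ISBN-10 check digit: {isbn10!r}")

    # ISBN-13 checksum: weights alternate 1, 3 over the first twelve digits.
    body = "978" + digits[:9]
    weighted = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(body))
    check = (10 - weighted % 10) % 10
    return body + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))  # prints 9780306406157
```

Identifiers that fail the check would still need a lookup through an API or a conversion service.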

I also recommend the "editions" function of https://pypi.org/project/isbnlib/ for collecting the relevant ISBN editions of a book. The library has some other helpful functions too.
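A rough sketch of how the editions lookup might be wrapped in a script. The helper `collect_editions` and its `lookup` parameter are hypothetical names introduced here. By default the helper calls `isbnlib.editions`, which queries online services and so needs `isbnlib` installed and network access. Passing a different `lookup` callable allows offline testing.

```python
from typing import Callable, Iterable, List, Optional


def collect_editions(
    isbn: str,
    lookup: Optional[Callable[[str], Iterable[str]]] = None,
) -> List[str]:
    """Return the given ISBN plus its known editions, de-duplicated, in order."""
    if lookup is None:
        # Imported lazily so the helper can be exercised without isbnlib installed.
        import isbnlib  # third-party: pip install isbnlib

        lookup = isbnlib.editions

    seen = set()
    result = []
    for candidate in [isbn, *lookup(isbn)]:
        normalized = candidate.replace("-", "").strip()
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

The original ISBN is always kept first, so a failed or empty lookup still yields a usable one-element list.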

madroxdupe42 commented 2 years ago

I agree that a script is a sensible way to go given the volume. Thanks for the links to the research you've already done on the topic. I'll see what I can do.

swamidass commented 2 years ago

If you make a script, please write it in Python, put it in the scripts directory of the project, and make it reasonably easy to use. If it is robust enough, I'll link it into the build system.

swamidass commented 2 years ago

The first batch looks like it's done. I'll send you an updated list of books, and also set up a system so you can check at any time which books are still unresolved. The goal will be to keep updating them, hopefully within a few days of a new reference being added.
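One way such an "unresolved books" check could look, as a sketch only. It assumes a hypothetical layout in which each resolved book has a Markdown file in "books/" named after its identifier (for example `9780306406157.md`). The project's real layout and the function name `unresolved_books` are assumptions, not something stated in this thread.

```python
from pathlib import Path
from typing import Iterable, List


def unresolved_books(referenced: Iterable[str], books_dir: Path) -> List[str]:
    """Return referenced identifiers that have no matching post in books_dir."""
    # Assumed convention: one "<identifier>.md" file per resolved book.
    existing = {path.stem for path in books_dir.glob("*.md")}
    return sorted(set(referenced) - existing)
```

Run against the current list of references, the function returns a sorted list of identifiers that still need a post. An empty list means everything is resolved.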