KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Downloading books from the Tamil Nadu government site and uploading them to the Internet Archive #97

Open balajijagadesh opened 4 years ago

balajijagadesh commented 4 years ago

The Tamil Nadu government has uploaded its museum-related publications to this website:

http://www.e-books-chennaimuseum.tn.gov.in/chennaimuseum/index.php?option=com_content&view=article&id=18&Itemid=116

The books are listed there in alphabetical order.

We need to identify the structure of the book URLs.

Download all the books locally, along with the relevant metadata.
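The two steps above could start from something like the sketch below. It only pulls PDF links and their link text out of an index page's HTML. I have not inspected the real page markup, so the assumption that books appear as plain `<a href="...pdf">` links, the names `PdfLinkParser` and `extract_books`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect a {title, url} record for every <a> whose href ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.books = []      # finished records
        self._href = None    # absolute URL of the PDF link currently open
        self._text = []      # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page they came from.
            self._href = urljoin(self.base_url, href)
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            self.books.append({"title": title, "url": self._href})
            self._href = None


def extract_books(html, base_url):
    """Return the PDF links found in one index page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.books


# Hypothetical markup, only to show the shape of the result.
SAMPLE_HTML = (
    '<ul><li><a href="/books/example-a.pdf">Example  Book A</a></li>'
    '<li><a href="index.php?id=2">Next</a></li></ul>'
)
SAMPLE_BASE = "http://example.org/index.php"

if __name__ == "__main__":
    print(extract_books(SAMPLE_HTML, SAMPLE_BASE))
```

The resulting records can be saved as JSON next to the downloaded files so the title and source URL stay attached to each PDF. If the site serves books through a download script rather than direct `.pdf` links, the `endswith(".pdf")` check would need to change.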

Then upload them to the Internet Archive under the Creative Commons CC BY-SA license, as per this government order. The books need to be uploaded with proper metadata for easy access in the future. We can also explore adding an OCR layer to each PDF before uploading.

https://commons.wikimedia.org/wiki/File:GoTN_Tamil_Development_Departments_order_on_creative_commons_cc_by_sa.pdf
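For the upload step, a minimal sketch using the `internetarchive` Python package might look like this. The helper names, the `chennaimuseum` identifier prefix, and the choice of metadata fields are my assumptions. The exact license URL (including the CC BY-SA version) has to be taken from the government order, so it is a required argument here rather than hard-coded.

```python
import re


def make_identifier(title, prefix="chennaimuseum"):
    """Build an ASCII item identifier from a book title.

    Non-ASCII titles (e.g. Tamil) reduce to the bare prefix, so those
    would need a transliterated title or a serial number instead.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return f"{prefix}-{slug}" if slug else prefix


def build_metadata(title, source_url, license_url):
    """Metadata dict for one book; extend with author, date, language, etc."""
    return {
        "title": title,
        "mediatype": "texts",
        "licenseurl": license_url,  # must match the government order
        "source": source_url,       # page the PDF was downloaded from
    }


def upload_book(pdf_path, title, source_url, license_url):
    """Upload one PDF. Needs `pip install internetarchive` and `ia configure`."""
    # Imported lazily so the helpers above work without the package installed.
    from internetarchive import upload

    return upload(
        make_identifier(title),
        files=[pdf_path],
        metadata=build_metadata(title, source_url, license_url),
    )
```

For the OCR layer, one option to evaluate is the `ocrmypdf` command-line tool with a Tamil language pack, run on each PDF before `upload_book` is called. Note that the Internet Archive also runs its own OCR on uploaded texts, so this step may turn out to be unnecessary.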

Once the books are on the Internet Archive, they can easily be transferred to commons.wikimedia.org using the tool

https://tools.wmflabs.org/ia-upload/

at a later stage.

muthu1809 commented 3 years ago

I believe all the books from the Chennai Museum are already on archive.org; it seems https://archive.org/details/@malamud has already uploaded them. I spot-checked a few Chennai Museum book titles on archive.org and they are present. Kindly check. If more work is needed on this, Tamilvelan (a Payilagam Python trainee) was asked earlier to do it. His code is here: https://tamilvelanpython.wordpress.com/2020/07/06/web-scraping-project/
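To go beyond a spot check, the title lookup could be scripted against archive.org's advanced search endpoint. This is only a sketch: the function names are hypothetical, the query matches on title alone (it does not restrict results to a particular uploader), and titles containing double quotes would need escaping first.

```python
import json
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(title, rows=5):
    """URL that asks archive.org for text items whose title matches `title`."""
    params = [
        ("q", f'title:("{title}") AND mediatype:texts'),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def matching_identifiers(response_text):
    """Pull item identifiers out of the JSON body returned by the search."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc["identifier"] for doc in docs]
```

Fetching `build_search_url(title)` with `urllib.request.urlopen` and passing the decoded body to `matching_identifiers` gives the candidate items for each title. An empty list for any title from the museum site would mean that book still needs uploading.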