OpenPecha / rag_prep_tool

MIT License

RAG0001: Preprocess Documents (2) #1

Closed tenzin3 closed 4 months ago

tenzin3 commented 4 months ago

Description:

Extract metadata from the cleaned, extracted text of the Dalai Lama's books.

Metadata list:

Expected Output:

A JSON file in the format shown below.

Image
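
As a rough sketch of the output step, the snippet below writes one book's metadata to a JSON file. The helper `write_metadata`, the field names (`title`, `author`, `source_file`), and the file name are hypothetical placeholders for illustration only; the actual schema is the one in the image above.

```python
import json
from pathlib import Path


def write_metadata(metadata: dict, out_path: Path) -> Path:
    """Write a metadata dict to a UTF-8 JSON file and return its path."""
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps any non-ASCII text readable in the file.
    out_path.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path


# Hypothetical record: these keys are assumptions, not the issue's schema.
example = {
    "title": "Ethics for the New Millennium",
    "author": "Dalai Lama",
    "source_file": "ethics_for_the_new_millennium.txt",
}
```

Calling `write_metadata(example, Path("output/metadata.json"))` would produce one JSON file per book; the keys would need to be replaced with the ones the issue actually specifies.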

Implementation Plan

Image

Work Items

tenzin3 commented 4 months ago

We need to select one more book that meets the following requirements:

Most of the PDF versions and clean text files are available on GitHub under MonlamAI. Below is a list of Dalai Lama books that have a clean text version (not sure whether a PDF version is also available): list of Dalai Lama books

tenzin3 commented 4 months ago

Chose Ethics for the New Millennium as the second book.