bksubhuti / tipitaka-pali-reader

A Pali Reading app made in Flutter

Write a Dart Script That Inputs VRI Tipitaka XML Files and Outputs SQL Files for Inserting Into the SQLite DB #219

Open iulspop opened 10 months ago

iulspop commented 10 months ago

Original request from Bhante Subhuti:

Yes.. let me show you one.. I think we have this book.. but it does not matter.. This is the xml format. https://github.com/VipassanaTech/tipitaka-xml/blob/main/romn/e0201n.nrf.xml

We need to make an sql txt import file that writes to the pages, categories, books, and tocs tables.

You can look at the working version from the file that is installed from the app store. To find the db, you just go to Settings / Help About etc. / Reset Data (DO NOT RESET). The db directory will be shown there.

There is also an immediate need to add a "simple" field to the current books that are extensions. You can start with that.. the IIT chanting book.

The simple field will have the diacritical characters removed (leave ñ alone). I can give you dart code that makes the simple field.

There are 2 ways it can be done.. One is just modifying our sql file.. The other option is to modify the program that made the files. The 2nd method is preferred.

This book has fake pages.. made up.. they cannot be too big .. nor too small or else we get performance issues.

For the xml, you will need to match the codes.. we want to keep codes that tell us the page, alt readings, and other book pages and paragraphs.

Start with this. You can use the sqlite db browser to see if the imports work. You can delete based on bookid: `delete from pages where bookid=xyz`

See the attached sql; it will show you the format we want. The job for this sql and several others is to change the code that generated it and add an extra field on the toc inserts. I'll send you that code later.

Bhante Subhuti

Here's what I've understood the task to be:

Write a Dart script which processes each XML file of the Roman script version of the Tipitaka provided by VRI and outputs an SQL file for importing the book into the SQLite DB.

Each XML file maps to a book in the books table, each book is related to one category in the category table, each book is related to many pages in the pages table, and each book is related to a toc in the tocs table.
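
To make this concrete for myself, here is a rough sketch of the pipeline, assuming package:xml for parsing; the table and column names here are guesses from this thread and need checking against the real schema in tipitaka_pali.db:

```dart
import 'dart:io';

import 'package:xml/xml.dart';

String sqlEscape(String s) => s.replaceAll("'", "''");

void main() {
  final doc = XmlDocument.parse(File('e0201n.nrf.xml').readAsStringSync());
  final sql = StringBuffer()
    // Hypothetical columns; check the books table in tipitaka_pali.db.
    ..writeln("INSERT INTO books (id, name) VALUES ('e0201n', 'Book Name');");

  // Hypothetical chunking: 10 paragraphs per fake page; real page breaks
  // would come from pb elements where the file has them.
  final paras = doc.findAllElements('p').toList();
  var page = 0;
  for (var i = 0; i < paras.length; i += 10) {
    page++;
    final content =
        paras.skip(i).take(10).map((p) => p.toXmlString()).join('\n');
    sql.writeln("INSERT INTO pages (bookid, page, content) VALUES "
        "('e0201n', $page, '${sqlEscape(content)}');");
  }
  File('e0201n.sql.txt').writeAsStringSync(sql.toString());
}
```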

"We need to make an sql txt import file that writes to the pages categories books and tocs tables"

You do not mention the paragraphs table here, should that be written to as well?

"There is also an immediate need to add a "simple" field to the current books that are extensions. You can start with that.. the IIT chanting book. The simple field will have the diacritical characters removed (leave ñ alone). I can give you dart code that makes the simple field."

To which table would I add the "simple" field? The pages table? It has the "content" text field.
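
In the meantime, a minimal sketch of what I understand "makes the simple" to mean (the character map is my assumption, not Bhante's actual code):

```dart
/// Strips Pali diacritics for the "simple" field, leaving ñ alone as
/// requested. The character map here is an assumption, not Bhante's code.
String toSimple(String text) {
  const map = {
    'ā': 'a', 'ī': 'i', 'ū': 'u',
    'ṅ': 'n', 'ṇ': 'n', 'ṃ': 'm',
    'ṭ': 't', 'ḍ': 'd', 'ḷ': 'l',
    // 'ñ' is intentionally not mapped.
  };
  return text.split('').map((ch) => map[ch] ?? ch).join();
}

void main() {
  print(toSimple('paṭṭhāna ñāṇa')); // patthana ñana
}
```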

"For the xml, you will need to match the codes.. we want to keep codes that tell us the page, alt readings, and other book pages and paragraphs."

What specific XML codes are you referring to? In the XML files all I see are paragraph numbers like:

```xml
<p rend="bodytext" n="8"><hi rend="paranum">8</hi><hi rend="dot">.</hi> (Ka) dassanena pahātabbā dhammā.</p>
```

I don't see information about pages, alt readings, and other book pages and paragraphs.
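
At least the paragraph number and rend value are easy to recover; a quick check with package:xml on the sample element above:

```dart
import 'package:xml/xml.dart';

void main() {
  const sample = '<p rend="bodytext" n="8"><hi rend="paranum">8</hi>'
      '<hi rend="dot">.</hi> (Ka) dassanena pahātabbā dhammā.</p>';
  final p = XmlDocument.parse(sample).rootElement;
  print(p.getAttribute('rend')); // bodytext
  print(p.getAttribute('n'));    // 8
  print(p.innerText);            // 8. (Ka) dassanena pahātabbā dhammā.
}
```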

"There are 2 ways it can be done.. One is just modifying our sql file.. The other option is to modify the program that made the files. The 2nd method is preferred. The job for this sql and several others is to change the code that generated it and add an extra field on the toc inserts. I'll send you that code later."

Where is the program that made the SQL files for importing the books like the one you shared?

The SQL file for importing the chanting book you shared: iit_chantingbook.sql.txt

Log Of What I've Investigated So Far (These are more notes to myself)

I downloaded the app "Tipitaka Pali Reader" from the App Store on my desktop MacOS, and found the SQLite .db file at /Users/iulspop/Library/Containers/org.americanmonk.tpp/Data/Documents/tipitaka_pali.db.

I also learned I can clone this repo and run:

```sh
cd tipitaka-pali-reader/assets/database
gdown 1II8XYSQw0JzZxJk2J4QT9XyN2SnqT9qm
unzip tipitaka_pali.zip
```

to download the unsplit tipitaka_pali.db file.

I downloaded DB Browser for SQLite to explore the schema in a GUI.

I then looked at the structure of the VRI .xml files.


It looks like for each of the seven Abhidhamma Piṭaka books there's a .att.xml file for the "aṭṭhakathā" or commentary, a .tik.xml file for the "mūlaṭīkā" or sub-commentary, and a .mul.xml file for the book itself.

I don't understand what the .nrf.xml files are. Some are anuṭīkā texts, which I think means "sub-sub-commentary"? Others are not from the "Abhidhammapiṭake" nikaya but from other nikayas like "Abhidhammāvatāra-purāṇaṭīkā", or don't have a nikaya attribute at all but only a book title like "Abhidhammatthasaṅgaho". I suppose they're additional texts not part of the Pali Canon?

I found this "Essence of the Tipitaka" document by VRI a good reference for understanding what texts these various .xml files refer to: https://www.tipitaka.org/eot

I'm starting to see a structure.

There's an abh series of XML files which contains the Abhidhamma Piṭaka, its commentaries and sub-commentaries, and additional related texts.

There's an e series of files that seem to be extra Pali books outside the Tipiṭaka.

There's an s series of files that are part of the Sutta Piṭaka and its commentaries and sub-commentaries.

Then there's a vin series of files that are part of the Vinaya Piṭaka.
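
Codifying that naming pattern as a sketch (the suffix meanings are my own reading, not a documented spec):

```dart
// My reading of the VRI filename suffixes; unconfirmed against any spec.
String classify(String filename) {
  if (filename.contains('.mul.')) return 'mūla (root text)';
  if (filename.contains('.att.')) return 'aṭṭhakathā (commentary)';
  if (filename.contains('.tik.')) return 'ṭīkā (sub-commentary)';
  if (filename.contains('.nrf.')) return 'other (anuṭīkā, extra texts, ...)';
  return 'unknown';
}

void main() {
  print(classify('e0201n.nrf.xml')); // other (anuṭīkā, extra texts, ...)
}
```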

The XML files have these elements (I haven't gotten a comprehensive list yet): `head`, `div`, `p` (paragraph), `pb` (page break), `teiHeader`, `text`, `hi` (highlight), `note`.

p elements often have a rend attribute, like: `centre`, `nikaya`, `title`, `book`, `subsubhead`, `gatha1`, `gathalast`, `subhead`, `bodytext`, `indent`, `gatha2`, `gatha3`, `chapter`, `unindented`, `hangnum`.
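
If toc inserts are generated from these, a mapping along these lines might work; the heading set and level numbers are pure guesses on my part:

```dart
// Guessed mapping from heading-like rend values to toc depth; bodytext,
// gatha*, indent, etc. would stay in page content rather than the toc.
const tocLevelByRend = {
  'nikaya': 1,
  'book': 2,
  'chapter': 3,
  'title': 3,
  'subhead': 4,
  'subsubhead': 5,
};

bool isTocEntry(String? rend) => tocLevelByRend.containsKey(rend);
```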

bksubhuti commented 10 months ago

You have made some great progress all on your own. There is no need for the paragraphs table. There is a more immediate need to make extensions for the books we are missing. I'm not sure if I asked this with you or another Lao monk; putting it here is a good idea. We are missing some books that are found in VRI and also tipitaka.app.
It is a priority, and a good way to practice, to get these books working as an extension. They are independent from the linked books, and if you get the pages wrong or off by one, it does not matter so much.

The simple field in toc is no longer used. I will remove that.
The initial query is small enough that we can get all toc items for a book and then filter locally.
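
Sketched with sqflite, "get all toc items for a book and then filter locally" could look like this (table and column names assumed):

```dart
import 'package:sqflite/sqflite.dart';

// Sketch only: fetch every toc row for one book, then filter in Dart.
// Table and column names (tocs, bookid, name) are assumptions.
Future<List<Map<String, Object?>>> searchToc(
    Database db, String bookId, String query) async {
  final rows =
      await db.query('tocs', where: 'bookid = ?', whereArgs: [bookId]);
  return rows
      .where((r) => (r['name'] as String? ?? '').contains(query))
      .toList();
}
```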

It would be good to do a call on google meet.

iulspop commented 10 months ago

Missing añña books:

- all Saṃgāyana pucchā
- all Ledi Sayadaw
- all buddhavandana
- all vansagatha
- from grammar: Bālāvatāra
- all nithigantha
- all pakinaka gantha
- all sinhala gantha

Prioritize a Saṃgāyana pucchā book or a missing Ledi Sayadaw book

I'll start with "Patanudessa"

Getting it into our system is the priority; focus on Myanmar paragraph numbers and real pages.

iulspop commented 10 months ago

Reorganize tpr_downloads to have a release dir where we put .zip files with the sql file for importing the añña texts (later the whole VRI Tipitaka).

iulspop commented 10 months ago

TODOS

Iuliu:

Bhante Subhuti:

bksubhuti commented 10 months ago

I sent a request to Janaka for the VRI codes, and sent a request to a monk to give the name of a priority book. Added you as a collaborator on tpr_downloads.

bksubhuti commented 9 months ago

Anudīpanīpāṭha was suggested. https://tipitaka.org/romn/cscd/e0401n.nrf0.xml

And if you want to do Saṃgāyana pucchā, you can also do that. The first book of the first folder is here: https://tipitaka.org/romn/cscd/e0901n.nrf0.xml

You can choose which one.. probably the Ledi Sayadaw book will be easier.

bksubhuti commented 9 months ago

The message I got back was this..

"I don't think the codes are documented anywhere. At least I haven't seen. also I don't think the codes are too complicated to understand studying oneself which is what I did. when you go through looking at the XML file you will intuitively understand what the codes mean. of course if he has any questions I would be happy to answer as well. The problem with making documentation is that I will have to go through a file and try to understand them again since I have forgotten all of it. So it is best to ask questions when you have and I will be happy to answer."

He is on facebook under the name Path Nirvana, so you can send him a message if you need help. I think you might be able to leave them "as is".. we can see later. The code investigation would be better with a mula book.. for instance Majjhima Nikāya. There will be different versions for page numbers and different alt readings. You can find that by ctrl-clicking on tipitaka.org (go to the website that displays the Pali) and then matching it with the github link.

I think this is the link here

bksubhuti commented 9 months ago

By comparing MN 1, I think the alt readings have a note tag.

And the books should be aligned in the beginning.. so we should know the paranum and the 3 books. I'll ask what the letters are but it should not matter.. and we probably have the same code pasted verbatim.
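
If the alt readings really are note elements, collecting them would be a one-liner with package:xml (unverified against the real files):

```dart
import 'package:xml/xml.dart';

// Unverified sketch: pull the text of every note element as an alt reading.
Iterable<String> altReadings(XmlDocument doc) =>
    doc.findAllElements('note').map((n) => n.innerText);
```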

iulspop commented 9 months ago

Hi @bksubhuti, I read the TPR Downloads repo more carefully and I understand now that all of the SQL files there are for importing extensions, but not the main Tipitaka texts. I wonder how the current Tipitaka texts available in Tipitaka Pali Reader were imported? Are there SQL files, or code used to generate SQL files for them, available anywhere? Or were they imported manually somehow? Or maybe you reused the already-loaded database from the Myanmar-only app?

In any case, I dumped the current DB data to SQL to get started. I followed up on this issue in this PR draft: https://github.com/bksubhuti/tpr_downloads/pull/2. Let's continue the conversation there.

bksubhuti commented 9 months ago

I will forward a message to @pndaza. He is more familiar with the format. Hopefully he can comment and answer your questions. It is important, even critical, to have the page breaks match his page breaks, especially for the main texts. The main texts are a great learning exercise, rather than the añña books, which don't have links in them.

The original design made several years ago has the three top-level categories hard-coded. This has caused some issues with searching and we would like to fix it. I thought we had an issue to fix this, but I cannot seem to find it. If you go to book_list_page.dart you will find the correct codes for the topmost-level categories. I'm going to breakfast now.. but I think you are exceeding my knowledge of the texts now. Great job. I'll try to send Ven. @pndaza to your PR and also merge this. You have direct access as well to push.

bksubhuti commented 1 month ago

Note to self and update on progress: All books are can be imported with sql. Need to break up the sections and make zipped extensions. Ledi Sayadaw section is set to be finished next weekend (July 14). The sql script should take care of multiple installs without duplicates. (need to delete previous instances, "if exist") category book pages and tocs ( or conditionally add the category). The sql script should also remove previous ledi sayadaw books from the annya section that are not grouped under this new heading (if exists). book pages and tocs