funderburkjim / kosha-dev

Develop xml and html for anekArthaka and samAnArthaka Sanskrit dictionaries

Note to self - Understanding the file updation process #8

Closed drdhaval2785 closed 1 year ago

drdhaval2785 commented 1 year ago

This is a note to @drdhaval2785. I am noting it here so that my step-by-step understanding and thinking out loud is not lost. I am dropping the idea of changing the XML file drastically.

Understand what is happening

I will try to understand the changes a dictionary goes through from the file anhk1.txt to harsa.txt, harsa.xml and harsa.sqlite. I have only a sketchy knowledge of how this pipeline works in other Cologne dictionaries. I will explore and document, step by step, what the input is, what the output is and which script does the work. This will help me get a better grasp of what to do to minimize duplication in every possible manner.

Input file - anhk1.txt

This file is an annotated version of the anekārthanāmamālā of Harśakīrti. See https://github.com/sanskrit-lexicon/COLOGNE/issues/405#issuecomment-1471292298

Format

<L>1<pc>140
<k1>क-पुं<meanings>सूर्य,वेधस्
<k1>क-क्ली<meanings>सुख,मस्तक,जल
<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्
<k1>लोक-पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
<LEND>

harsa.txt

This file is the SLP1 version of anhk1.txt. It is generated by the prep/harsa/redo.sh file. The workhorse is the prep/harsa/convert.py file.

Format

<L>1<pc>140
<k1>ka-puM<meanings>sUrya,veDas
<k1>ka-klI<meanings>suKa,mastaka,jala
<k1>Sloka-puM<meanings>anuzwuB,yaSas
<k1>loka-puM<meanings>Buvana,jana
sUrye veDasi vAyO kaH kaM suKe mastake jale .
anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..
<LEND>

harsa.txt to csl-orig

csl-orig is the place where Cologne stores its dictionary data.

Generate local displays

cd csl-pywork
sh generate_dict.sh harsa  ../apps/harsa

This code does three things.

  1. Generate the local copies of orig / pywork / websanlexicon in apps/harsa directory.
  2. Regenerate the headword details in pywork directory by running redo_hw.sh
  3. Regenerate the XML details in pywork directory by running redo_xml.sh

They require further examination.

The log generated in the process is quite helpful for understanding what is happening under the hood.

[create@dhaval-pc csl-pywork]$ sh generate_dict.sh harsa  ../apps/harsa
BEGIN generate_orig.sh harsa ../apps/harsa
END generate_orig.sh harsa ../apps/harsa

BEGIN generate_pywork.sh harsa ../apps/harsa
END generate_pywork.sh harsa ../apps/harsa

BEGIN generate_web.sh harsa ../apps/harsa
generate web code for dictionary harsa to /home/create/Documents/projects/drdhaval2785/kosha-dev/v1/apps/harsa
END generate_web.sh harsa ../apps/harsa

*****************************************************
BEGIN execution of pywork code at ../apps/harsa/pywork
BEGIN redo_hw.sh
construct xxxhw.txt
BEGIN hw.py
BEGIN hw.py init_entries
934 lines read from ../orig/harsa.txt
112 entries found
END hw.py init_entries

BEGIN write_entries
1260 lines written to harsahw.txt
END write_entries
END hy.py

BEGIN hw2.py
END hw2.py
BEGIN hw0.py
END hw0.py
END  redo_hw.sh
regenerate harsa.xml and postxml files
BEGIN redo_xml.sh
construct harsa.xml...
make_xml.py BEGINS !!!!!
All records parsed by ET
\nxmllint on harsa.xml...
SKIPPING xmllint validity check
\nharsa.sqlite...
BEGIN pywork/redo_postxml.sh
/home/create/Documents/projects/drdhaval2785/kosha-dev/v1/apps/harsa/pywork
cp harsaheader.xml ../web/

BEGIN sqlite
remaking harsa.sqlite from ../harsa.xml with python...
sqlite.py: dictionary code= harsa
create_index takes 0.13 seconds
1265 lines read from ../harsa.xml
1260 rows written to harsa.sqlite
0.42 seconds for batch size 10000
moving harsa.sqlite to web/sqlite/
END sqlite
1260 records read from ../harsa.xml
1256 records written to query_dump.txt
END redo_xml.sh

generate_dict.sh:  NOT preparing downloads
/home/create/Documents/projects/drdhaval2785/kosha-dev/v1/apps/harsa/pywork

redo_hw.sh

This uses three scripts hw.py, hw2.py and hw0.py and generates three output files harsahw.txt, harsahw2.txt and harsahw0.txt.

hw.py

Reads two input files harsa.txt and harsa_hwextra.txt (currently blank). Generates harsahw.txt.

Format of harsahw.txt

<L>1<pc>140<k1>ka<k2>ka<ln1>53<ln2>60
<L>1<pc>140<k1>sUrya<k2>sUrya<ln1>53<ln2>60
<L>1<pc>140<k1>veDas<k2>veDas<ln1>53<ln2>60
<L>1<pc>140<k1>suKa<k2>suKa<ln1>53<ln2>60
<L>1<pc>140<k1>mastaka<k2>mastaka<ln1>53<ln2>60
<L>1<pc>140<k1>jala<k2>jala<ln1>53<ln2>60
<L>1<pc>140<k1>Sloka<k2>Sloka<ln1>53<ln2>60
<L>1<pc>140<k1>anuzwuB<k2>anuzwuB<ln1>53<ln2>60
<L>1<pc>140<k1>yaSas<k2>yaSas<ln1>53<ln2>60
<L>1<pc>140<k1>loka<k2>loka<ln1>53<ln2>60
<L>1<pc>140<k1>Buvana<k2>Buvana<ln1>53<ln2>60
<L>1<pc>140<k1>jana<k2>jana<ln1>53<ln2>60

Here, L stands for lnum, pc for page-column, k1 for key1, k2 for key2, ln1 for the starting line of the entry in harsa.txt and ln2 for its ending line. So the entry with lnum 1 starts at line 53 and ends at line 60. Note that line 53 is the metaline starting with <L>, and line 60 is the <LEND> metaline that marks the end of the entry. The range is therefore inclusive of both metalines.
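As a rough illustration of the field layout described above, a harsahw.txt line can be split with a regular expression. This is a minimal sketch based only on the sample lines shown here; the names HW_LINE, parse_hw_line and entry_line_count are hypothetical and are not taken from hw.py.

```python
import re

# Field layout inferred from the sample lines; not taken from hw.py itself.
HW_LINE = re.compile(
    r"<L>(?P<L>[^<]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)"
    r"<k2>(?P<k2>[^<]+)<ln1>(?P<ln1>\d+)<ln2>(?P<ln2>\d+)$"
)

def parse_hw_line(line):
    """Split one harsahw.txt line into a dict of its six fields."""
    m = HW_LINE.match(line.strip())
    if m is None:
        raise ValueError("not a headword line: %r" % line)
    rec = m.groupdict()
    rec["ln1"] = int(rec["ln1"])
    rec["ln2"] = int(rec["ln2"])
    return rec

def entry_line_count(rec):
    """Number of lines the entry spans in harsa.txt, both metalines included."""
    return rec["ln2"] - rec["ln1"] + 1

rec = parse_hw_line("<L>1<pc>140<k1>sUrya<k2>sUrya<ln1>53<ln2>60")
print(rec["k1"], rec["ln1"], rec["ln2"], entry_line_count(rec))
```

For the sample entry the span is 8 lines: the <L> metaline, four <k1> lines, two verse lines and <LEND>.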

hw2.py

Reads harsahw.txt file as input. Generates harsahw2.txt file as output.

Format

140:ka:53,60:1
140:sUrya:53,60:1
140:veDas:53,60:1
140:suKa:53,60:1
140:mastaka:53,60:1
140:jala:53,60:1
140:Sloka:53,60:1
140:anuzwuB:53,60:1
140:yaSas:53,60:1
140:loka:53,60:1
140:Buvana:53,60:1
140:jana:53,60:1

The entries are in pc:k1:ln1,ln2:lnum format.
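The mapping from a harsahw.txt line to a harsahw2.txt line can be sketched as below. This is only an illustration of the two formats as they appear in the samples; hw_to_hw2 is a hypothetical helper, and the real hw2.py may treat key2 or other cases differently.

```python
def hw_to_hw2(line):
    """Rewrite '<L>..<pc>..<k1>..<k2>..<ln1>..<ln2>..' as 'pc:k1:ln1,ln2:L'."""
    fields = {}
    # Each field is introduced by a '<tag>' marker: split on '<', then on '>'.
    for chunk in line.strip().split("<")[1:]:
        tag, value = chunk.split(">", 1)
        fields[tag] = value
    # k2 is collected but unused here, since key1 and key2 coincide in the sample.
    return "{pc}:{k1}:{ln1},{ln2}:{L}".format(**fields)

print(hw_to_hw2("<L>1<pc>140<k1>ka<k2>ka<ln1>53<ln2>60"))
```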

hw0.py

Reads harsahw2.txt file as input. Generates harsahw0.txt file as output. In the present case harsahw2.txt and harsahw0.txt are identical, as there are no differences between key1 and key2 in these Sanskrit koshas.

redo_xml.sh

This script generates harsa.xml from harsa.txt and harsahw.txt with the help of the make_xml.py script.

make_xml.py

This is the most important workhorse of the whole process. This generates the XML file from TXT file.

harsa.xml

Format

<H1><h><key1>ka</key1><key2>ka</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>sUrya</key1><key2>sUrya</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>veDas</key1><key2>veDas</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>suKa</key1><key2>suKa</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>mastaka</key1><key2>mastaka</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>jala</key1><key2>jala</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>Sloka</key1><key2>Sloka</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>anuzwuB</key1><key2>anuzwuB</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>yaSas</key1><key2>yaSas</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>loka</key1><key2>loka</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>Buvana</key1><key2>Buvana</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>
<H1><h><key1>jana</key1><key2>jana</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>

The <h> tag holds key1 and key2. <body> holds <hwdetails> and <entrydetails>. <tail> holds L and pc. hwdetails is a list of hwdetail elements (each holding a hw-gender and meaning pair). entrydetails is a list of verses.
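To make the structure concrete, here is a small sketch that reads one <H1> record with Python's standard xml.etree.ElementTree. The record is copied from the sample above; read_record is a hypothetical helper written for this note, not part of make_xml.py.

```python
import xml.etree.ElementTree as ET

# One <H1> record from the sample, split across lines for readability.
RECORD = (
    "<H1><h><key1>sUrya</key1><key2>sUrya</key2></h><body><hwdetails>"
    "<hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail>"
    "<hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail>"
    "<hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail>"
    "<hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail>"
    "</hwdetails><entrydetails>"
    "<entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail>"
    "<entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail>"
    "</entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>"
)

def read_record(xml_text):
    """Return the head, body and tail parts of one <H1> record as a dict."""
    root = ET.fromstring(xml_text)
    pairs = [
        (d.findtext("hw/s"), d.findtext("meaning/s").split(","))
        for d in root.iterfind("body/hwdetails/hwdetail")
    ]
    verses = [e.findtext("s") for e in root.iterfind("body/entrydetails/entrydetail")]
    return {
        "key1": root.findtext("h/key1"),
        "key2": root.findtext("h/key2"),
        "L": root.findtext("tail/L"),
        "pc": root.findtext("tail/pc"),
        "hwdetails": pairs,
        "verses": verses,
    }

rec = read_record(RECORD)
print(rec["key1"], len(rec["hwdetails"]), len(rec["verses"]))
```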

I should stop at this juncture and analyse the information being captured in harsa.xml.

Flaws in harsa.xml

The following is the original information

<L>1<pc>140
<k1>ka-puM<meanings>sUrya,veDas
<k1>ka-klI<meanings>suKa,mastaka,jala
<k1>Sloka-puM<meanings>anuzwuB,yaSas
<k1>loka-puM<meanings>Buvana,jana
sUrye veDasi vAyO kaH kaM suKe mastake jale .
anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..
<LEND>

This shows that there are four headwords, each with its associated meanings. Ideally, when I search for sUrya, I should get only the following information.

<k1>ka-puM<meanings>sUrya,veDas
sUrye veDasi vAyO kaH kaM suKe mastake jale .
anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..
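The ideal lookup described here can be sketched as follows: keep only the hw-meaning line in which the searched word occurs, plus the verse lines. This is a minimal sketch under the assumption that an entry has exactly the line types shown in the sample; parse_entry and lookup are hypothetical names.

```python
ENTRY = [
    "<L>1<pc>140",
    "<k1>ka-puM<meanings>sUrya,veDas",
    "<k1>ka-klI<meanings>suKa,mastaka,jala",
    "<k1>Sloka-puM<meanings>anuzwuB,yaSas",
    "<k1>loka-puM<meanings>Buvana,jana",
    "sUrye veDasi vAyO kaH kaM suKe mastake jale .",
    "anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..",
    "<LEND>",
]

def parse_entry(lines):
    """Split one <L>...<LEND> block into (hw, meanings) pairs and verse lines."""
    pairs, verses = [], []
    for line in lines:
        if line.startswith("<L>") or line.startswith("<LEND>"):
            continue  # metalines carry no headword or verse text
        if line.startswith("<k1>"):
            hw, meanings = line[len("<k1>"):].split("<meanings>")
            pairs.append((hw, meanings.split(",")))
        else:
            verses.append(line)
    return pairs, verses

def lookup(lines, word):
    """Return only the hw-meaning pairs that mention `word`, plus the verses."""
    pairs, verses = parse_entry(lines)
    hits = [
        (hw, ms) for hw, ms in pairs
        # the headword is the part before the final '-gender' suffix
        if word == hw.rsplit("-", 1)[0] or word in ms
    ]
    return hits, verses

hits, verses = lookup(ENTRY, "sUrya")
print(hits)
```

Searching for ka with the same helper returns two pairs (ka-puM and ka-klI), since the word heads both.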

But the present information of sUrya in harsa.xml is as shown below.

<H1><h><key1>sUrya</key1><key2>sUrya</key2></h><body><hwdetails><hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas</s></meaning></hwdetail><hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail></hwdetails><entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails></body><tail><L>1</L><pc>140</pc></tail></H1>

Flaw 1. The record superfluously includes the hwdetail elements of the other three headwords: <hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail>.

Flaw 2. The entrydetails block is copied, unnecessarily, into all 12 headword records.

<entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails>

Because of these duplications, the file size increases dramatically. harsa.txt of 32 kb gets bloated to an 810 kb harsa.xml, i.e. roughly a 25-fold increase. That is OK for small lexica such as this one, but there are lexica around 2 MB in size, for which the bloat would be very high.

Proposed new harsa1.xml

<hwdetails>
<h><key1>ka</key1><key2>ka</key2><L>1</L><eid>1</eid><pc>140</pc></h>
<h><key1>sUrya</key1><key2>sUrya</key2><L>1</L><eid>1</eid><pc>140</pc></h>
<h><key1>veDas</key1><key2>veDas</key2><L>1</L><eid>1</eid><pc>140</pc></h>
<h><key1>ka</key1><key2>ka</key2><L>1</L><eid>2</eid><pc>140</pc></h>
<h><key1>suKa</key1><key2>suKa</key2><L>1</L><eid>2</eid><pc>140</pc></h>
<h><key1>mastaka</key1><key2>mastaka</key2><L>1</L><eid>2</eid><pc>140</pc></h>
<h><key1>jala</key1><key2>jala</key2><L>1</L><eid>2</eid><pc>140</pc></h>
<h><key1>Sloka</key1><key2>Sloka</key2><L>1</L><eid>3</eid><pc>140</pc></h>
<h><key1>anuzwuB</key1><key2>anuzwuB</key2><L>1</L><eid>3</eid><pc>140</pc></h>
<h><key1>yaSas</key1><key2>yaSas</key2><L>1</L><eid>3</eid><pc>140</pc></h>
<h><key1>loka</key1><key2>loka</key2><L>1</L><eid>4</eid><pc>140</pc></h>
<h><key1>Buvana</key1><key2>Buvana</key2><L>1</L><eid>4</eid><pc>140</pc></h>
<h><key1>jana</key1><key2>jana</key2><L>1</L><eid>4</eid><pc>140</pc></h>
</hwdetails>

<entrydetails>
<entry>  
<L>1</L>
<hwmeanings>
    <hwms><eid>1</eid><hw>ka-puM</hw><meanings>sUrya,veDas</meanings></hwms>
    <hwms><eid>2</eid><hw>ka-klI</hw><meanings>suKa,mastaka,jala</meanings></hwms>
    <hwms><eid>3</eid><hw>Sloka-puM</hw><meanings>anuzwuB,yaSas</meanings></hwms>
    <hwms><eid>4</eid><hw>loka-puM</hw><meanings>Buvana,jana</meanings></hwms>
</hwmeanings>
<body>
    <s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s>
</body>
<pc>140</pc>
</entry>
</entrydetails>

Here, eid is an extra id, which identifies the hw-meaning pair in anekArthaka koshas. It may be used similarly in samAnArthaka koshas. eid will run continuously throughout the file, so that it can serve as a tag for referring to a particular samAnArthaka or anekArthaka group, either internally or from other dictionaries.

If we search for a word among the headwords in the hwdetails section, we get L and eid for the searched word. We can use these to fetch the entry from the entrydetails section, and display only the relevant eid. The entry is shown indented just for the sake of readability. Otherwise, it would be on a single line.
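The two-step lookup described above might look like this in Python. It is a sketch, not an implementation of any existing script: the sample is abbreviated, the <harsa1> wrapper is a hypothetical root element added so that the text parses as one XML document, and search is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the proposed layout. <harsa1> is a hypothetical root
# element; the proposal above does not name one.
SAMPLE = """<harsa1>
<hwdetails>
<h><key1>ka</key1><key2>ka</key2><L>1</L><eid>1</eid><pc>140</pc></h>
<h><key1>sUrya</key1><key2>sUrya</key2><L>1</L><eid>1</eid><pc>140</pc></h>
<h><key1>ka</key1><key2>ka</key2><L>1</L><eid>2</eid><pc>140</pc></h>
<h><key1>suKa</key1><key2>suKa</key2><L>1</L><eid>2</eid><pc>140</pc></h>
</hwdetails>
<entrydetails>
<entry><L>1</L><hwmeanings>
<hwms><eid>1</eid><hw>ka-puM</hw><meanings>sUrya,veDas</meanings></hwms>
<hwms><eid>2</eid><hw>ka-klI</hw><meanings>suKa,mastaka,jala</meanings></hwms>
</hwmeanings><body><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></body><pc>140</pc></entry>
</entrydetails>
</harsa1>"""

def search(root, word):
    """For each headword row matching `word`, return (hw, meanings, verses)."""
    # Step 0: index the entries by L so that each is stored and read only once.
    entries = {e.findtext("L"): e for e in root.iterfind("entrydetails/entry")}
    results = []
    for h in root.iterfind("hwdetails/h"):
        if h.findtext("key1") != word:
            continue
        # Step 1: the headword row gives L and eid.
        entry = entries[h.findtext("L")]
        eid = h.findtext("eid")
        # Step 2: show only the hw-meaning pair carrying that eid.
        for hwms in entry.iterfind("hwmeanings/hwms"):
            if hwms.findtext("eid") == eid:
                verses = [s.text for s in entry.iterfind("body/s")]
                results.append((hwms.findtext("hw"), hwms.findtext("meanings"), verses))
    return results

root = ET.fromstring(SAMPLE)
for hw, meanings, verses in search(root, "sUrya"):
    print(hw, meanings, len(verses))
```

The verses are stored once per entry and reached through L, which is where the saving over the current harsa.xml comes from.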

As this format reduces the duplication problem, I will try to explore it. Otherwise, we will stick to the format used by Jim.

redo_postxml.sh

  1. Regenerate the sqlite file in pywork/sqlite directory, for storing.
  2. Regenerate query_dump.txt file in pywork/webtc2 directory, for search in case of advanced search.

Generate sqlite file

pywork/sqlite/sqlite.py is the script.

xxx.sqlite is an sqlite file holding a table with the name 'dictcode'. Each row of the table has three items: key, lnum and line. The row = (key, lnum, line) tuple is built from the xxx.xml file. The rows are inserted into the sqlite file in batches, 10000 entries per batch by default.
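The batching idea can be illustrated with Python's standard sqlite3 module. This is only a sketch of the approach described above, not the code of sqlite.py: the column types, the index name and the load_rows helper are assumptions, and the index on key is suggested only by the create_index line in the log.

```python
import sqlite3

def load_rows(conn, table, rows, batch_size=10000):
    """Insert (key, lnum, line) tuples into `table`, committing once per batch."""
    # The table name is interpolated directly, so it must come from trusted code.
    conn.execute(
        "CREATE TABLE %s (key TEXT NOT NULL, lnum TEXT, line TEXT NOT NULL)" % table
    )
    insert = "INSERT INTO %s VALUES (?, ?, ?)" % table
    batch, written = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            conn.commit()
            written += len(batch)
            batch = []
    if batch:  # flush the last, partially filled batch
        conn.executemany(insert, batch)
        conn.commit()
        written += len(batch)
    # Hypothetical index name; the index speeds up lookups by key.
    conn.execute("CREATE INDEX %s_key_idx ON %s (key)" % (table, table))
    return written

conn = sqlite3.connect(":memory:")
rows = [
    ("ka", "1", "<H1>...</H1>"),
    ("sUrya", "1", "<H1>...</H1>"),
    ("veDas", "1", "<H1>...</H1>"),
]
print(load_rows(conn, "harsa", rows, batch_size=2))
```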

Generate query_dump file

pywork/webtc2/init_query.py is the script.

It generates the query_dump.txt file from the xxx.xml file. It is used for advanced search.

funderburkjim commented 1 year ago

@drdhaval2785 If you decide to attempt to 'reduce the bloat', suggest you do it in v02.

One thing causing the bloat is the long tag-names (entrydetails, etc)

drdhaval2785 commented 1 year ago

Right. I am doing this in v2 only.

drdhaval2785 commented 1 year ago

Your comment regarding long tag names causing bloat is quite correct. After reducing the bloat, the file size of 554.7 kb of harsa.xml gets reduced to 227.5 kb of harsa1.xml file without any information loss.

drdhaval2785 commented 1 year ago

Created a file harsa1.xml, which is compact in nature with minimal duplication. The reduction is almost half. I have not pushed the changes yet.

I think there is no point in complicating the downstream applications like sqlite.py and the webtc2 logic to take care of this variation. Having an alternate XML suffices for my purpose. The Cologne workflow can continue as usual. The file harsa1.xml would be an additional resource, if I want to use it in stardict or other places.