digitalpalireader / digitalpalireader

A downloadable web application for immersive study of the Pāli language and the Tipitaka.
https://www.digitalpalireader.online/

Glitches in DPR texts with āa and aā #288

Open bdhrs opened 3 years ago

bdhrs commented 3 years ago

There are some mixups in the texts with compounds containing aaa i.e. āa or aā.

DPR renders everything uniformly as āa

Sometimes this is correct e.g. mahāaggikkhandho

and sometimes incorrect e.g. bhedāapatti, majjhimāagame, where it should be aā.

There are only a few words in the vinaya and sutta piṭaka which contain this particular compound, easy enough to repair by hand.

[image: āa mula]

but thousands in the abhidhamma and commentaries.

[image: āa tipitaka]

AC:

rrogowski commented 3 years ago

Venerable @bdhrs, @parthopdas: I took a look and I think this is a problem with the underlying XML data that is analyzed by the DPR. For example, bhedāapatti appears in the XML for Book 4 of the Aṅguttara (Mūla): https://github.com/digitalpalireader/digitalpalireader/blob/master/tipitaka/my/a4m.xml#L2425, whereas it should be bhedaāpatti. How were the XML files generated?

I don't think this is a bug with the way DPR is displaying results, as I am able to search for words that contain aā: https://www.digitalpalireader.online/_dprhtml/index.html?feature=search&type=0&query=\wa%C4%81\w&MAT=m&set=n&book=1&part=1&rx=true. Please let me know if I have misunderstood the original issue!
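For anyone who wants to reproduce the search outside the DPR, here is a minimal Python sketch of the same idea as the `\waā\w` query above, widened to catch both orderings. The function name `find_candidates` and the sample string are illustrative assumptions, not part of the DPR code base. It normalizes to NFC first, on the assumption that some source files might encode ā as `a` plus a combining macron.

```python
import re
import unicodedata

A_LONG = "\u0101"  # the precomposed letter ā

# Whole words containing either "āa" or "aā".
# \w is Unicode-aware for str patterns in Python 3.
LONG_SHORT_A = re.compile(rf"\w*(?:{A_LONG}a|a{A_LONG})\w*")


def find_candidates(text: str) -> list[str]:
    """Return every word in `text` that contains āa or aā."""
    # NFC folds "a" + combining macron (U+0304) into the single code point ā,
    # so both encodings of the same letter are matched.
    normalized = unicodedata.normalize("NFC", text)
    return LONG_SHORT_A.findall(normalized)


if __name__ == "__main__":
    # Hypothetical sample built from the words quoted in this issue.
    sample = "mahāaggikkhandho bhedāapatti majjhimāagame bhedaāpatti dhammo"
    for word in find_candidates(sample):
        print(word)
```

The output is only a list of candidates: as noted in the original report, āa is sometimes correct, so each hit still has to be checked against a reference text.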

parthopdas commented 3 years ago

@rrogowski afaik it was a custom process used by v @yuttadhammo years back. i think the VRI corpus was the base.

you are most likely right in your assessment and if so we'll need to fix the xml files with the caveat in the PS. do you know what the corresponding VRI texts say? https://tipitaka.org/romn/

i think this is a good discussion for the dpr channel.

PS: this is item 6 in https://discord.com/channels/780067275008376862/786141053090660362/822972211711180801

rrogowski commented 3 years ago

Just want to note that, per our discussion on Discord, we will be using the VRI as the source of truth for the DPR Burmese texts. By tackling this long-term problem, we will fix the short-term problem described in this issue. The next steps seem to be:

parthopdas commented 3 years ago

@rrogowski I will also add: please make a judgment call on this.

I certainly prefer we fix this problem once and for all as it opens a bunch of possibilities and aligns well with overall DPT roadmap.

However you're doing the work and I'd rather you do stuff that interests you than chase some random vision / roadmap.

All work you do in DPR is impactful by definition.

rrogowski commented 3 years ago

@parthopdas I'm happy to continue working on this issue!

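I manually browsed each commit in the Git history for the Myanmar Tipitaka XML files https://github.com/digitalpalireader/digitalpalireader/commits/95de827e624a5c41ad07a8552939cd366a95d43b/DPRMyanmar/content/xml (there were only a couple dozen total). Of these commits, the following contain Pali fixes.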

Apr 20, 2020

Aug 28, 2014

Aug 25, 2012

Sep 23, 2011

Sep 3, 2011

There are few enough typos that I think we can fix them manually in the VRI texts. So here's what I was thinking about doing from this point forward:

  1. Submit a PR to VRI fixing the typos identified in the commits above. (I'm assuming we will want corrections to be pushed here moving forward, so that we can maintain a single source of truth. Does that seem right? In turn, the XML files for the DPR will be auto-generated from the VRI texts.)
  2. Begin working on a script to generate the Myanmar Tipitaka XML files for the DPR from the VRI texts. Ideally, the auto-generated results should closely match the existing DPR XML files in their current state, with the exception of known discrepancies such as confusing āa and aā.
  3. Create a PR with the regenerated XML files, which would in turn resolve this short-term issue.
  4. (Future, potentially) Remove the Myanmar XML files from the DPR source code and generate them as part of the install / build process. Add the auto-generated directory to .gitignore. This will help prevent the case where changes are accidentally made to the DPR XML files and not the VRI texts (single source of truth).
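The check described in step 2 could be sketched roughly as follows. This is not the actual generation script, only an illustration of how differences between the existing and regenerated text might be sorted into "known āa/aā swap" versus "unexpected". The names `fold_a_order`, `classify`, and `report` are hypothetical, and the sketch assumes the two texts have already been split into word lists that line up one-to-one.

```python
import unicodedata

A_LONG = "\u0101"   # the precomposed letter ā
PLACEHOLDER = "\0"  # stands in for either ordering; assumed absent from the texts


def fold_a_order(word: str) -> str:
    """Map both 'āa' and 'aā' to one placeholder so the two orderings compare equal."""
    word = unicodedata.normalize("NFC", word)
    return word.replace(A_LONG + "a", PLACEHOLDER).replace("a" + A_LONG, PLACEHOLDER)


def classify(existing: str, regenerated: str) -> str:
    """Label a word pair as 'same', 'known-swap', or 'unexpected'."""
    old = unicodedata.normalize("NFC", existing)
    new = unicodedata.normalize("NFC", regenerated)
    if old == new:
        return "same"
    if fold_a_order(old) == fold_a_order(new):
        return "known-swap"
    return "unexpected"


def report(existing_words: list[str], regenerated_words: list[str]) -> list[tuple[str, str, str]]:
    """List every differing pair with its label; assumes aligned word lists."""
    if len(existing_words) != len(regenerated_words):
        raise ValueError("word lists are not aligned; a real script would need a proper diff")
    pairs = zip(existing_words, regenerated_words)
    return [(o, n, classify(o, n)) for o, n in pairs if classify(o, n) != "same"]


if __name__ == "__main__":
    # Hypothetical data: the second word shows the swap from this issue,
    # the third is an unrelated difference that should be reviewed by hand.
    existing = ["mahāaggikkhandho", "bhedāapatti", "dhammo"]
    regenerated = ["mahāaggikkhandho", "bhedaāpatti", "dhamo"]
    for old, new, kind in report(existing, regenerated):
        print(f"{kind}: {old} -> {new}")
```

Anything labelled "unexpected" would be a reason to look at the generation script or the source text before opening the PR in step 3.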

What do you think?

bdhrs commented 3 years ago

Two of the recurring answers when I ask Sri Lankan and Burmese monks, "Why don't you use DPR?" are:

  1. not a good transliteration (that's been solved recently with the 17 scripts)
  2. incomplete set of texts

What you are suggesting would sort out the main problem.

As far as single source of truth goes, that must be VRI repository <snipped by partho, details in private on discord>
