ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

A preprocessing script #87

Closed MSSRPRAD closed 6 months ago

MSSRPRAD commented 6 months ago

Just a simple script to make the data uniform. You might want to run it once on the data, @akprasad.

It ensures:

Samavrtta have 1 pada
Ardha-Samavrtta have 2 padas
Others have 4 padas
Duplicates are identified and removed

It creates a file, processed_vrttas.tsv, and modifies no other files.
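As a rough illustration of the rules above (not the actual contents of preprocess.py), the normalization could be sketched like this. It assumes each row is a (name, scheme) pair and that padas within a scheme are separated by `/`; the helper names `expand`, `normalize`, and `dedupe` are hypothetical.

```python
def expand(scheme: str) -> list:
    """Expand a 1-, 2-, or 4-pada scheme into a list of 4 padas.

    Assumption: padas are separated by "/" (e.g. "G/G/G/G").
    """
    padas = scheme.split("/")
    if len(padas) == 1:
        return padas * 4  # samavrtta: all four padas identical
    if len(padas) == 2:
        return padas * 2  # ardha-samavrtta: padas alternate a/b/a/b
    if len(padas) == 4:
        return padas
    raise ValueError(f"expected 1, 2, or 4 padas, got {len(padas)}: {scheme!r}")


def normalize(scheme: str) -> str:
    """Return the shortest uniform form: 1, 2, or 4 padas."""
    a, b, c, d = expand(scheme)
    if a == b == c == d:
        return a
    if a == c and b == d:
        return f"{a}/{b}"
    return "/".join((a, b, c, d))


def dedupe(rows):
    """Split (name, scheme) rows into unique rows and duplicate name pairs."""
    seen = {}  # normalized scheme -> first name that used it
    unique, duplicates = [], []
    for name, scheme in rows:
        key = normalize(scheme)
        if key in seen:
            duplicates.append((name, seen[key]))
        else:
            seen[key] = name
            unique.append((name, key))
    return unique, duplicates
```

With this sketch, `G/G/G/G` and `G` both normalize to `G`, so the second of the two would be reported as a duplicate of the first.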

I also thought it would be interesting to see how many schemes in the data are 'loose duplicates', that is, identical once the last syllable of each pada is ignored.

So the script prints the names of both duplicates and loose duplicates.
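A minimal sketch of how loose duplicates might be detected, assuming exact duplicates have already been removed, that all schemes are in the same normalized form, and that padas are separated by `/`. The names `loose_key` and `find_loose_duplicates` are hypothetical and not taken from the script.

```python
from collections import defaultdict


def loose_key(scheme: str) -> str:
    """Drop the last syllable of each pada (assumes non-empty padas)."""
    return "/".join(pada[:-1] for pada in scheme.split("/"))


def find_loose_duplicates(rows):
    """Group names whose schemes agree when the last syllable is ignored.

    `rows` is an iterable of (name, scheme) pairs. Only groups with more
    than one member are returned.
    """
    groups = defaultdict(list)
    for name, scheme in rows:
        groups[loose_key(scheme)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

For example, schemes `GGL` and `GGG` share the loose key `GG`, so their names would be grouped together.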

python3 ./scripts/preprocess.py ./data/vrtta.tsv

akprasad commented 6 months ago

Thanks for this! Simplifying the data format like this is a great idea. My one concern is whether the file size will bloat if using this client-side, but maybe that's a non-concern given that client-side assets will be gzipped.

The only immediate change I see is that for ignore_last_syllable, we probably want to append X as an "any" character rather than dropping it completely.
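To make the suggestion concrete, here is a hedged sketch of `ignore_last_syllable` that replaces the final syllable of each pada with `X` instead of dropping it. The signature, the `/` separator, and the `matches` helper are assumptions for illustration, not the script's actual code.

```python
def ignore_last_syllable(scheme: str) -> str:
    """Replace the last syllable of each pada with the wildcard "X".

    Assumes padas are non-empty and separated by "/".
    """
    return "/".join(pada[:-1] + "X" for pada in scheme.split("/"))


def matches(pattern: str, scheme: str) -> bool:
    """Check a scheme against a pattern where "X" matches any one syllable."""
    if len(pattern) != len(scheme):
        return False
    return all(p == "X" or p == s for p, s in zip(pattern, scheme))
```

Keeping `X` preserves the length of each pada, so a pattern like `GGX` still distinguishes a three-syllable pada from a two-syllable one, which plain truncation to `GG` would not.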

I'll fold this into my current commit once the Rust code is a bit more stable. I'm also trying to set up a simple demo for mailing lists, etc.

akprasad commented 6 months ago

I decided to keep the data format as-is so that the data is easier to visually inspect, e.g. G is easier to identify as a samavrtta than G/G/G/G.

For now, I also switched the data source to the Vrttaratnakara, since I'm not sure where the bigger list came from (I know it's from Dr. Dhaval Patel, but I'm not sure where past that) and want to keep the data here clean even if it's smaller.