digitalpalidictionary / dpd-db


How to generate a "complete" database, including all sandhi, etc.? #18

Closed (gambhiro closed this 9 months ago)

gambhiro commented 9 months ago

Starting with a clone of the tip:

git clone --depth=1 https://github.com/digitalpalidictionary/dpd-db.git
cd dpd-db
git submodule init && git submodule update
poetry install
poetry run bash bash/initial_setup_run_once.sh

What is the next step? There are several scripts with "sandhi" in them. Is there a specific order? Command line options? config.ini settings?

Is bash/build_db.sh necessary at this point?

gambhiro commented 9 months ago

See the instructions in the README:

https://github.com/digitalpalidictionary/dpd-db#build-a-complete-database-locally

Devamitta commented 9 months ago

Doesn't poetry run bash bash/initial_setup_run_once.sh already create config.ini with default parameters?

Or, if we need to change something in the default config.ini, we can add it to

bash/initial_setup_run_once.sh

Config update: deconstructor - all_texts - yes

There is no point in changing make_dpd = no, because it only affects make_dict.sh.

There is also no need to set make_deconstructor - yes, since you run build_db.sh anyway, which will generate all sandhi.

Please see the instructions for the config.ini parameters:

https://github.com/digitalpalidictionary/dpd-db/blob/main/dps/relationship.md
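
For anyone who would rather script the config update above than edit the file by hand, here is a minimal sketch using Python's standard configparser. It assumes config.ini is a plain INI file with a deconstructor section and an all_texts option, as the comment above suggests; the set_option helper is a hypothetical name, not something in the repo. Note that configparser rewrites the whole file and drops any comments in it.

```python
import configparser
import tempfile
from pathlib import Path


def set_option(config_path, section, option, value):
    """Set one option in an INI file, creating the file or section if missing."""
    config = configparser.ConfigParser()
    config.read(config_path, encoding="utf-8")  # a missing file is simply skipped
    if not config.has_section(section):
        config.add_section(section)
    config.set(section, option, value)
    with open(config_path, "w", encoding="utf-8") as f:
        config.write(f)  # rewrites the file; comments are not preserved


if __name__ == "__main__":
    # Demo on a throwaway file so the real config.ini is not touched.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.ini"
        set_option(path, "deconstructor", "all_texts", "yes")
        print(path.read_text(encoding="utf-8"))
```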

Devamitta commented 9 months ago

I made a bash script, build_and_make_all, which builds the whole db and exports all the dictionaries on a first run. I also corrected the README.

https://github.com/digitalpalidictionary/dpd-db#build-a-complete-database-locally-and-extract-all-dictionaries

gambhiro commented 9 months ago

The export formats are a separate concern from building the database, and when the compilation time is so long, exports should not be lumped together with the database.

Could you please separate these tasks? Building the db could be a direct dependency of exporting, i.e. the exporter task can run building the db.

gambhiro commented 9 months ago

I've seen the .md doc Ven @devamitta, and I'll say it's a start, but indeed it's a relationship and it's complicated. I make notes like this all the time too. At this point it is like condensed notes to yourself, not an explanation to someone else.

Regardless, it shouldn't be necessary to read about config options to run common build operations.

Are you familiar with task runners? This is what they are for, i.e. to script the steps for build targets, and the steps for building the files they depend on or checking other conditions. I've mentioned using one or another to Ven @bdhrs.

One example is GNU make and its Makefile, but make has such obscure behaviours that I would avoid it for anything complex.

'doit' is Python-based, it seems popular and the syntax looks readable: https://github.com/pydoit/doit

It can execute shell commands or Python functions as actions: https://pydoit.org/tasks.html#actions

That would lend itself well to your present workflow, and allow specifying what build tasks there are, creating blocks of steps to compose the more complex tasks, what depends on what, running checks before the actions, etc.
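
To make the suggestion concrete, here is a rough sketch of what a dodo.py for doit might look like, with the export task depending on the database build as requested earlier in the thread. Only bash/build_db.sh is a name taken from this thread; bash/export_all.sh is a hypothetical placeholder, and the task names are illustrative rather than anything the project defines.

```python
# dodo.py: a hypothetical sketch, not the project's actual build definition.
# doit collects every function named task_* and reads the dict it returns.


def task_build_db():
    """Build the database by wrapping the existing bash script."""
    return {
        # A string action is executed as a shell command.
        "actions": ["poetry run bash bash/build_db.sh"],
        "verbosity": 2,  # show the script's output while it runs
    }


def task_export():
    """Export dictionaries; build_db is always run first."""
    return {
        # Hypothetical script name, standing in for whatever does the export.
        "actions": ["poetry run bash bash/export_all.sh"],
        "task_dep": ["build_db"],
    }
```

With a file like this in the repo root, doit list would show both tasks with their docstrings, and doit export would run build_db before the export step.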

bdhrs commented 9 months ago

Thanks for volunteering to set up a task runner, bhante.

Please make one for your specific requirements of building the db with all deconstructed sandhi compounds, and the associated config.ini settings. If it's easily readable, and gives better results than good old bash scripts then we can set up some more runners for other common situations.

I mostly run the same process daily, but occasionally need to activate some different config settings. I'm sure @Devamitta has a similar workflow.

gambhiro commented 9 months ago

> Thanks for volunteering to set up a task runner, bhante.

I didn't say that, it's been my experience to not mess with your workflow 😁 And again I'm concerned that if I write it, you won't be able to adapt it for the contexts that are currently habitual for you and Ven Devamitta.

If you have specific "how do I ..." questions, that's probably where I can help.

The machine is also good at providing a head start without having to read much docs:

Using the doit Python task management tool, how to add... https://www.phind.com/search?cache=k6tbbf5lu6ixx7q2thb7jy6c

Devamitta commented 9 months ago

We are currently able to perform all these tasks by running various bash scripts. I don't understand why we should create something that serves the same purpose. Ajahn @gambhiro can you explain the benefits of investing time in something without additional functionality?

bdhrs commented 9 months ago

@gambhiro It's been the standard procedure on the main work of DPD, the Pāḷi data, to treat all complaints and criticism on the feedback forms as offers of help. This largely accounts for some of the rapid progress in specific areas of the Pāḷi database. Perhaps coders are a little more cagey than the linguists and meditators.

Are the latest updates in the bash scripts folder comprehensible? There's really not much more required than that.

gambhiro commented 9 months ago

We tried the approach that "I will quickly code it for you", and I don't think the results of that protocol were satisfactory for either of us.

I am offering to help, but instead of coding, now I am doing more talking. If the solutions I am suggesting aren't suitable for you, that's actually a useful result, don't you think? It's better than having coded it without discussion.

I raise the confusion and errors I am running into. Ven. @Devamitta responded and tidied up makedict.sh and the db build scripts. Even though it's just moving around some lines of bash, I shouldn't have attempted it, because I don't have the context information for it.

gambhiro commented 9 months ago

Regarding build tasks, the abstract purpose of a task runner is organization and clarification (and a sort of documentation).

You have probably encountered the classics: make build, make clean, make install. Often a project will have a handful of build targets. These collect and organize perhaps dozens of steps into purposeful end tasks. (But I'm not suggesting to use GNU make.)

Currently there are several scripts in bash/ and scripts/ (why two folders anyway? are there other build scripts?) It is terribly confusing how to use this, which are build scripts, what are prep steps, etc. It also makes debugging hard. (Is an error happening b/c this isn't the right script? Or it needs a config variable? Or a prep-step before it?)

Now, you could do this with bash only, for example:

Having accumulated a bunch of build scripts, some level of organization would be helpful not only to make it comprehensible how to use parts of the system, but also how to add new targets or adapt parts of it.

gambhiro commented 9 months ago

The other thing that stood out for me is that the bash scripts are mostly a list of Python scripts. You are not doing much "bash scripting" in them.

That would lend itself to converting to .py scripts, where in main() you import the main() of the other .py scripts, or more specific task-related functions.

The added benefit is the shared environment: import and run setup functions only once, create config options, pass along something like a limit-items variable, pass inputs and outputs between build steps as function arguments, no ambiguity about poetry run ... or not, VS Code would jump between functions as usual, etc.

Anyway, food for thought.
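
A minimal sketch of that shape, with stand-in steps: every name here (BuildConfig, collect_words, measure_words) is hypothetical and the word list is placeholder data, not anything from dpd-db. The point is only that setup happens once and each step receives the previous step's output as an argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildConfig:
    """Options created once and shared by every step (hypothetical)."""
    limit: Optional[int] = None  # process only the first N items, if set


def collect_words(config: BuildConfig) -> list[str]:
    """Stand-in for a build step that used to be its own script."""
    words = ["dhammo", "buddho", "saṅgho"]  # placeholder data
    return words if config.limit is None else words[: config.limit]


def measure_words(words: list[str]) -> dict[str, int]:
    """Stand-in for a later step; its input is the earlier step's output."""
    return {word: len(word) for word in words}


def main(limit: Optional[int] = None) -> dict[str, int]:
    config = BuildConfig(limit=limit)  # set up once, pass along
    words = collect_words(config)
    return measure_words(words)


if __name__ == "__main__":
    print(main(limit=2))
```

In a real conversion, the stand-in functions would be replaced by imports of the existing scripts' own main() or task functions.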

gambhiro commented 9 months ago

@bdhrs @Devamitta the build_and_make_all.sh script is a helpful example, thank you. It looks like I will have to cook up my own version but it's instructive to see the context for it.

Re: task runners, if you don't see the benefit then that's what we learnt from the discussion, in other words you determined that it is out of scope.

As for me, I think life is too short for debugging two things: vanilla JavaScript ("undefined is not a function", anyone?) and bash. set -e is kinda neat until it isn't.

Say you want to remove *.tar.bz2 files if they exist. If they don't exist, there is nothing to do, so move on. But when the glob matches nothing, rm exits with 1 and, under set -e, the build script stops. StackOverflow revealed that it's as simple as:

if stat -t "$RELEASE_DIR"/*.tar.bz2 >/dev/null 2>&1; then rm "$RELEASE_DIR"/*.tar.bz2; fi
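
For comparison, the same "remove if present, otherwise move on" step could look like this in a Python build script. It is only a sketch: remove_archives is a hypothetical helper, and the demo runs in a temporary directory rather than a real release folder. An empty glob yields an empty loop, so no existence check is needed.

```python
import tempfile
from pathlib import Path


def remove_archives(release_dir, pattern="*.tar.bz2"):
    """Delete files matching pattern in release_dir; return how many were removed."""
    removed = 0
    for archive in Path(release_dir).glob(pattern):
        archive.unlink()
        removed += 1
    return removed  # 0 when nothing matched, which is not an error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as release_dir:
        (Path(release_dir) / "old-release.tar.bz2").touch()
        print(remove_archives(release_dir))  # one file removed
        print(remove_archives(release_dir))  # nothing left, still fine
```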

I'm looking at xonsh to rewrite my build script, inline Python in the shell would be neat.

Or the more radical nushell looks good, it actually knows what a list is.