Hedera-Lang-Learn / hedera

MIT License

Update Latin Import Data from DEV #453

Closed arthurian closed 1 year ago

arthurian commented 1 year ago

This PR refreshes the import data based on changes made in the DEV database in preparation for doing a final deployment to PROD. Once deployed to PROD, all new data curation should take place there (not DEV).

Summary

Changes

@bilbe Can you spot check to be sure the changes look right?

Notes

Update Process

I didn't create a script to do the updates, although one could easily do that if this process needs to be repeated. Since the data files map one-to-one with tables, I followed this process (roughly):

  1. Run a query against each table and export the results to a separate TSV file per table. I exported the query results from DataGrip (any DB IDE with query export works), or you can use the psql client with the COPY command or its \copy meta-command.
  2. Sort and then diff the old and new TSV files to see what has changed and sanity check (e.g. should be mostly additions).
  3. Replace the old TSV file with the new one, ensuring column headers are the same and then commit to git.
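The sanity check in step 2 ("should be mostly additions") can be reduced to two numbers. The following is a minimal sketch, not something used for this PR: `tsv_change_summary` is a hypothetical helper name, and it assumes both files start with a single header line.

```shell
# Hypothetical helper (not part of the repo): count rows added to and
# removed from a TSV export. Assumes each file has exactly one header line.
# Usage: tsv_change_summary OLD.tsv NEW.tsv
tsv_change_summary() {
  old_sorted=$(mktemp) || return 1
  new_sorted=$(mktemp) || return 1
  # Drop the header, then sort both files in the C locale so comm
  # sees the same ordering on each side.
  tail -n +2 "$1" | LC_ALL=C sort > "$old_sorted"
  tail -n +2 "$2" | LC_ALL=C sort > "$new_sorted"
  # comm -13 keeps rows only in the new file; comm -23 keeps rows only in the old file.
  added=$(LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | wc -l | tr -d ' ')
  removed=$(LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | wc -l | tr -d ' ')
  rm -f "$old_sorted" "$new_sorted"
  echo "added=$added removed=$removed"
}
```

For example, `tsv_change_summary form_to_lemma.tsv new_form_to_lemma.tsv` prints one line of counts; a large `removed` value would be a signal to look at the full diff before replacing the file.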

Example steps to update form_to_lemma.tsv

  1. Export results of this query to TSV:
SELECT f.form, l.lemma
FROM lemmatization_lemma l
         JOIN lemmatization_formtolemma f ON l.id = f.lemma_id
WHERE l.lang = 'lat'
ORDER BY f.form, l.lemma;
  2. Sort the old TSV file, since it was not already sorted (sponge is part of moreutils):
cat form_to_lemma.tsv | sort | sponge form_to_lemma.tsv

The caveat with the above is that the header line should not be included in the sort, so you can either remove it and add it back manually, or do something like this:

tail -n+2 form_to_lemma.tsv | sort | sponge form_to_lemma.tsv
printf 'FORM\tLEMMA\n' | cat - form_to_lemma.tsv | sponge form_to_lemma.tsv
  3. Diff the old and new to sanity check:
diff -us form_to_lemma.tsv new_form_to_lemma.tsv
  4. Replace with the new one if it looks good and commit the changes:
mv new_form_to_lemma.tsv form_to_lemma.tsv
git add form_to_lemma.tsv
git commit

Note: if you skip the separate diff step, replace the file directly, and rely on git diff instead, the output won't be useful: it will show the changes caused by re-sorting the file mixed in with the actual data changes. To use git diff, commit the sorted old file first and then replace it, so that git only shows the differences introduced after the sort. Either method works.
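If the header-preserving sort needs to be repeated, it can be wrapped in a small function that does not depend on sponge. This is a sketch under assumptions rather than the process used here: `sort_tsv_keep_header` is a hypothetical name, the file is assumed to have exactly one header line, and `LC_ALL=C` is chosen only to make the ordering reproducible.

```shell
# Hypothetical helper (not part of the repo): sort the data rows of a TSV
# in place while keeping the first (header) line at the top.
# Usage: sort_tsv_keep_header FILE.tsv
sort_tsv_keep_header() {
  tmp=$(mktemp) || return 1
  {
    head -n 1 "$1"                    # header line, left untouched
    tail -n +2 "$1" | LC_ALL=C sort   # data rows, sorted bytewise
  } > "$tmp" && cat "$tmp" > "$1"     # write back only if the sort succeeded
  rm -f "$tmp"
}
```

Running it on both the old and the new file before diffing means the comparison does not depend on the database's ORDER BY collation matching the ordering used by `sort`.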

arthurian commented 1 year ago

@jaguillette Good point. I didn't use a script for this, so I just updated the description with the process I followed. It could be scripted if this turns out to be more than a one-time task.