EzerIT / BibleOL

Web-based instruction in Biblical Hebrew and Greek

Sampling Function in Exercise Creation #38

Open oliverglanz opened 6 months ago

oliverglanz commented 6 months ago

At this moment the MQL query results are randomized on the BibleOL side of the code with a PHP shuffle function. The negative side effect is that we always get a "democratic" representation of the data, which results in exercises that do not adequately test less frequent words, forms, or syntax elements. To prevent this behavior, Oliver has used TF searches and sampled the DataFrame of query results with pandas' sample function. The suggestion is that BibleOL add a sampling function alongside its shuffling function.
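To illustrate the difference between plain shuffling and grouped sampling, here is a minimal sketch in pandas. This is not BibleOL's actual code; the column names and frequencies are invented for the example:

```python
import pandas as pd

# Hypothetical query results with a deliberately skewed distribution of
# verbal stems; 'lex' and 'vs' are stand-ins for real BHSA/BibleOL features.
df = pd.DataFrame({
    "lex": [f"lex{i}" for i in range(100)],
    "vs":  ["qal"] * 80 + ["nif"] * 15 + ["piel"] * 5,
})

# Current behaviour (comparable to PHP's shuffle): a plain random permutation
# keeps the skew, so rare stems rarely appear among the first exercise items.
shuffled = df.sample(frac=1, random_state=1)
print(shuffled.head(10)["vs"].value_counts())

# Proposed behaviour: draw a fixed number of items per group, so rare stems
# are represented as often as frequent ones.
stratified = df.groupby("vs").sample(n=3, random_state=1, replace=True)
print(stratified["vs"].value_counts())
```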

A short description of Oliver's routine follows, as it might offer a guideline for an implementation of the sampling function in BibleOL (a video version is available here):

  1. First, the BHSA data is loaded in TF:

    from tf.app import use
    BHSa4c = use('etcbc/bhsa', version="c", mod='CenterBLC/BHSaddons/tf')

    With it, the BHSaddons (developed by Oliver) are loaded, which contain most BibleOL features as well as the BibleOL monad numbers that are needed for importing TF-based query results into BibleOL.

  2. A query is run with a specific exercise in mind, for example an exercise where 1-Guttural and 1-Aleph verbs are tested. In this case it is a word-based exercise, and thus we need all words of the BHS. A TF query is run:

    AllVerbs='''
    word bol_monad_num* bol_qere_presence* bol_lexeme_occurrences* bol_vt* dagesh* lex* number* vbe* vbs* uvf* prs* pfm* nme* freq_occ* freq_lex* st* rank_occ* bol_dict_abc* bol_dict_HebArm* bol_bhsa_word_order* bol_dict_vc* ps* nu* gn* vt* vs prs_nu* prs_ps* prs_gn* sp=verb pdp* bol_dict_EN* g_word_noaccent* language* 
    '''
    AllVerbs  = BHSa4c.search(AllVerbs)
    BHSa4c.table(AllVerbs, start=1, end=2, multiFeatures=True, condensed=False, colorMap={1: 'cyan'})
  3. The TF query results are exported as a spreadsheet

    BHSa4c.export(AllVerbs, toDir='/Users/oliverglanz/Library/CloudStorage/OneDrive-AndrewsUniversity/1200_AUS-research/Fabric-TEXT/2_OTST551-2_Hebrew/BOL_exercises/', toFile='BHSa4c_BOL_all_verb-morphology.tsv')
  4. The spreadsheet is loaded as a pandas DataFrame:

    import pandas as pd
    BHSallVerbalMorphology=pd.read_csv('/Users/oliverglanz/Library/CloudStorage/OneDrive-AndrewsUniversity/1200_AUS-research/Fabric-TEXT/2_OTST551-2_Hebrew/BOL_exercises/BHSa4c_BOL_all_verb-morphology.tsv', delimiter='\t', encoding='utf-16')

    When loaded it looks like this:

    [screenshot of the loaded DataFrame]
  5. Cleaning up and removing difficult forms: difficult forms are now identified and removed. For a detailed overview, consult this notebook.

  6. Sampling: the cleaned-up data is now sampled in such a way that I get a good distribution of verbal forms across person, number, gender, tense, stem, and pronominal suffixes:

    BHSallVerbalMorphologyOTST551_sampled=BHSallVerbalMorphologyOTST551_sampled \
                                    .groupby(['ps1',
                                              'gn1',
                                              'nu1',
                                              'vs1',
                                              'bol_vt1',
                                              'bol_dict_vc1',
                                              'prs_ps1',
                                              'prs_nu1',
                                              'prs_gn1']) \
                                    .sample(n=3, random_state=1, replace=True)\
                                    .sort_values(['bol_monad_num1',
                                                  'bol_dict_vc1',
                                                  'vs1',
                                                  'bol_vt1',
                                                  'ps1',
                                                  'nu1',
                                                  'gn1',
                                                  'prs_ps1',
                                                  'prs_nu1',
                                                  'prs_gn1'], 
                                                 ascending=True)
    BHSallVerbalMorphologyOTST551_sampled.head(10)

Each constellation of person, number, gender, etc. now has 3 samples.
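A quick way to verify this is to count the rows per constellation after sampling. This is a sketch that reuses the DataFrame and column names from the snippet above:

```python
# Sketch: every grouping constellation should now contribute exactly 3 rows
# (duplicates introduced by replace=True are only dropped in the next step).
group_cols = ['ps1', 'gn1', 'nu1', 'vs1', 'bol_vt1', 'bol_dict_vc1',
              'prs_ps1', 'prs_nu1', 'prs_gn1']
counts = BHSallVerbalMorphologyOTST551_sampled.groupby(group_cols).size()
print(counts.describe())                      # min and max should both be 3
print((counts != 3).sum(), "constellations deviate from 3 samples")
```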

  7. Dropping duplicates: although not strictly necessary, I always drop the duplicates that were created by sampling with replacement. This step is optional, since it somewhat reintroduces the "democratic" character of the data that the sampling procedure sought to reduce.

    BHSallVerbalMorphologyOTST551_sampled.drop_duplicates(subset="bol_monad_num1", keep='first', inplace=True)
  8. Exporting the data: the DataFrame is now exported.

    BHSallVerbalMorphologyOTST551_sampled.to_excel('/Users/oliverglanz/Library/CloudStorage/OneDrive-AndrewsUniversity/1200_AUS-research/Fabric-TEXT/2_OTST551-2_Hebrew/BOL_exercises/0_source_BHSa4c_BOL_morphology_verbs_OTST_552_Qualifier-Selection_unfiltered_v0.3.xlsx', encoding='utf-16')
  9. Preparing the monad numbers for the BibleOL exercise: first I copy the monad numbers of the exported data into Visual Studio Code to add commas after each number (a sketch for doing this directly in pandas follows after this list):

    [screenshot of the monad numbers in Visual Studio Code]
  10. Importing the monad numbers into BibleOL: now we import the data into a BibleOL exercise. We do this by first selecting the verbal classes we want:

    [screenshots of the verbal-class selection]
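As mentioned in step 9, the comma-separated monad list can also be produced directly from the sampled DataFrame rather than by hand-editing in Visual Studio Code. A minimal sketch, assuming the `bol_monad_num1` column used above:

```python
# Sketch: build the comma-separated monad-number list for the BibleOL
# exercise straight from the sampled DataFrame.
monads = (BHSallVerbalMorphologyOTST551_sampled['bol_monad_num1']
          .astype(int)
          .sort_values())
monad_string = ','.join(monads.astype(str))
print(monad_string)   # paste this string into the BibleOL exercise
```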

Then we add the monad numbers we have generated:

[screenshot of the monad numbers pasted into the exercise]

Now we need to make sure that the Sentence Units have the same information, so we copy-paste and remove the square brackets and the word NORETRIEVE:

[screenshot of the cleaned Sentence Units field]
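For those who prefer to do this last cleanup programmatically rather than by hand, a minimal sketch (the example input string is hypothetical; only the removal of brackets and NORETRIEVE mirrors the step above):

```python
# Sketch: strip the square brackets and the NORETRIEVE keyword from the
# copied text before pasting it into the Sentence Units field.
pasted = "[1234 NORETRIEVE] [1250 NORETRIEVE]"   # hypothetical example input
cleaned = (pasted.replace('[', '')
                 .replace(']', '')
                 .replace('NORETRIEVE', '')
                 .strip())
print(cleaned)
```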

Now a perfectly sampled exercise is ready to be used.

ernstboogert commented 6 months ago

Hi @oliverglanz, thank you so much for your well-elaborated request. Timothy and I will have a look, and I will let you know when we start working on this feature. I am sure we will also have some questions; we will ask them here. Greetings, Ernst