ML-Herbarium: Feature - label transcription dictionaries & code clean up

Cleaned up the code and added a main function
Converted lists to dictionaries wherever applicable
Parallelized most compute-intensive functions
Removed duplicate taxon genera in datasetscraping.py
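
As an illustration of the list-to-dictionary conversion and the genus de-duplication, here is a minimal sketch. All names and sample values (`genera`, `ground_truth_pairs`, the image ids) are hypothetical and not taken from the repository.

```python
# Hypothetical sample data; the real values come from the scraped dataset.
genera = ["Quercus", "Acer", "Quercus", "Pinus", "Acer"]

# dict.fromkeys keeps the first occurrence of each key and preserves
# insertion order, so duplicates are dropped without reshuffling the list.
unique_genera = list(dict.fromkeys(genera))

# A list of (image_id, taxon) pairs needs a linear scan for every lookup...
ground_truth_pairs = [("img_001", "Quercus alba"), ("img_002", "Acer rubrum")]

# ...while a dictionary keyed by image id gives average constant-time lookups.
ground_truth = dict(ground_truth_pairs)

print(unique_genera)
print(ground_truth["img_002"])
```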

Tested transcription with new data. The resulting metrics are:


taxon acc: 28/120 = 23.333333333333332%
taxon no match: 39/120 = 32.5%
taxon wrong: 53/120 = 44.166666666666664%

geography acc: 1/120 = 0.8333333333333334%
geography no match: 45/120 = 37.5%
geography wrong: 74/120 = 61.66666666666667%
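
For reference, here is a rough sketch of how the three counts could be tallied. It assumes that "no match" means no corpus entry cleared the similarity threshold and that "wrong" means the best match differed from the ground truth. That reading, the use of `difflib`, the 0.8 cutoff, and the names `classify` and `tally` are all assumptions for illustration, not the code in this PR.

```python
from collections import Counter
from difflib import get_close_matches


def classify(ocr_text, truth, corpus, cutoff=0.8):
    """Label one transcription as 'acc', 'no match', or 'wrong'."""
    # Best corpus entry whose similarity ratio is at least `cutoff`, if any.
    candidates = get_close_matches(ocr_text, corpus, n=1, cutoff=cutoff)
    if not candidates:
        return "no match"
    return "acc" if candidates[0] == truth else "wrong"


def tally(predictions, truths, corpus, cutoff=0.8):
    """Count each outcome and return {outcome: (count, total)}."""
    counts = Counter(
        classify(p, t, corpus, cutoff) for p, t in zip(predictions, truths)
    )
    total = len(truths)
    return {key: (counts[key], total) for key in ("acc", "no match", "wrong")}


# Hypothetical sample data, not taken from the herbarium dataset.
corpus = ["Quercus alba", "Acer rubrum", "Pinus strobus"]
predictions = ["Quercus alba", "Acer rubrun", "zzzz", "Pinus strobus"]
truths = ["Quercus alba", "Acer rubrum", "Pinus strobus", "Acer rubrum"]

result = tally(predictions, truths, corpus)
for key, (n, total) in result.items():
    print(f"taxon {key}: {n}/{total} = {100 * n / total}%")
```

Under this reading, lowering the cutoff moves labels out of "no match" and into either "acc" or "wrong", so the three figures should be interpreted together rather than individually.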


The accuracy is low (about 23% for taxon and under 1% for geography). Possible reasons, listed from most to least likely:
1. The corpus and ground truth files do not contain text that accurately matches the labels
2. The matching algorithm needs to be fine-tuned or replaced
3. We should use segmentation to de-noise the label images before OCR
4. Our OCR model needs to be retrained or replaced
