YaleDHLab / voynich

Analyzing the Voynich Manuscript with computer vision
https://github.com/YaleDHLab/voynich/projects/1

metadata prep fails #55

Open chirila opened 4 years ago

chirila commented 4 years ago

This file does not exist; I don't know why it should be listed in the keys (nothing has been deleted, to my knowledge, since /output was created):


KeyError                                  Traceback (most recent call last)
<ipython-input-...> in <module>()
      6 bn = os.path.basename(i)
      7 parent_bn = '-'.join(bn.split('-')[:-1]) + '.jpg'
----> 8 fig_to_meta[bn] = deepcopy(img_to_meta[parent_bn])
      9 fig_to_meta[bn].update({
     10     'image': bn,

KeyError: '6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg'
duhaime commented 4 years ago

Interesting, this is a little tricky to diagnose because there are a few moving parts in this system. From what I can tell, though, that image is present on disk at ~/cc/voynich/data/morgan/images/6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg, but it does not appear in img_to_meta (hence the KeyError).
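
A quick check along these lines should confirm the split (a minimal sketch; it assumes the notebook's img_to_meta dict is still in scope and that the images live under the path above):

```python
import os

fn = '6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg'
img_dir = os.path.expanduser('~/cc/voynich/data/morgan/images')  # path from the comment above

print('on disk:', os.path.exists(os.path.join(img_dir, fn)))  # expected: True
print('in img_to_meta:', fn in img_to_meta)                    # expected: False, per the KeyError
```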

There are a few reasons why that image might be missing from the metadata map. This notebook assumes that all cells are run in linear order from top to bottom, and also that the disk is static aside from the operations performed by the notebook itself. If either assumption doesn't hold, data-mapping problems like the one above can occur.
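
To gauge how far things have drifted, one could count every figure whose parent page is missing from img_to_meta before running the failing cell. A rough sketch, assuming img_to_meta is still in scope and that the figure crops follow the parentname-n.jpg naming shown in the traceback (the glob path is a placeholder for wherever the notebook writes its crops):

```python
import os
from glob import glob

figure_paths = glob('output/figures/*.jpg')  # placeholder path for the figure crops

missing = set()
for i in figure_paths:
    bn = os.path.basename(i)
    parent_bn = '-'.join(bn.split('-')[:-1]) + '.jpg'  # same parent-name logic as the failing cell
    if parent_bn not in img_to_meta:
        missing.add(parent_bn)

print(len(missing), 'parent pages missing from img_to_meta')
print(sorted(missing)[:10])
```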

For the present, would it be alright to try one push through the pipeline with an absolutely minimal dataset? I would start by moving everything in voynich/data to some other location on disk, then adding a small sample collection to voynich/data. That should come through the pipeline just fine (I've just processed a small collection locally and updated some of the values in the notebook to guard against wonky data). If you have any trouble with the small collection, just let me know and we'll be able to investigate much more easily than with the massive pipeline.
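
Something like the following would stage that minimal run (a sketch only, run from the repo root; the sample collection name and the 10-image count are arbitrary, and data/morgan/images is the layout referenced above):

```python
import os
import shutil

data_dir = 'data'          # voynich/data
backup_dir = 'data-full'   # any location the notebook won't touch

# set the full collection aside so the notebook only sees the sample
shutil.move(data_dir, backup_dir)

# copy a handful of page images back in as a small test collection
sample_dst = os.path.join(data_dir, 'sample', 'images')
os.makedirs(sample_dst)
full_images = os.path.join(backup_dir, 'morgan', 'images')
for fn in sorted(os.listdir(full_images))[:10]:
    shutil.copy(os.path.join(full_images, fn), sample_dst)
```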

Longer-term, there are a few options worth considering. Right now this notebook is pretty "open" and flexible, rather than closed and robust. That seemed appropriate for a research-oriented task, but a more closed and robust pipeline may be better suited to the scale of data we now want to process.

To transition toward a more closed system, one option would be to factor the voynich notebooks into a single resource that partitions images into figures and creates metadata for each figure. Those outputs could then be fed to the neural neighbors data pipeline, which is more robust. The vectorization strategy of the nn pipeline is entirely isolated and can be changed in place very easily: one could, for example, use code like the convolutional autoencoder in the voynich notebook to train a custom model, persist those model weights, then load them into the vectorizer in the nn pipeline. That would let us remove most of the brittle data lookups in this notebook, such as the one that threw the error in this issue...
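
As a rough illustration of that last point, the swapped-in vectorizer could look something like the sketch below. The encoder file name, input size, and the vectorize() signature are all assumptions for illustration, not the nn pipeline's actual API:

```python
import numpy as np
from PIL import Image
from tensorflow import keras

# load the encoder half of the autoencoder trained in the voynich notebook
# ('encoder.h5' is a placeholder for wherever the model weights were persisted)
encoder = keras.models.load_model('encoder.h5')

def vectorize(image_path, target_size=(128, 128)):
    '''Return a flat feature vector for one figure crop.

    This stands in for whatever hook the nn pipeline exposes for swapping
    vectorization strategies; the signature here is hypothetical.
    '''
    img = Image.open(image_path).convert('RGB').resize(target_size)
    arr = np.asarray(img, dtype='float32') / 255.0
    vec = encoder.predict(arr[np.newaxis, ...])  # add a batch dimension
    return vec.reshape(-1)
```

Persisting the trained encoder from the voynich notebook (e.g. with model.save) would then be the only coupling point between the two codebases.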