Open keighrim opened 5 years ago
The new data set (19d) has very different characteristics from the data used in the June demo (534), so we need to rethink the NLP pipeline we'll use for the upcoming demo.
Starting from the raw tesseract output, here is the size (only printing the last line):
$ wc /data/dtriac/dtriac-19d/all/*/tesseract-300dpi-20p.txt
13958734 63579505 400100406 total
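To make the totals easier to read, here is a minimal Python sketch that parses the summary line, assuming `wc` was run with its default columns (lines, words, bytes). The helper name `parse_wc_total` is hypothetical and not part of any existing script; the bytes-per-word figure includes whitespace, so treat it as a rough indicator only.

```python
def parse_wc_total(line: str) -> dict:
    """Split a `wc` summary line into its line, word, and byte counts.

    Assumes the default `wc` column order: lines, words, bytes, label.
    """
    lines, words, nbytes, label = line.split()
    return {
        "lines": int(lines),
        "words": int(words),
        "bytes": int(nbytes),
        "label": label,
    }


# Summary line copied from the wc output reported in this issue.
stats = parse_wc_total("13958734 63579505 400100406 total")

# Rough density figures for the OCR output.
words_per_line = stats["words"] / stats["lines"]
bytes_per_word = stats["bytes"] / stats["words"]

print(f"{stats['lines']:,} lines, {stats['words']:,} words, "
      f"{stats['bytes'] / 1e6:.1f} MB")
print(f"{words_per_line:.2f} words per line, "
      f"{bytes_per_word:.2f} bytes per word")
```

A low words-per-line ratio like this may simply reflect how tesseract breaks lines in its output, so it is worth checking a few files by hand before drawing conclusions about the data.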
Pipeline: