brandeis-llc / dtriac-pipeline

Preprocessing pipelines for DTRIAC project
Apache License 2.0

DTRIAC NLP Pipeline #10

Open keighrim opened 5 years ago

keighrim commented 5 years ago

The new data set (19d) has very different characteristics from the data used in the June demo (534), so we need to rethink the NLP pipeline we'll use for the upcoming demo.

marcverhagen commented 4 years ago

Starting from the raw tesseract output, here is the size of the data (only the last line of the wc output is shown):

$ wc /data/dtriac/dtriac-19d/all/*/tesseract-300dpi-20p.txt 
13958734  63579505 400100406 total
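
For anyone reproducing the count, here is a minimal sketch of one way to print only that last line. The helper name `wc_total` is hypothetical and not part of this repository; it assumes a POSIX shell with `wc` and `tail` available.

```shell
# Hypothetical helper (not part of dtriac-pipeline): print only the final
# line of wc's output for the given files.
wc_total() {
    # With two or more files, wc ends its output with a summary line holding
    # the summed line, word, and byte counts. With a single file there is no
    # summary line, so this prints that file's own counts instead.
    wc "$@" | tail -n 1
}

# Usage with the path from the comment above:
# wc_total /data/dtriac/dtriac-19d/all/*/tesseract-300dpi-20p.txt
```

By default wc reports lines, words, and bytes in that order, so the total above reads as roughly 14 million lines, 64 million words, and about 400 MB of text.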

Pipeline: