Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

pdf to text experiments #81

Closed by schwittlick 8 years ago

schwittlick commented 8 years ago

some info here: http://www.howtogeek.com/228531/how-to-convert-a-pdf-file-to-editable-text-using-the-command-line-in-linux/

the PDFs are here: /mnt/drive1/data/eco/pdf/

transfluxus commented 8 years ago

always use the format: "author-first"_"author-last"-"title_separate_words"
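A minimal sketch of that naming scheme, in case it helps keep file names consistent. The helper name `eco_filename` is hypothetical, and lowercasing plus dropping punctuation are assumptions inferred from the one file name that appears in this thread (warren_sack-network_aesthetics.pdf), not a stated rule.

```python
import re


def eco_filename(first, last, title, ext=".pdf"):
    """Build a name like warren_sack-network_aesthetics.pdf.

    Hypothetical helper: lowercasing and stripping punctuation are
    assumptions, not something the naming rule spells out.
    """

    def slug(text):
        # Keep runs of letters and digits, join them with underscores.
        return "_".join(re.findall(r"[a-z0-9]+", text.lower()))

    return f"{slug(first)}_{slug(last)}-{slug(title)}{ext}"
```

For example, `eco_filename("Warren", "Sack", "Network Aesthetics")` gives a name in the same shape as the file used further down in this thread.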

schwittlick commented 8 years ago

pdftotext & pdftohtml have some problems:

  1. they add a line break after every line of the PDF layout, even where the sentence simply continues
  2. footnotes and page numbering can't be kept out of the output

possible workaround:

  1. the line breaks shouldn't be a problem if the entire text is simply joined into one long line, without line breaks
  2. before putting it all in one line, the footnotes and the many stray numbers should be removed somehow, ideally automatically
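The join-into-one-line idea above could look roughly like this. It is only a sketch: the function name `flatten` and the rule "a line made of 1 to 4 digits is a page number" are assumptions, and it makes no attempt to detect footnotes, which would need a per-document heuristic.

```python
import re

# Assumption: a line holding nothing but a short number is a page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def flatten(text):
    """Drop page-number-only lines, then join the rest into one long line."""
    kept = []
    # splitlines() also breaks on the form feed that separates pages.
    for line in text.splitlines():
        if PAGE_NUMBER.match(line):
            continue
        kept.append(line)
    # split() with no arguments collapses every run of whitespace.
    return " ".join(" ".join(kept).split())
```

Usage would be something like `flatten(open(txt_path, encoding="utf-8").read())` on a file produced by pdftotext. Numbers that sit inside a sentence are left alone, so footnote markers and in-text numbering still need a separate pass.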

how to use:

pdftotext /mnt/drive1/data/eco/pdf/warren_sack-network_aesthetics.pdf /mnt/drive1/data/eco/txt/warren_sack-network_aesthetics.txt
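To run the same command over every PDF in the folder, a small wrapper might do. This is a sketch under assumptions: the function names are hypothetical, the txt directory is taken from the command above, and pdftotext is assumed to be on the PATH. It only uses the plain `pdftotext input.pdf output.txt` form shown in this thread.

```python
import subprocess
from pathlib import Path

# Directories as they appear in this thread.
PDF_DIR = Path("/mnt/drive1/data/eco/pdf")
TXT_DIR = Path("/mnt/drive1/data/eco/txt")


def pdftotext_command(pdf_path, txt_dir=TXT_DIR):
    """Return the argument list for converting one PDF to a same-named .txt."""
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def convert_all(pdf_dir=PDF_DIR, txt_dir=TXT_DIR):
    """Convert every *.pdf in pdf_dir; raises if pdftotext exits with an error."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(pdftotext_command(pdf_path, txt_dir), check=True)
```

Calling `convert_all()` would then write one .txt per PDF into the txt folder, keeping the file names unchanged apart from the extension.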