documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Email address containing three or more special characters (punctuation) is removed by the Docsplit.clean_text method #144

Open mraj-rpx opened 6 years ago

mraj-rpx commented 6 years ago

I have an email address in a PDF, such as mohan-ramanujam@gmail.com or mohan.raman.visal@gmail.com, and Docsplit.clean_text removes it. The corresponding line in text_cleaner.rb is line 81: (w[1...-1].scan(PUNCT).uniq.length >= 3) || @knowtheory, @jashkenas, @samuelclay: please provide your opinion on this.
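To make the report easier to reproduce, here is a minimal sketch of the check quoted above, run on its own outside Docsplit. The PUNCT definition below is an assumption (a plain POSIX punctuation class); the actual constant in text_cleaner.rb may differ. The helper name too_much_punctuation? is hypothetical and exists only for this illustration.

```ruby
# Assumed stand-in for the PUNCT constant in text_cleaner.rb;
# the real definition may differ.
PUNCT = /[[:punct:]]/

# Hypothetical helper isolating the quoted check: ignoring the first and
# last characters, does the word contain three or more distinct
# punctuation characters?
def too_much_punctuation?(word)
  word[1...-1].scan(PUNCT).uniq.length >= 3
end

email = "mohan-ramanujam@gmail.com"

# Distinct punctuation characters found inside the word.
p email[1...-1].scan(PUNCT).uniq   # => ["-", "@", "."]

# Three distinct characters, so the check flags the address.
p too_much_punctuation?(email)     # => true

# An address with only "@" and "." is not flagged by this check.
p too_much_punctuation?("mohan@gmail.com")   # => false
```

Under this assumed PUNCT, mohan.raman.visal@gmail.com contains only two distinct punctuation characters ("." and "@"), so this check alone would not flag it.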