looneyapache closed this issue 2 years ago
Sorry, I don't quite understand what junk characters are. Can you give an example, such as "the input is blablabla and the expected output (result) is blablabla"?
Sure!
The following is the input text:
The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog).
Expected output (clean):
The quick brown fox jumps right over the lazy dog.
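For reference, here is a minimal sketch of what such a cleanup could look like. It assumes the text is plain English, so anything other than ASCII letters, digits, whitespace, and periods counts as junk. The name `removeJunkCharacters` is hypothetical and is not taken from the plugin.

```typescript
// Hypothetical helper, not part of the plugin.
// Assumption: only ASCII letters, digits, whitespace, and periods are wanted.
function removeJunkCharacters(text: string): string {
  return text
    .replace(/[^A-Za-z0-9\s.]/g, "") // delete every character outside the allow-list
    .replace(/ {2,}/g, " "); // collapse runs of spaces left behind by removed symbols
}

const sampleInput = "The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog).";
console.log(removeJunkCharacters(sampleInput));
// prints "The quick Brown fox jumps right over the lazy dog."
```

Note that the capital "B" in "Brown" survives, because the sketch only deletes characters. Lowercasing it, as in the expected output above, would need a separate step. Accented or non-Latin letters would also be deleted under this ASCII-only assumption.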
Oh no, what happened to the text? 😂
I think this feature is easy to implement, but I wonder in what circumstances you would encounter such embarrassing text?
It typically happens when I use OCR on old (hand-typed) documents, corrupted PDFs, or corrupted old dBase/FoxPro memo files. I think it's mainly because old documents are yellow and smudgy, which makes them hard to scan, so the OCR inserts characters of its own. :)
Here is one example I picked from Google. Some of my documents closely resemble it. :)
The pictures you provided are illegible even to me. What is the copied text like? My OCR result is below:
The preser.t rerort is Oric of 玉 numbr wiich dr:prr4lrex duri上i anJ 1945 fpr the Frreign poncnic 永ministration Lmembevs:st t.
the unitedi St:tes Tarift' Cornissint:. Orine to the desire of thr 我 、” Econoniy Aininistration to obt ir this matcri.1. !.xs prompt1y as pces1.o, the reports yere Yot revievvd by the Trri:: Connissien. A11 st:tenont.s o1 fagt or opinion in tese renorts CI &ttributithlp t; the. irciyilei Etaef nembers tho prraredi th.em. Th.:Y. 1l,洲以:rieinlt:itsed f conf idential u: of Goverrnent xgencivs, ut .•r(〉 noR brin.: Hdpnniis with the consent oi the For(imr. Eocnnic iuiri:.istrtior:.
If the copied text is similar to this, I don't think simply deleting junk characters can produce the expected text.
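To illustrate the point, here is a small sketch that applies a plain character filter to a fragment of the OCR result above. The filter (keep ASCII letters, digits, whitespace, and periods) and the name `stripJunk` are assumptions made for this example only.

```typescript
// Hypothetical filter for illustration: keeps ASCII letters, digits,
// whitespace, and periods, and deletes everything else.
function stripJunk(text: string): string {
  return text.replace(/[^A-Za-z0-9\s.]/g, "");
}

// Fragment copied from the OCR result above.
const ocrFragment = "the unitedi St:tes Tarift' Cornissint:.";
console.log(stripJunk(ocrFragment));
// prints "the unitedi Sttes Tarift Cornissint."
```

The symbols are gone, but the misrecognized letters remain, so the words are still wrong. Recovering the intended text would need spelling correction or a better scan, not just character deletion.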
Thank you for responding! :)
I am a newbie, and I posted the same issue on the Obsidian forum to ask for help ( https://forum.obsidian.md/t/replace-all-asterisks-in-a-given-file/35238/25?u=looney.apache ).
The solution was not as elegant as yours, but it works for me (at least for now).
I paste the text to be scrubbed into https://textcleaner.net/ and get back cleaned text as needed. Also, not all of my OCR results are in such bad shape. Most of them are good and only require some scrubbing to be useful.
I value your assistance. Thank you!
The webpage is fantastic; it supports a lot of configs. The only fly in the ointment is that it cannot be used in Obsidian.
Though it seems that I could refer to the code of textcleaner, I'm not sure whether there are copyright issues.
But I don't think adding so many configs in Obsidian is a good idea, since they would take up a lot of space. Such a dilemma 😂
True! Also, I've been using "Obsidian Text Format", and I've discovered that it's fantastic at fixing broken paragraphs.
Unquestionably one of the GOOD plugins.
Hi,
Your plugin works great!
Is there any possibility of a feature that removes junk characters from a paragraph or pasted text, leaving only the periods (".") where they are?
Thank you!