Benature / obsidian-text-format

Format selected text in Obsidian.md
MIT License
187 stars 18 forks

Feature Request - Remove junk Characters from Text #23

Closed looneyapache closed 2 years ago

looneyapache commented 2 years ago

Hi,

Your plugin works great!

Is there any possibility of a feature that removes junk characters from a paragraph or pasted text, leaving only the periods (".") where they are?

Thank you!

Benature commented 2 years ago

Sorry, I don't quite understand what junk characters are. Can you give an example, such as "the input is blablabla and the expected output (result) is blablabla"?

looneyapache commented 2 years ago

Sure,

The following is the input text:

The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog).

Expected output (clean):

The quick brown fox jumps right over the lazy dog.
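For illustration, the clean-up in this example could be sketched roughly as follows. This is only a minimal sketch, not code from obsidian-text-format: the function name `removeJunkCharacters` is hypothetical, and it assumes plain English text where everything other than ASCII letters, digits, whitespace, and periods counts as junk.

```typescript
// Hypothetical helper, not part of the plugin.
// Assumption: only ASCII letters, digits, whitespace and periods are kept.
function removeJunkCharacters(text: string): string {
  return text
    // Delete every character that is not a letter, digit, whitespace or period.
    .replace(/[^A-Za-z0-9\s.]/g, "")
    // Collapse runs of spaces or tabs that may be left behind by the deletions.
    .replace(/[ \t]{2,}/g, " ");
}

const input = "The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog).";
console.log(removeJunkCharacters(input));
// prints "The quick Brown fox jumps right over the lazy dog."
```

Note that this sketch only deletes characters, so "Brown" keeps its capital letter; lowercasing it as in the expected output above would need a separate step.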

Benature commented 2 years ago

Oh no, what happened to the text? 😂

I think this feature is easy to implement, but I wonder in what circumstances you would encounter such messy text?

looneyapache commented 2 years ago

It typically happens when I use OCR on old (hand-typed) documents, corrupted PDFs, or old, corrupted dBase/FoxPro memo files. I think it's mainly because old documents are yellowed and smudged, which makes them hard to scan, so the OCR inserts characters on its own :)

looneyapache commented 2 years ago

Here is one example I picked from Google. Some of my documents closely resemble this example: bitonal-doc

Benature commented 2 years ago

The picture you provided is illegible even to me. What does the copied text look like? My OCR result is below:

The preser.t rerort is Oric of 玉 numbr wiich dr:prr4lrex duri上i anJ 1945 fpr the Frreign poncnic 永ministration Lmembevs:st t.

the unitedi St:tes Tarift' Cornissint:. Orine to the desire of thr 我 、” Econoniy Aininistration to obt ir this matcri.1. !.xs prompt1y as pces1.o, the reports yere Yot revievvd by the Trri:: Connissien. A11 st:tenont.s o1 fagt or opinion in tese renorts CI &ttributithlp t; the. irciyilei Etaef nembers tho prraredi th.em. Th.:Y. 1l,洲以:rieinlt:itsed f conf idential u: of Goverrnent xgencivs, ut .•r(〉 noR brin.: Hdpnniis with the consent oi the For(imr. Eocnnic iuiri:.istrtior:.

If the copied text is similar to this, I don't think simply deleting junk characters will produce the expected text.

looneyapache commented 2 years ago

Thank you for responding! :)

I am a newbie, and I posted the same issue on the Obsidian forum to ask for help (https://forum.obsidian.md/t/replace-all-asterisks-in-a-given-file/35238/25?u=looney.apache).

The solution is not as elegant as yours, but it works for me (at least for now).

I paste the text to be scrubbed into https://textcleaner.net/ and get back cleaned text as needed. Also, not all of my OCR results are in such bad shape. Most of them are good and only require some scrubbing to be useful.

I value your assistance. Thank you!

Benature commented 2 years ago

The webpage is fantastic; it supports a lot of configs. The only fly in the ointment is that it cannot be used in Obsidian.

Though it seems that I could refer to the code of textcleaner, I'm not sure whether there would be copyright issues.

But I don't think adding so many configs to Obsidian is a good idea, since they would take up a huge amount of space. Such a dilemma 😂

looneyapache commented 2 years ago

True! Also, I've been using "Obsidian Text Format" and I've discovered that it's fantastic at fixing broken paragraphs.

Unquestionably one of the GOOD plugins.