LagradOst / QuickNovel

Android app for downloading novels
MIT License
1.05k stars, 62 forks

Permit stripping of SCRIPT tags when creating ebook #101

Open adityaravishankar opened 2 years ago

adityaravishankar commented 2 years ago

Some generated books have AdSense code within the book content. It renders as plain text in the middle of paragraphs.

Can we add a "Remove script tags" option below "Remove clutter" in settings? Either that, or strip these tags (along with the code inside) as part of the "Remove clutter" option itself.

Here is an example copied from my ebook reader:

“Okay, he will be the 16th person called up,” The man said after taking down the details of Grey

(adsbygoogle = window.adsbygoogle || []).push({});

“Okay sir” Martha turned and was about to leave with Grey

Blatzar commented 2 years ago

Where are you seeing these? Which source?

samlux04 commented 2 years ago

Here's an example https://azynovel.com/novel/martial-peak

Here's how it rendered in Moon+ Reader: [screenshot: Screenshot_20220716-142457_Moon+ Reader Pro]. Attached epub (remove the .zip extension): Martial Peak.epub.zip

Though it's rendered just fine in QuickNovel itself: [screenshot: Screenshot_20220716-144219_QuickNovel]

Blatzar commented 2 years ago

Probably fixed in https://github.com/LagradOst/QuickNovel/commit/fb053fb8d3b1f4c7f638db96ec65a7392738156d. Say if you see anything after the next update (you might need to regenerate the epub though).

samlux04 commented 2 years ago

Do we even need JavaScript in an epub? Why not remove all the non-text? AFAIK only EPUB 3 supports JavaScript, and Moon+ Reader follows the spec and runs JS. I tried other epub readers (Book Reader, AlReader, etc.). They don't run JavaScript, so no ad text shows up. IMO only official publishers would release an epub with JS, for whatever fancy features they need. Scrapers like this one don't need any fancy features. Clean text with an h3 chapter title is enough.

Blatzar commented 2 years ago

It's not a choice we made. No, we don't want or need any JS tags in our text; they just 'snuck' in when we scraped. It's simply an oversight, which will be fixed.

Why don't remove all the non text.

Unfortunately, it's not that simple to program something that recognizes what is proper text and what is not. I will play around a bit with a script-tag regex if this issue persists, however 🤷
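For illustration only, here is a minimal sketch of what such a script-tag regex could look like. It is written in Python with the standard library, it is not QuickNovel's code, and `SCRIPT_RE` and `strip_script_tags` are hypothetical names. It assumes the scraped HTML contains well-formed `<script>...</script>` pairs.

```python
import re

# Hypothetical pattern (an assumption, not taken from QuickNovel):
# matches an opening <script ...> tag, everything up to the nearest
# closing </script>, and the closing tag itself.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,  # DOTALL lets .*? span multiple lines
)

def strip_script_tags(html: str) -> str:
    """Remove <script> elements together with the code inside them."""
    return SCRIPT_RE.sub("", html)

sample = (
    "<p>first</p><br>"
    '<script type="text/javascript">\n'
    "(adsbygoogle = window.adsbygoogle || []).push({});\n"
    "</script>"
    "<p>second</p>"
)
cleaned = strip_script_tags(sample)
```

A regex like this is fragile on malformed markup, which is one reason an HTML parser is usually the safer tool for this job.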

samlux04 commented 2 years ago

I understand that. That's why we are very grateful for your work. I used to write simple Lua code to scrape a website with gumbo, plus string replacements here and there. But a good HTML parser does help a lot in excluding all the non-visible text.

samlux04 commented 2 years ago

Looking at the page source, it's actually inside a proper script tag. We can just select all the p tags.

<p>In those three years if you don’t break through, then you can either leave the school or be demoted to an experimental disciple.</p>
<br>
<ins class="adsbygoogle" style="display:block; text-align:center;" data-ad-layout="in-article" data-ad-format="fluid" data-ad-client="ca-pub-9952957309467784" data-ad-slot="2312483740"></ins>
<script type="96231c2a83138fcc394b9d19-text/javascript">
                            (adsbygoogle = window.adsbygoogle || []).push({});
                        </script>
<p>Experimental disciple is Kai Yang’s current status! He is also Sky Tower School’s shame!</p>
<br>
<p>Compared to normal disciples, their treatment is very different. Experimental disciples must provide for their food, shelter, clothing, for the outer court will no longer waste cultivating resources on these trashes. Once demoted to experimental disciples, you basically can never advance. Unless you manage to increase your cultivation level quickly. Only then will the school let you attempt to become a true disciple.</p>

Blatzar commented 2 years ago

It is likely already solved, but if you really don't want to risk any script stuff getting in, you are welcome to contribute. Just remember to keep the <br> tags intact, as those affect the text too.

samlux04 commented 2 years ago

I looked at the source. Since it uses jsoup, it's easy to remove the script elements. Maybe add a general cleanup after loadHtml in all providers: simply iterate over all elements and remove the script ones.
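The cleanup idea above could be sketched roughly as follows. This is a Python standard-library illustration, not jsoup and not QuickNovel's code, and `ScriptStripper` and `remove_scripts` are hypothetical names. It walks the markup, drops script elements with their contents, and re-emits everything else, so the <br> tags stay intact. HTML comments and doctype declarations are dropped as a side effect.

```python
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Re-emits the input HTML, skipping <script> elements and their code."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unconverted,
        # so they can be written back out exactly as they came in.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

def remove_scripts(html: str) -> str:
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)

sample = (
    "<p>one &amp; only</p>\n<br>\n"
    '<script type="x-text/javascript">\n'
    "(adsbygoogle = window.adsbygoogle || []).push({});\n"
    "</script>\n"
    "<p>two</p>"
)
cleaned = remove_scripts(sample)
```

With jsoup, the same effect would normally be achieved by selecting the script elements and removing them from the parsed document, rather than by re-serializing the markup by hand as done here.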
