dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0

Novelupdates: Fix this source #2241

Open greenaar opened 7 months ago

greenaar commented 7 months ago

Yes

Novel URL: Any on Novelupdates, e.g. https://www.novelupdates.com/series/yigret
App Location: PIP
App Version: v3.4.0

Describe this issue

This source was working earlier today with this version of the application and stopped working within the last 8 hours. Below is the debug output from scraping the example series:

Namespace(log=3, log_file=None, list_sources=False, crawler=[], novel_page='https://www.novelupdates.com/series/yigret', query=None, login=None, output_formats=[], add_source_url=False, single=True, multi=False, output_path=None, filename=None, filename_only=True, force=False, ignore=False, all=True, first=None, last=None, page=None, range=None, volumes=None, chapters=None, proxy_file=None, auto_proxy=False, bot=None, shard_id=0, shard_count=1, selenium_grid=None, suppress=True, ignore_images=False, close_directly=False, extra={})
16:49:04 [DEBUG] (lncrawl.core.sources) Loading current index data from /root/.lncrawl/sources/_index.json
16:49:04 [DEBUG] (lncrawl.core.sources) Current index was already downloaded once
16:49:04 [DEBUG] (lncrawl.core.sources) Saving current index data to /root/.lncrawl/sources/_index.json
16:49:04 [DEBUG] (lncrawl.core.sources) Saving current index data to /root/.lncrawl/sources/_index.json

➡ Press Ctrl + C to exit

16:49:04 [INFO] (lncrawl.core.app) Initialized App
16:49:04 [INFO] (lncrawl.bots.console.integration) Detected URL input
16:49:04 [INFO] (lncrawl.core.sources) Initializing crawler for: https://www.novelupdates.com/ [/data/python/lncrawl/lib/python3.10/site-packages/sources/multi/novelupdates.py]
Retrieving novel info...
16:49:04 [DEBUG] (lncrawl.core.scraper) [GET] https://www.novelupdates.com/series/yigret timeout=(7, 301), allow_redirects=True, proxies={}, headers={b'Origin': b'https://www.novelupdates.com', b'Referer': b'https://www.novelupdates.com/', b'User-Agent': b'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:105.0) Gecko/20100101 Firefox/105.0'}
16:49:04 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): www.novelupdates.com:443
16:49:04 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "GET /series/yigret HTTP/1.1" 301 None
16:49:04 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "GET /series/yigret/ HTTP/1.1" 200 None
16:49:04 [DEBUG] (lncrawl.core.scraper) [POST] https://www.novelupdates.com/wp-admin/admin-ajax.php data={'action': 'nd_getchapters', 'mygrr': '1', 'mypostid': '46784'}, allow_redirects=True, proxies={}, headers={b'Content-Type': b'application/x-www-form-urlencoded; charset=UTF-8', b'Origin': b'https://www.novelupdates.com', b'Referer': b'https://www.novelupdates.com/series/yigret', b'User-Agent': b'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:105.0) Gecko/20100101 Firefox/105.0'}
16:49:04 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): www.novelupdates.com:443
16:49:05 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 None

❗ Error: No chapters found <class 'Exception'>
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 107, in start
    raise e
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 101, in start
    _download_novel()
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 85, in _download_novel
    self.app.get_novel_info()
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/core/app.py", line 137, in get_novel_info
    raise Exception("No chapters found")

16:49:05 [INFO] (lncrawl.core.app) App destroyed
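For reference, the failing request can be replayed by hand with a short sketch like the one below, reusing the parameters captured in the debug log (the mypostid value 46784 is specific to this example series). If the response contains no chapter links, the endpoint presumably now expects a logged-in session:

    # Replays the nd_getchapters POST from the debug log above.
    # mypostid=46784 is the WordPress post id of this example series.
    import requests

    resp = requests.post(
        "https://www.novelupdates.com/wp-admin/admin-ajax.php",
        data={"action": "nd_getchapters", "mygrr": "1", "mypostid": "46784"},
        headers={
            "Origin": "https://www.novelupdates.com",
            "Referer": "https://www.novelupdates.com/series/yigret",
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:105.0) Gecko/20100101 Firefox/105.0",
        },
        timeout=30,
    )
    print(resp.status_code, len(resp.text))  # a near-empty body means no chapter data came back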

greenaar commented 7 months ago

Applying PR #2242 locally fixed the issue for one day, but it now reoccurs - I take that to mean the value has changed again.

picnixz commented 7 months ago

Ok, I had a quick look yesterday and fixed my local installation as follows:

The code to inject is the following: in lncrawl/templates/novelupdates.py (I leave it to you, reader, to find it on your system), add the following lines inside the initialize method:

class NovelupdatesTemplate(SearchableBrowserTemplate, ChapterOnlyBrowserTemplate):

    # (omitted)

    def initialize(self):
        # BEGIN PATCH
        import browser_cookie3

        prefixes = ('wordpress_logged_in_', 'wordpress_sec_')
        for cookie in browser_cookie3.load(domain_name='www.novelupdates.com'):
            if any(map(cookie.name.startswith, prefixes)):
                self.set_cookie(cookie.name, cookie.value)
        # END OF PATCH

        self.init_executor(
            workers=4,
        )

In addition, you need to install browser-cookie3 as an additional Python dependency via pip (e.g., python -m pip install browser-cookie3, replacing python with whatever Python executable you are using).
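Before touching lncrawl at all, you can sanity-check that browser_cookie3 actually sees the NU session cookies in your browser profile with a standalone sketch such as the following (assuming you are logged in to NU in at least one browser):

    # Standalone check, run outside lncrawl: list the NU session cookies
    # that browser_cookie3 can find in your browser profiles.
    import browser_cookie3

    prefixes = ("wordpress_logged_in_", "wordpress_sec_")
    for cookie in browser_cookie3.load(domain_name="www.novelupdates.com"):
        if cookie.name.startswith(prefixes):
            print(cookie.name)  # should print both cookie names if you are logged in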

If you don't trust this process (which is debatable from a security point of view), you can instead use the following approach, which is recommended if you want to be sure that the cookies are the correct ones (see below for an explanation):

class NovelupdatesTemplate(SearchableBrowserTemplate, ChapterOnlyBrowserTemplate):

    # (omitted)

    def initialize(self):
        # BEGIN PATCH
        import os

        wp_cookie_logged_in = os.getenv('NU_WORDPRESS_LOGGED_IN', ':')
        data = wp_cookie_logged_in.split(':', maxsplit=1)
        if len(data) == 2 and all(data):
            self.set_cookie(data[0], data[1])

        wp_cookie_sec = os.getenv('NU_WORDPRESS_SEC', ':')
        data = wp_cookie_sec.split(':', maxsplit=1)
        if len(data) == 2 and all(data):
            self.set_cookie(data[0], data[1])
        # END OF PATCH

        self.init_executor(
            workers=4,
        )

Now, every time you use lncrawl from your command line, you need to do the following:

  1. Open any NU page in your favorite browser and make sure you are logged in to your NU account.

  2. Open the console (F12 or CTRL+SHIFT+I usually) and go to the "Storage" tab. There you'll find the session cookies. Find those stored under "www.novelupdates.com" and look for the two cookies whose names start with:

    • wordpress_logged_in*
    • wordpress_sec*

    Note: you might have multiple similarly named cookies (e.g., several that start with wordpress_logged_in or wordpress_sec); pick the one that was last accessed.

  3. Let's assume that the cookies are named wordpress_logged_in_FOO and wordpress_sec_BAR and that their values are THE_FOO and THE_BAR respectively. Then, you should run lncrawl as follows:

# Unix-based
export NU_WORDPRESS_LOGGED_IN="wordpress_logged_in_FOO:THE_FOO"
export NU_WORDPRESS_SEC="wordpress_sec_BAR:THE_BAR"
lncrawl [...]
# Windows-based
SET NU_WORDPRESS_LOGGED_IN=wordpress_logged_in_FOO:THE_FOO
SET NU_WORDPRESS_SEC=wordpress_sec_BAR:THE_BAR
lncrawl [...]

Note that in both cases, you need a colon (:) to separate the cookie name from its value. I haven't actually looked at how the suffix of each cookie is generated, but it's likely a hash based on your username, so I cannot predict it for you.
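If you are unsure whether the variables were exported correctly, a quick sketch like this mirrors the name:value parsing done by the patch above and tells you whether each variable looks usable:

    # Mirrors the parsing in the patched initialize(): the value must be
    # the cookie name, a single colon, then the raw cookie value.
    import os

    for var in ("NU_WORDPRESS_LOGGED_IN", "NU_WORDPRESS_SEC"):
        name, _, value = os.getenv(var, "").partition(":")
        print(var, "->", "OK" if name and value else "missing or malformed")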

@dipu-bd Which approach would you suggest? I don't mind making a PR for either solution, possibly exposing the env-based approach to the CLI for the sake of usability (I don't actually know whether the infrastructure for handling such things is already present, so feel free to argue).

UPDATE

I encountered a little issue with my patch where browser_cookie3 does not select the correct cookies (I have multiple browsers and versions). So, you can replace

        for cookie in browser_cookie3.load(domain_name='www.novelupdates.com'):

by

        cookie_file = None
        for cookie in browser_cookie3.chrome(cookie_file, domain_name='www.novelupdates.com'):

or

        cookie_file = None
        for cookie in browser_cookie3.firefox(cookie_file, domain_name='www.novelupdates.com'):

depending on whether you are on Chrome or Firefox. If the cookies are not correctly found (you'll see it directly if the chapters cannot be queried), either my patch needs to be updated (feel free to ping me if no PR has been made) or you need to manually select the cookie file. Since maintaining this approach might be quite difficult for people unfamiliar with it, I'd suggest using the second technique, where you specify the cookies directly as environment variables.
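That said, if you still want the file-based approach and automatic detection keeps picking the wrong profile, one more option (a sketch only; the path below is hypothetical and must be replaced with your own profile's cookies.sqlite) is to pass the cookie database to browser_cookie3 explicitly in the same place:

        # hypothetical path; point this at your own Firefox profile's cookies.sqlite
        cookie_file = '/home/me/.mozilla/firefox/abcd1234.default-release/cookies.sqlite'
        for cookie in browser_cookie3.firefox(cookie_file, domain_name='www.novelupdates.com'):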

blarghbl123 commented 7 months ago

How about this, is there a way to have LNCrawl parse a local file instead of NovelUpdates.com directly?

When you are logged into NovelUpdates and looking at a series root page, under "Latest Release" there is a hamburger button called "Chapter Listing".

This generates a new frame that can be saved locally as an HTML file containing the NovelUpdates redirect links to all the chapters, in the form "https://www.novelupdates.com/extnu/7871444/".

e.g.

</li><li class="sp_li_chp odd"><a title="Go to chapter page" href="https://www.novelupdates.com/nu_goto_chapter.php?sid=58110&amp;rid=7833141"><i class="fa fa-reply fa-rotate-180 fa-flip-horizontal fn" aria-hidden="true"></i></a><a href="https://www.novelupdates.com/extnu/7833141/" data-id="7833141"><span title="v13c18">v13c18</span>

These extnu links do not require a NovelUpdates login to use. So if this local file could be parsed instead, one wouldn't need to go through the browser-cookie/session-cookie methods.
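As an illustration of what parsing such a saved file could look like (a sketch only, not something lncrawl supports out of the box; the file name is just an example):

    # Sketch: extract the extnu redirect links from a locally saved
    # "Chapter Listing" page (saved while logged in to NU).
    from bs4 import BeautifulSoup

    with open("chapter_listing.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    for a in soup.select('a[href*="/extnu/"]'):
        print(a["href"], a.get_text(strip=True))  # e.g. https://www.novelupdates.com/extnu/7833141/ v13c18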

picnixz commented 7 months ago

The "Chapter Listing" requires you to be logged in (try it in private browsing and you won't have it). Also, that's how I figured out that we needed the cookies. I haven't looked have other API endpoints though.

If, however, you want to manually save this local file every time you want to crawl something, then yes, your method can work, but to automate it we need those session cookies.

blarghbl123 commented 7 months ago

Ok got it, for automation that makes total sense. Just for me, in the meantime, as a workaround (especially for people not well versed enough to handle your very detailed comment on extracting the session cookie reliably in all environments), it seemed a lot easier to do the manual process of logging in to NU, dumping the "Chapter Listing" to a local HTML file, and then having LNCrawl process this file with a slightly modified novelupdates template, than to go through the manual steps of digging out a browser session cookie each time. Also, unfortunately I really dislike adding additional browser extensions, plus I hate firefox... sorry... don't kill me. Plus I feel this is also applicable to a number of sites that require logins: being able to parse a locally saved URL file with a slightly modified template. I suppose, though, that this creates the headache of doubling the templates used if the URL file doesn't match the expected output, as in NU's case of using "Chapter Listing" vs the series root URL path.

Does LNCrawl perhaps already have a built-in method of handling a local HTML file and just dumping any HTML links it finds in it to LNCrawl output? Sometimes I've had the idea of using LNCrawl as a general tool to crawl a generic website (not a WN/LN site, for instance) at a spider depth of 0 to 3 and output it to PDF.

greenaar commented 7 months ago

@picnixz I followed your steps, to be specific the second set using environment variables. Worked perfectly, and the instructions were clear enough for me to be able to get the cookies out without needing any extra tools. I like the second option better than the first, since I don't run this on a machine with a browser.

picnixz commented 7 months ago

@blarghbl123

Also unfortunately I really dislike adding additional browser extensions, plus I hate firefox... sorry... don't kill me.

Don't worry (actually, what I suggested does not need any browser extension)

Plus I feel this is also applicable to a number of sites that require logins: being able to parse a locally saved URL file with a slightly modified template. I suppose, though, that this creates the headache of doubling the templates used if the URL file doesn't match the expected output, as in NU's case of using "Chapter Listing" vs the series root URL path.

IIRC, it's possible since LNC supports specifying chapter URLs; however, I'm not sure the crawling method would be the same. When using NU, the crawler seems 'generic' and thus handles sites for which the crawling may otherwise fail. I haven't dug enough into the codebase, so someone who worked on that would be a better person to ask.

Does LNCrawl perhaps already have a built-in method of handling a local HTML file and just dumping any HTML links it finds in it to LNCrawl output?

You can always make a script to pass the URLs to --chapters but as I said, I'm not sure that it works.
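Something like the following sketch is the idea (untested; whether --chapters accepts raw URLs this way is exactly what is unclear, and the file name is just an example):

    # Untested sketch: feed the URLs extracted from the saved "Chapter Listing"
    # page to lncrawl's --chapters option.
    import subprocess

    with open("chapter_urls.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    subprocess.run(["lncrawl", "--chapters", *urls], check=True)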

@greenaar Happy that it worked for you. I still think that the second one is a better approach because it's a pain to locate the correct cookie file... (and I'm not even sure that my file-based solution perfectly works).

blarghbl123 commented 7 months ago

@picnixz ah ok, thank you very much for the suggestions and advice.

Barvely commented 6 months ago

@picnixz I am a newbie at Python. I installed the lncrawl module in PyCharm using pip and tried both the browser-cookie patch and your other suggestion one at a time, but neither worked. With the browser-cookie patch, lncrawl opened a Chrome window (how does it open a Chrome window without my extensions or bookmarks?) and then closed itself with no chapters found, even though I specifically used your browser_cookie3.firefox line.

I don't understand how the SET command works either. I tried typing the commands in the PyCharm terminal, then opened lncrawl, but it similarly failed with the same result as above.

Also, is it possible to change which browser lncrawl opens by default? Thanks in advance.

picnixz commented 6 months ago

The default browser seems to always be google-chrome when it needs a browser. I've actually run into similar issues, but I'll summarize them this evening (ping me tomorrow if I forget). I'll try to come up with a fix to land on master, since I think NU must be supported correctly.

Barvely commented 5 months ago

@picnixz Is it possible to give more detailed instructions on how to make NU work?

picnixz commented 5 months ago

I don't use it on Windows since I'm on Linux, so I cannot really help with that OS. Now, what I know is:

I don't understand how the SET command works either. I tried typing the commands in the PyCharm terminal, then opened lncrawl, but it similarly failed with the same result as above.

Here are some better instructions (with visuals).

(screenshots attached)

Barvely commented 5 months ago

I added the print function in the code, and the '1234' does appear in the terminal.

With the fix of using your firefox-specific lines:

        cookie_file = None
        for cookie in browser_cookie3.firefox(cookie_file, domain_name='www.novelupdates.com'):

It simply opens a Chrome window without any extensions or cookies installed, executes code from there, and fails. I assume there's some code in lncrawl that only opens a fresh Chrome window and never the Firefox I'm actually using, so the cookie grab never happens.

However, this time I tried the Chrome code as well (and logged in to NU in Chrome), with the line

    for cookie in browser_cookie3.chrome(cookie_file, domain_name='www.novelupdates.com'):

It gave me this error in the terminal and didn't open a Chrome window:

    could not detect version_main.therefore, we are assuming it is chrome 108 or higher

The manual cookie grab opens a blank chrome window and fails like firefox. My version of lncrawl is 3.5.0, as checked by lncrawl -v.

picnixz commented 5 months ago

With the fix of using your firefox-specific lines

Don't use this patch; try the patch with environment variables only. It appears that browser_cookie3 misbehaves sometimes. Also, the Chrome window opening issue is a separate one, so let's first fix the chapters issue.

Barvely commented 5 months ago

I think I tried every fix that you proposed. I ultimately let the extension-less, cookie-less Chrome browser open the source novel, ran with and without your proposed fixes in the code (to test whether there was any difference), quickly logged in manually in a new tab, then refreshed the original novel page, and this captured all the links. A new browser for woopread appeared and ran through the links slowly, and my choice of epub conversion worked. I'm pretty sure this means the chapter capture is correct.

I think somewhere along the way, lncrawl either ignores my default browser or your cookie fixes and defaults to opening a Chrome browser with no cookies. I'll try updating whatever software I can and see if things change in the meantime.

Edit: I tried restarting my PC, creating a new project, and installing everything again; it didn't work. It still calls up the extension-less/cookie-less Chrome browser.

picnixz commented 3 months ago

My patch no longer works, but I may have a solution. I suspect that NU updated something, possibly after observing what I suggested as a workaround, and thus I will not post any solution publicly anymore (nor privately). Sorry.

Barvely commented 1 month ago

My patch no longer works, but I may have a solution. I suspect that NU updated something, possibly after observing what I suggested as a workaround, and thus I will not post any solution publicly anymore (nor privately). Sorry.

Oh, I didn't even realize there was a new update from you. To be honest, your code never worked for me, probably because I'm using Windows, so I have been manually entering the login details after lncrawl calls up a new browser window. Lately even this method fails, because the Chrome window doesn't even get called up, along with various other exceptions in lncrawl. Thanks for the help in any case.