greenaar opened 9 months ago
Applying PR #2242 locally fixed the issue for a day, but it has now reoccurred - I take that to mean the value has changed again.
Ok, I had a quick look yesterday and fixed my installation locally as follows:
The code to inject is the following. In `lncrawl/templates/novelupdates.py` (I leave it to you, reader, to find it on your system), add the following lines inside the `initialize` method:
```python
class NovelupdatesTemplate(SearchableBrowserTemplate, ChapterOnlyBrowserTemplate):
    # (omitted)

    def initialize(self):
        # BEGIN PATCH
        import browser_cookie3

        prefixes = ('wordpress_logged_in_', 'wordpress_sec_')
        for cookie in browser_cookie3.load(domain_name='www.novelupdates.com'):
            if any(map(cookie.name.startswith, prefixes)):
                self.set_cookie(cookie.name, cookie.value)
        # END OF PATCH
        self.init_executor(
            workers=4,
        )
```
In addition, you need to install `browser-cookie3` as an additional Python dependency via pip (e.g., `python -m pip install browser-cookie3`, replacing `python` with whatever Python executable you are using).
If you don't trust this process (which is debatable from a security point of view), you can instead do the following manipulation, which I recommend if you want to be sure that the cookies are the correct ones (see below for an explanation):
```python
class NovelupdatesTemplate(SearchableBrowserTemplate, ChapterOnlyBrowserTemplate):
    # (omitted)

    def initialize(self):
        # BEGIN PATCH
        import os

        wp_cookie_logged_in = os.getenv('NU_WORDPRESS_LOGGED_IN', ':')
        data = wp_cookie_logged_in.split(':', maxsplit=1)
        if len(data) == 2 and all(data):
            self.set_cookie(data[0], data[1])
        wp_cookie_sec = os.getenv('NU_WORDPRESS_SEC', ':')
        data = wp_cookie_sec.split(':', maxsplit=1)
        if len(data) == 2 and all(data):
            self.set_cookie(data[0], data[1])
        # END OF PATCH
        self.init_executor(
            workers=4,
        )
```
Now, every time you use lncrawl from your command line, you need to do the following:

1. Open any NU page in your favorite browser while connected to your account on NU.
2. Open the developer console (F12 or CTRL+SHIFT+I usually) and go to the "Storage" tab. There you'll find the different session cookies. Find those stored under "www.novelupdates.com" and look for the two cookies whose names start with `wordpress_logged_in_` and `wordpress_sec_`.
Note: you might have multiple cookies named similarly (like several cookies that start with `wordpress_logged_in` or `wordpress_sec`). Pick the one that was last accessed.
Let's assume that the cookies are named `wordpress_logged_in_FOO` and `wordpress_sec_BAR` and that their values are `THE_FOO` and `THE_BAR` respectively. Then, you should run `lncrawl` as follows:
```sh
# Unix-based
export NU_WORDPRESS_LOGGED_IN="wordpress_logged_in_FOO:THE_FOO"
export NU_WORDPRESS_SEC="wordpress_sec_BAR:THE_BAR"
lncrawl [...]

# Windows-based (cmd.exe; do not quote the value, or the quotes become part of it)
SET NU_WORDPRESS_LOGGED_IN=wordpress_logged_in_FOO:THE_FOO
SET NU_WORDPRESS_SEC=wordpress_sec_BAR:THE_BAR
lncrawl [...]
```
Note that in both cases, you need a colon (`:`) to separate the cookie name from its value. I haven't actually looked at how the suffix of each cookie is generated, but it's likely a hash based on your username, so I cannot predict it for you.
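As a tiny illustration of the expected format (hypothetical values), the patch above splits on the first colon only, so any colons that happen to appear inside the value itself are preserved:

```python
# The env-based patch splits "NAME:VALUE" on the first colon only.
raw = "wordpress_logged_in_FOO:THE_FOO"  # hypothetical NU_WORDPRESS_LOGGED_IN value
name, value = raw.split(':', maxsplit=1)
assert (name, value) == ("wordpress_logged_in_FOO", "THE_FOO")
```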
@dipu-bd Which approach would you suggest? I don't mind making a PR for either solution, possibly exposing the env-based approach to the CLI for the sake of usability (I don't actually know whether the infrastructure for handling such things is already present, so feel free to argue).
UPDATE
I encountered a little issue with my patch where `browser_cookie3` does not select the correct cookies (I have multiple browsers and versions). So, you can replace

```python
for cookie in browser_cookie3.load(domain_name='www.novelupdates.com'):
```

by

```python
cookie_file = None
for cookie in browser_cookie3.chrome(cookie_file, domain_name='www.novelupdates.com'):
```

or

```python
cookie_file = None
for cookie in browser_cookie3.firefox(cookie_file, domain_name='www.novelupdates.com'):
```

depending on whether you are on Chrome or Firefox. If the cookies are not correctly found (you'll see it directly, since the chapters cannot be queried), either my patch needs to be updated (feel free to ping me if no PR has been made) or you need to select the cookie file manually. Since maintaining this approach might be quite difficult for people without the background, I'd suggest the second technique, where you specify the cookies directly as environment variables.
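If you still want to keep the automatic route but make it a bit more forgiving, here is a rough sketch (not tested on every setup) of a fallback chain that tries the browser-specific loaders before the generic one; adjust the order to whichever browser actually holds your NU session:

```python
# Sketch of a fallback for the patch above: try browser-specific loaders first,
# then the generic one. The broad `except Exception` is deliberate, since a
# loader fails whenever that browser or its cookie store is absent.
import browser_cookie3


def load_nu_cookies(domain='www.novelupdates.com'):
    loaders = (browser_cookie3.firefox, browser_cookie3.chrome, browser_cookie3.load)
    for loader in loaders:
        try:
            cookies = list(loader(domain_name=domain))
        except Exception:
            continue
        if cookies:
            return cookies
    return []
```

Inside `initialize`, you would then iterate over `load_nu_cookies()` instead of `browser_cookie3.load(...)` and keep the same prefix filter.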
How about this: is there a way to have LNCrawl parse a local file instead of NovelUpdates.com directly?
When you are logged into NovelUpdates and looking at a series root under "Latest Release", there is a hamburger button called "Chapter Listing".
This generates a new frame that can be saved locally as an HTML file containing the NovelUpdates redirect links to all the chapters, in the form "https://www.novelupdates.com/extnu/7871444/".
e.g.

```html
</li><li class="sp_li_chp odd"><a title="Go to chapter page" href="https://www.novelupdates.com/nu_goto_chapter.php?sid=58110&rid=7833141"><i class="fa fa-reply fa-rotate-180 fa-flip-horizontal fn" aria-hidden="true"></i></a><a href="https://www.novelupdates.com/extnu/7833141/" data-id="7833141"><span title="v13c18">v13c18</span>
```
These extnu links do not require a NovelUpdates login to use. So if this local file could be parsed instead, one wouldn't need to go through the browser_cookie3 / session-cookie methods.
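Purely to illustrate that idea (this is not something lncrawl does today), here is a small sketch that reads a locally saved "Chapter Listing" page and lists the login-free extnu redirect links together with their chapter labels; the file name is hypothetical and the selector only relies on the markup shown in the snippet above:

```python
# Sketch: list the /extnu/ redirect links (and labels such as "v13c18") from a
# locally saved "Chapter Listing" page. The file name is hypothetical.
from bs4 import BeautifulSoup


def list_chapters(path='chapter_listing.html'):
    with open(path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    chapters = []
    for a in soup.select('a[href^="https://www.novelupdates.com/extnu/"]'):
        chapters.append((a.get_text(strip=True), a['href']))
    return chapters


if __name__ == '__main__':
    for label, url in list_chapters():
        print(label, url)
```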
The "Chapter Listing" requires you to be logged in (try it in private browsing and you won't have it). Also, that's how I figured out that we needed the cookies. I haven't looked have other API endpoints though.
If however you want to manually write this local file everytime you want to crawl something, then yes your method can work but to automate it we need those session cookies.
Ok got it, for automation that makes total sense. For me in the meantime, as a workaround (especially for people not well versed enough to follow your very detailed comment on extracting the session cookie reliably in all environments), it just seemed a lot easier to do the manual process of logging in to NU, dumping the "Chapter Listing" to a local HTML file, and having LNCrawl process this file with a slightly modified novelupdates template, than to go through the manual steps of digging out a browser session cookie each time. Also unfortunately I really dislike adding additional browser extensions, plus I hate firefox... sorry... don't kill me. Plus I feel this is also applicable to a number of sites that require logins: being able to parse a locally saved HTML file with a slightly modified template. I suppose, though, that creates the headache of doubling the templates used if the saved file doesn't match the expected output, as in NU's case of using "Chapter Listing" vs the series root URL path.
Does LNCrawl perhaps already have a built-in method of handling a local HTML file and just dumping any HTML links it finds to LNCrawl output? Sometimes I've had the idea of using LNCrawl as a general tool to crawl a generic website (not a WN/LN site) at 0 to 3 levels of spider depth and output to PDF.
@picnixz I followed your steps, to be specific the second set using environment variables. Worked perfectly, and the instructions were clear enough for me to be able to get the cookies out without needing any extra tools. I like the second option better than the first, since I don't run this on a machine with a browser.
@blarghbl123
> Also unfortunately I really dislike adding additional browser extensions, plus I hate firefox... sorry... don't kill me.
Don't worry (actually, what I suggested does not need any browser extension).
> Plus I feel this then is also applicable to a number of sites that require logins. Being able to parse a local saved URL file with a slightly modified template. I suppose though that creates the headache of having doubled the templates used if the URL file doesn't match the expected output like in NU's case of using "Chapter Listing" vs series root URL path.
IIRC, it's possible, since LNC supports specifying chapter URLs; however, I'm not sure the crawling method would be the same. When using NU, the crawler seems 'generic' and thus has to handle sites for which the crawling may fail. I haven't dug enough into the codebase, so someone who has worked on that part would know better.
> Does LNCrawl already perhaps have a built-in method of handling a local HTML file and just dumping any HTML links it finds in it to LNCrawl output
You can always make a script to pass the URLs to `--chapters`, but as I said, I'm not sure that it works.
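For what it's worth, a rough sketch of such a script is below. It assumes that `--chapters` accepts chapter URLs (as suggested here) and that the novel page is passed with `-s`; both are assumptions to check against `lncrawl --help`, and I haven't verified that the NU crawler accepts chapters supplied this way:

```python
# Sketch: feed the /extnu/ links from a saved "Chapter Listing" page to lncrawl.
# Usage: python feed_chapters.py <novel-page-url> <saved-listing.html>
# The -s and --chapters usage here is an assumption; check `lncrawl --help`.
import re
import subprocess
import sys


def extract_extnu_links(path):
    with open(path, encoding='utf-8') as f:
        html = f.read()
    links = re.findall(r'href="(https://www\.novelupdates\.com/extnu/\d+/)"', html)
    return list(dict.fromkeys(links))  # de-duplicate while keeping order


if __name__ == '__main__':
    novel_url, saved_page = sys.argv[1], sys.argv[2]
    subprocess.run(
        ['lncrawl', '-s', novel_url, '--chapters', *extract_extnu_links(saved_page)],
        check=True,
    )
```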
@greenaar Happy that it worked for you. I still think that the second one is a better approach because it's a pain to locate the correct cookie file... (and I'm not even sure that my file-based solution perfectly works).
@picnixz ah ok, thank you very much for the suggestions and advice.
@picnixz I am a Python newbie. I installed the lncrawl module in PyCharm using pip and tried both the browser-cookie approach and your env-variable suggestion, one at a time, but neither worked out. With the browser-cookie patch, lncrawl opened a Chrome window (how does it open a Chrome window without my extensions or bookmarks?) and then closed itself with no chapters found, even though I specifically used your `browser_cookie3.firefox` line.
I don't understand how the SET command works either. I tried typing the lines in the PyCharm terminal, then opened up lncrawl, but it similarly failed like the above.
Also, is it possible to change which browser lncrawl opens by default? Thanks in advance.
The default browser seems to always be google-chrome when it needs a browser. I've actually run into similar issues, but I'll summarize them this evening instead (ping me tomorrow if I forget). I'll try to come up with a fix to land on master, since I think NU must be supported correctly.
@picnixz Is it possible to give more detailed instructions on how to make NU work?
I don't use it directly on Windows since I'm on Linux, so I cannot really help for that OS. Now, what I know is:

- The `browser_cookie3` patch works for me on Linux, but it appears that your cookie file cannot be found (hence the "no chapters").
- What I find weird is that the approach with environment variables (i.e., where you specify `NU_WORDPRESS_LOGGED_IN` directly) does not work either: the `browser_cookie3` step is only there to get the cookies for authentication and should not depend on whichever browser you decided to use.

> I don't understand how the set command works either. I tried typing them in the pycharm terminal, then opened up lncrawl, but it similarly failed like the above result.

Here are some better instructions (the screenshots from the original comment are omitted here):

1. First apply the patches to the sources. You need to edit them, so make sure that when you run `lncrawl` you are actually using the "edited" version. To verify this, add a `print(1234)` at the bottom of the file you were editing (namely `lncrawl/templates/novelupdates.py`). For instance:

   ```python
   class NovelupdatesTemplate(...):
       # (omitted)
       ...

   print(1234)
   ```

2. Run `python.exe -m lncrawl.templates.novelupdates` (on Windows) or `python3 -m lncrawl.templates.novelupdates` (on Unix) and check whether "1234" is printed. If not, you are not using the correct sources, so this tells you whether your patch is being picked up at all.
3. Open PyCharm normally and open the PyCharm terminal.
4. In that terminal, set the cookie variables (i.e., run the `export` or the `SET` lines, depending on your OS).
5. Run `lncrawl -q [your URL]` and it should work.

I added the print function in the code, and the '1234' does appear in the terminal.
With the fix of using your Firefox-specific line, namely

```python
cookie_file = None
for cookie in browser_cookie3.firefox(cookie_file, domain_name='www.novelupdates.com'):
```

it simply brings up a Chrome window without any of my extensions or cookies, runs from there, and fails. I assume there's some code in lncrawl that only opens a freshly installed Chrome window and never the Firefox I'm actually using, so the cookie-grabbing step never happens.
However, this time I tried the Chrome code as well (and logged in to NU in Chrome), with the line

```python
for cookie in browser_cookie3.chrome(cookie_file, domain_name='www.novelupdates.com'):
```

It gave me this error in the terminal and doesn't open a Chrome window:

```
could not detect version_main.therefore, we are assuming it is chrome 108 or higher
```

The manual cookie grab opens a blank Chrome window and fails like Firefox. My version of lncrawl is 3.5.0, as checked by `lncrawl -v`.
> With the fix of using your given firefox specifically

Don't use this patch; try the patch with environment variables only. It appears that `browser_cookie3` misbehaves sometimes. Also, the Chrome window opening issue is a separate one, so let's fix the chapters issue first.
I think I tried every fix that you proposed. I ultimately let the extension-less, cookie-less Chrome browser open the source novel, ran it with and without your proposed fixes in the code (to test whether there was any difference), quickly logged in manually in a new tab, then refreshed the original novel page, and this captured all the links. A new browser window for woopread appeared and ran through the links slowly, and my chosen epub conversion worked. I'm pretty sure this means the chapters were all captured correctly.
I think somewhere along the way, lncrawl either ignores my default browser or your cookie fixes and defaults to opening a Chrome browser with no cookies. I'll try updating whatever software I can and see if things change in the meantime.
Edit: I tried restarting my PC, created a new project, and installed everything again; it didn't work. It still brings up the extension-less, cookie-less Chrome browser.
My patch no longer works but I may have a solution. I suspect that NU updated something after possibly observing what I actually suggested as a workaround and thus I will not post any solution publicly anymore (nor privately). Sorry.
> My patch no longer works but I may have a solution. I suspect that NU updated something after possibly observing what I actually suggested as a workaround and thus I will not post any solution publicly anymore (nor privately). Sorry.
Oh, I didn't even realize there was a new update from you. To be honest, your code never worked for me, probably because I'm on Windows, and I have been manually writing in the login details after lncrawl brings up a new browser window. Lately even this method fails, because the Chrome window doesn't even get opened, along with various other exception errors in lncrawl. Thanks for the help in any case.
Yes
Novel URL: Any on Novelupdates, e.g. https://www.novelupdates.com/series/yigret
App Location: PIP
App Version: v3.4.0
Describe this issue
This source was working earlier today with this version of the application, and just stopped working within the last 8 hours. Below is a debug output from the example series scrape:
```
Namespace(log=3, log_file=None, list_sources=False, crawler=[], novel_page='https://www.novelupdates.com/series/yigret', query=None, login=None, output_formats=[], add_source_url=False, single=True, multi=False, output_path=None, filename=None, filename_only=True, force=False, ignore=False, all=True, first=None, last=None, page=None, range=None, volumes=None, chapters=None, proxy_file=None, auto_proxy=False, bot=None, shard_id=0, shard_count=1, selenium_grid=None, suppress=True, ignore_images=False, close_directly=False, extra={})
16:49:04 [DEBUG] (lncrawl.core.sources) Loading current index data from /root/.lncrawl/sources/_index.json
16:49:04 [DEBUG] (lncrawl.core.sources) Current index was already downloaded once
16:49:04 [DEBUG] (lncrawl.core.sources) Saving current index data to /root/.lncrawl/sources/_index.json
16:49:04 [DEBUG] (lncrawl.core.sources) Saving current index data to /root/.lncrawl/sources/_index.json

➡ Press Ctrl + C to exit

16:49:04 [INFO] (lncrawl.core.app) Initialized App
16:49:04 [INFO] (lncrawl.bots.console.integration) Detected URL input
16:49:04 [INFO] (lncrawl.core.sources) Initializing crawler for: https://www.novelupdates.com/ [/data/python/lncrawl/lib/python3.10/site-packages/sources/multi/novelupdates.py]
Retrieving novel info...
16:49:04 [DEBUG] (lncrawl.core.scraper) [GET] https://www.novelupdates.com/series/yigret timeout=(7, 301), allow_redirects=True, proxies={}, headers={b'Origin': b'https://www.novelupdates.com', b'Referer': b'https://www.novelupdates.com/', b'User-Agent': b'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:105.0) Gecko/20100101 Firefox/105.0'}
16:49:04 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): www.novelupdates.com:443
16:49:04 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "GET /series/yigret HTTP/1.1" 301 None
16:49:04 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "GET /series/yigret/ HTTP/1.1" 200 None
16:49:04 [DEBUG] (lncrawl.core.scraper) [POST] https://www.novelupdates.com/wp-admin/admin-ajax.php data={'action': 'nd_getchapters', 'mygrr': '1', 'mypostid': '46784'}, allow_redirects=True, proxies={}, headers={b'Content-Type': b'application/x-www-form-urlencoded; charset=UTF-8', b'Origin': b'https://www.novelupdates.com', b'Referer': b'https://www.novelupdates.com/series/yigret', b'User-Agent': b'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:105.0) Gecko/20100101 Firefox/105.0'}
16:49:04 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): www.novelupdates.com:443
16:49:05 [DEBUG] (urllib3.connectionpool) https://www.novelupdates.com:443 "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 None

❗ Error: No chapters found
<class 'Exception'>
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 107, in start
    raise e
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 101, in start
    _download_novel()
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/bots/console/integration.py", line 85, in _download_novel
    self.app.get_novel_info()
  File "/data/python/lncrawl/lib/python3.10/site-packages/lncrawl/core/app.py", line 137, in get_novel_info
    raise Exception("No chapters found")

16:49:05 [INFO] (lncrawl.core.app) App destroyed
```