aitjcize / cppman

C++ 98/11/14 manual pages for Linux/MacOS
GNU General Public License v3.0

cppman can't download and parse pages from en.cppreference.com correctly #113

Closed NobodyXu closed 4 years ago

NobodyXu commented 4 years ago
$ cppman -s cppreference.com std::cout
Source set to `cppreference.com'.

$ cppman std::cout
error: bad escape \e at position 0

$ cppman -C
$ cppman -s cppreference.com -r
Source set to `cppreference.com'.

$ cppman -c -o 
By default, cppman fetches pages on-the-fly if corresponding page is not found in the cache. The "cache-all" option is only useful if you want to view man pages offline. Caching all contents will take several minutes, do you want to continue [y/N]? y
Caching manpages from cppreference.com ...
Caching std::numeric_limits::round_error ...
Retrying ...
Retrying ...
Retrying ...
Error caching std::numeric_limits::round_error ...
Caching C++ concepts: Destructible ...
Retrying ...
Retrying ...
Retrying ...
Error caching C++ concepts: Destructible ...
Caching Move constructors ...
Retrying ...
Retrying ...
Retrying ...
Error caching Move constructors ...
Caching std::pow(std::valarray) ...
Retrying ...
Retrying ...
Retrying ...
Error caching std::pow(std::valarray) ...
Caching std::recursive_mutex::try_lock ...
Retrying ...
^C
Aborted.

It seems that cppman is having problems parsing pages from cppreference, even though my browser can access cppreference fairly quickly:

$ ping en.cppreference.com
PING en.cppreference.com (74.114.90.46) 56(84) bytes of data.
64 bytes from cppreference.com (74.114.90.46): icmp_seq=1 ttl=47 time=250 ms
64 bytes from cppreference.com (74.114.90.46): icmp_seq=2 ttl=47 time=181 ms
64 bytes from cppreference.com (74.114.90.46): icmp_seq=3 ttl=47 time=203 ms
64 bytes from cppreference.com (74.114.90.46): icmp_seq=4 ttl=47 time=227 ms
64 bytes from cppreference.com (74.114.90.46): icmp_seq=5 ttl=47 time=248 ms
64 bytes from cppreference.com (74.114.90.46): icmp_seq=6 ttl=47 time=170 ms
^C
--- en.cppreference.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5005ms
rtt min/avg/max/mdev = 169.600/213.190/250.107/31.174 ms

Edit:

$ cppman --version
/usr/bin/cppman Ver 0.4.9
Copyright (C) 2010 Wei-Ning Huang
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Wei-Ning Huang (AZ) <aitjcize@gmail.com>.
aitjcize commented 4 years ago

I cannot reproduce this. Maybe you were rate limited at that time? Can you try again?

NobodyXu commented 4 years ago

I tried it again, but it still failed.

NobodyXu commented 4 years ago

Whether I used my Wi-Fi or the hotspot from my iPhone, it failed.

Using either my Wi-Fi or my hotspot, I can connect to cppreference normally, except that it takes a fair amount of time to load the page.

I suspect that the crawler sets the timeout so low that it considers cppreference inaccessible even though it can be accessed.

hexclover commented 4 years ago

I'm using 0.4.9 and also running into this issue. It seems the cppman -c error is also a bad escape \e at position 0 (though not printed), which comes from line 211 of formatter/cppreference.py:

data = re.compile(rp[0], rp[2]).sub(rp[1], data)

It's an error that occurred during the application of a RegEx. Maybe it has to do with the Python version (I'm on 3.7)?

hexclover commented 4 years ago

Eh, I took a look into formatter/{cplusplus,cppreference}.py and found two almost identical strings in them. In cplusplus.py:

    # Preserve \n" in EXAMPLE
    (r'\\n', r'\\en', 0),

In cppreference.py a \ is missing from the second raw string:

    # Preserve \n" in EXAMPLE
    (r'\\n', r'\en', 0),

If I add it back then cppman will work just fine...
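
The effect of the missing backslash can be reproduced outside cppman. The following is a minimal sketch, assuming Python 3.7 or later, where an unknown letter escape in a replacement template raises re.error. The apply_rules helper and the sample input are hypothetical; only the two rule tuples and the shape of the re.compile(...).sub(...) call come from the files quoted above.

```python
import re

# Rule tuples are (pattern, replacement, flags), as in the quoted formatter code.
RULE_CPLUSPLUS = (r'\\n', r'\\en', 0)      # as quoted from cplusplus.py
RULE_CPPREFERENCE = (r'\\n', r'\en', 0)    # as quoted from cppreference.py


def apply_rules(data, rules):
    """Hypothetical helper: apply each rule with the same call as line 211."""
    for rp in rules:
        data = re.compile(rp[0], rp[2]).sub(rp[1], data)
    return data


# Hypothetical input containing a literal backslash followed by 'n'.
sample = r'std::cout << "hi\n";'

# The doubled backslash in the replacement yields one literal backslash.
good = apply_rules(sample, [RULE_CPLUSPLUS])
print(good)

# The single backslash makes '\e' an unknown escape in the replacement.
try:
    apply_rules(sample, [RULE_CPPREFERENCE])
    bad = None
except re.error as exc:
    bad = str(exc)
print(bad)
```

With the doubled backslash the substitution rewrites the literal \n to \en; with the single backslash the call fails before producing any output, which would match the error cppman prints.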

Update: I tried Python 2.7, 3.6 and 3.7 with the following:

import re
re.compile(r'\en')

The result is that Python 3.x raises an error (bad escape \e) while Python 2.7 does not.

Update 2: There is another difference in these files that causes errors:

formatter/cplusplus.py
169:        tbl = re.compile(r'T{\n(\..*?)\nT}', re.S).sub(r'T{\n\\E \1\nT}', tbl)

formatter/cppreference.py
40:    tbl = re.compile(r'T{\n(\..*?)\nT}', re.S).sub(r'T{\n\E \1\nT}', tbl)
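
The same check can be applied to the table rule. This is a sketch under the same assumption (Python 3.7 or later); the input string is a made-up table cell, not taken from a real page, and only the pattern and the two replacement strings come from the lines quoted above.

```python
import re

# Pattern shared by both formatter files.
PATTERN = re.compile(r'T{\n(\..*?)\nT}', re.S)

# Hypothetical table cell whose content starts with a dot.
tbl = 'T{\n.B name\nT}'

# Replacement as quoted from cplusplus.py: '\\E' emits a backslash followed by E.
fixed = PATTERN.sub(r'T{\n\\E \1\nT}', tbl)
print(repr(fixed))

# Replacement as quoted from cppreference.py: '\E' is an unknown escape.
try:
    PATTERN.sub(r'T{\n\E \1\nT}', tbl)
    error = None
except re.error as exc:
    error = str(exc)
print(error)
```

The first substitution inserts a backslash and E in front of the captured cell content, while the second raises a bad escape error for \E instead.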
aitjcize commented 4 years ago

@hexclover thanks for figuring out the issue; I will send a fix for that.

Edit: this specific issue was actually fixed on master already.