Closed artur-shaik closed 4 years ago
The site has the following line in the HTML header:
<noscript><meta http-equiv="refresh" content="0; URL=/badbrowser.php"></noscript>
But if I test in Full-Text RSS, which uses PHP/7.2 as the HTTP User-Agent header value by default, the site redirects me to the mobile version of the requested page at m.vk.com (which has no badbrowser.php redirect). So that might be an option: explicitly specifying a non-browser User-Agent string. Are you able to check whether the mobile version contains the content you need?
Otherwise, I think using replace_string in the site config file to rewrite the refresh meta tag should work, at least for Full-Text RSS.
Yeah, adding http_header(user-agent): PHP/7.2 to vk.com.txt helps. Now I can parse m.vk.com.
Thank you.
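To try the same header outside Full-Text RSS, here is a minimal Python sketch that builds a request with the User-Agent value discussed above. The `build_request` helper and the example page URL are made up for illustration, and the request is only constructed, not sent. Whether vk.com still serves m.vk.com to non-browser agents is an assumption based on the behaviour described in this thread.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str = "PHP/7.2") -> Request:
    """Build (but do not send) a GET request with a non-browser User-Agent.

    Hypothetical helper; it mirrors what `http_header(user-agent): PHP/7.2`
    asks the fetcher to do, without claiming to match its implementation.
    """
    return Request(url, headers={"User-Agent": user_agent})


# Example URL is an assumption chosen for illustration only.
req = build_request("https://vk.com/dighistory")

# urllib normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual fetch and follow any server-side redirect.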
👍 Thanks for the update. Haven't tested much, but added that line to vk.com.txt for now.
If you have additional extraction rules for the site, feel free to contribute.
Yes, I would like to, but I have some problems with image fetching. Can I ask for advice here?
Yes, of course.
<div class="wi_body">
<div class="pi_text" onclick="return post.wallPostOpen(this, event);">TEXT CONTENT IS HERE.<br><br><a href="/dighistory?q=%23%D0%9D%D0%BE%D0%B2%D0%BE%D0%B5%D0%B2%D1%80%D0%B5%D0%BC%D1%8F">#Новоевремя@dighistory</a></div>
<div class="pi_medias thumbs_list thumbs_list1 audios_list medias_audios_list">
<div class="medias_thumbs medias_thumbs_map">
<div class="thumbs_map_wrap" style="width: 374px;">
<div class="thumbs_map_helper" style="padding-top: 96.2567%;">
<div class="thumbs_map fill">
<a href="/photo-144904445_457265768?rev=1&post=-144904445_313083&from=post" aria-label="фотография" style="width: 100%; height: 100%; margin: 0 0% 0% 0;" class="thumb_map thumb_map_wide thumb_map_l al_photo" onclick="return photo.zopen(this, event, '-144904445_457265768', 'album-144904445_00/rev');">
<div style="margin-left: -0.0442%; background-image: url(https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg);" class="thumb_map_img thumb_map_img_as_div" data-id="-144904445_457265768" data-src_big="https://sun7-9.userapi.com/4lJNgGc76cfFF7CQvSkxgYO2pXnPhu972KQkpA/SqFtZpo1mi8.jpg|685|658" data-restriction="null"></div>
</a>
</div>
</div>
</div>
</div>
</div>
</div>
This is a typical post on vk.com. And here is the body config I used:
body: //div[contains(concat(' ',normalize-space(@class),' '),' pi_text ')] | //div[contains(@class, 'pi_medias')]
But the image couldn't be fetched.
I have tried to use replace_string, but I'm not sure how I can extract the image from the style or data-src_big attribute.
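To make the two candidate sources concrete, here is a small Python sketch that reads both URLs from the markup above. It is only an illustration of where the URLs live, not something a site config can run. The `ThumbParser` and `extract_images` names are hypothetical, and the assumption that data-src_big holds the URL first, followed by `|`-separated values, is inferred from the single sample in this thread.

```python
import re
from html.parser import HTMLParser


class ThumbParser(HTMLParser):
    """Collect image URLs from elements carrying the thumb_map_img class."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "thumb_map_img" not in (a.get("class") or "").split():
            return
        # Assumption: the URL is the first "|"-separated field of data-src_big.
        big = (a.get("data-src_big") or "").split("|")[0]
        match = re.search(r"background-image:\s*url\(([^)]+)\)", a.get("style") or "")
        self.images.append({
            "thumb": match.group(1) if match else None,
            "big": big or None,
        })


def extract_images(html: str) -> list:
    parser = ThumbParser()
    parser.feed(html)
    return parser.images


sample = (
    '<div style="margin-left: -0.0442%; background-image: '
    'url(https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg);" '
    'class="thumb_map_img thumb_map_img_as_div" data-id="-144904445_457265768" '
    'data-src_big="https://sun7-9.userapi.com/4lJNgGc76cfFF7CQvSkxgYO2pXnPhu972KQkpA/SqFtZpo1mi8.jpg|685|658" '
    'data-restriction="null"></div>'
)
images = extract_images(sample)
print(images)
```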
There's no good way to work with background-image URLs in Full-Text RSS at the moment. Perhaps there should be. So solutions have to be a little hacky. In such cases you have to get a little creative with replace_string. I usually look to see how much of the surrounding markup is standard template strings.
So for example, I'd want to try and turn this...
<div style="margin-left: -0.0442%; background-image: url(https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg);" class="thumb_map_img thumb_map_img_as_div" ...></div>
into this...
<img src="https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg">
Here's what the site config looks like in my attempt:
body: //div[contains(concat(' ',normalize-space(@class),' '),' wi_body ')]
find_string: background-image: url(
replace_string: display:none;"></div><img src="
find_string: );" class="thumb_map_img
replace_string: "><div style="display:none;
strip_id_or_class: wi_like_wrap
strip_id_or_class: wi_buttons
prune: no
It's essentially splitting the first <div> element into three elements: <div style="display:none"></div><img src="[background-image URL]"><div style="display:none ..."></div>
Full-Text RSS strips any element with style="display:none;", so those elements will just get removed in the final output. But even if they aren't, they shouldn't display anything.
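The effect of the two find_string/replace_string pairs can be checked with a short Python sketch. It assumes the pairs are applied as plain, literal string replacements in the order listed, which is how they are used in the config above; `RULES` and `apply_rules` are names invented for this illustration, and the sample markup is shortened from the post.

```python
# Literal find/replace pairs copied from the site config above, in order.
RULES = [
    ('background-image: url(', 'display:none;"></div><img src="'),
    (');" class="thumb_map_img', '"><div style="display:none;'),
]


def apply_rules(html: str, rules=RULES) -> str:
    """Apply each (find, replace) pair as a plain substring replacement."""
    for find, replace in rules:
        html = html.replace(find, replace)
    return html


sample = (
    '<div style="margin-left: -0.0442%; background-image: '
    'url(https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg);" '
    'class="thumb_map_img thumb_map_img_as_div" data-restriction="null"></div>'
)
result = apply_rules(sample)
print(result)
```

The output contains a hidden leading <div>, a plain <img> tag with the background-image URL as its src, and a hidden trailing <div> that absorbs the leftover attributes.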
There is another issue with it. This tag:
<a href="/photo-144904445_457265768?rev=1&post=-144904445_313083&from=post" aria-label="фотография" style="width: 100%; height: 100%; margin: 0 0% 0% 0;" class="thumb_map thumb_map_wide thumb_map_l al_photo" onclick="return photo.zopen(this, event, '-144904445_457265768', 'album-144904445_00/rev');">
<div style="margin-left: -0.0442%; background-image: url(https://sun7-6.userapi.com/ROtMZHPCGAJt2vL_IQ4aCWGxZkO6tOevoeP4cw/nsWvMN4Yz8E.jpg);" class="thumb_map_img thumb_map_img_as_div" data-id="-144904445_457265768" data-src_big="https://sun7-9.userapi.com/4lJNgGc76cfFF7CQvSkxgYO2pXnPhu972KQkpA/SqFtZpo1mi8.jpg|685|658" data-restriction="null"></div>
</a>
becomes empty after the "HTML after regex empty nodes stripping" step.
So there is no div tag with the image as a result.
I forgot to say that I'm working in wallabag, and "HTML after regex empty nodes stripping" is from the graby debug log.
There might be additional processing in graby that prevents my hacky solution from working. What I posted isn't really ideal, but it's the kind of thing I'd do with Full-Text RSS in this situation. @j0k3r might know if the additional cleaning can be disabled in graby or not.
What I posted isn't really ideal
Actually, it works, but graby unconditionally strips this tag.
Is that regex stripping performed before the replace? Maybe share the full graby log, collapsed.
Yeah, sure:
Vk.com redirects to /badbrowser.php, so the content couldn't be fetched. But if we ignore the redirection, we have access to the content. Can we somehow disable redirection for the site?