RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

XenForo Bridge, scrape entry titles from span.threadmarkLabel if substring /reader/ is present in url #1739

Open Church- opened 4 years ago

Church- commented 4 years ago

Is your feature request related to a problem? Please describe.

I'm a little frustrated with how feed entry titles are currently generated when scraping serialized fiction threads from XenForo-based forums. The titles look ugly, and they make it slightly annoying to see which posts have been updated in tt-rss.

Describe the solution you'd like

For XenForo-based forum threads, if the URL contains the substring /reader/ (which denotes a reader-mode thread showing only the posts threadmarked by the thread OP), I'd like the bridge to take feed entry titles from span elements with the class threadmarkLabel.

Describe alternatives you've considered

Although my knowledge of PHP is nil, I've tried implementing this myself, with no luck. Despite setting the title in a conditional block like this:

if(strpos($url, "/reader/") !== false){
    $title = $post->find('span.threadmarkLabel', 0)->plaintext;
} else {
    $title = $post->find('div[class~="message-content"] article', 0)->plaintext;
}

under extractThreadPostsV2, I'm still getting the current default entry title scraping behavior.

Even doing away with the conditional block and always setting the title to:

$title = $post->find('span.threadmarkLabel', 0)->plaintext;

defaults to the current standard behavior somehow.

Additional context

[image]

Ideally I'd like to scrape the text from there. It's encoded as:

<span id="threadmark-88730" class="threadmarkLabel " data-xf-init="tooltip" data-original-title="Threadmark created by UnwelcomeStorm on Aug 11, 2016">Chapter 1</span>

Ideally if someone could tell me where I'm being stupid, I'll just patch it into my local copy.
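For reference, here is a minimal sketch of the extraction being asked for, run against the span quoted above. It is not the bridge's code: it uses PHP's built-in DOM extension as a stand-in for the find() call, and threadmarkTitle(), the example URL and the 'untitled' fallback are all hypothetical names chosen for illustration.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension as a stand-in for the
// bridge's own HTML parser. threadmarkTitle() is a hypothetical helper.

function threadmarkTitle(string $postHtml, string $url, string $fallback): string
{
    // Only reader-mode URLs are expected to carry threadmark labels.
    if (strpos($url, '/reader/') === false || trim($postHtml) === '') {
        return $fallback;
    }

    $previous = libxml_use_internal_errors(true); // tolerate HTML5-style markup
    $doc = new DOMDocument();
    // The XML declaration prefix makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $postHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole token, so "threadmarkLabel " (with a
    // trailing space, as in the forum markup) still matches.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' threadmarkLabel ')]"
    );

    if ($nodes === false || $nodes->length === 0) {
        return $fallback; // no threadmark found in this post
    }

    $label = trim($nodes->item(0)->textContent);
    return $label !== '' ? $label : $fallback;
}

// Usage with the span from the issue (tooltip attribute shortened).
$html = '<span id="threadmark-88730" class="threadmarkLabel " data-xf-init="tooltip">Chapter 1</span>';
echo threadmarkTitle($html, 'https://forum.example/threads/story.1/reader/', 'untitled'), PHP_EOL;
```

Returning a fallback instead of reading a property off a missing node means a post without a threadmark falls back to the old title rather than raising an error.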

Church- commented 4 years ago

@LogMANOriginal You wrote the XenForo bridge, so perhaps you have some thoughts?

I'm admittedly puzzled. Even after changing how I scrape entry titles to the last method mentioned in my comment (completely replacing the title extraction, with no branching on the URL substring), the bridge seems to fall back to its original behavior, even though that code no longer exists in my instance.

em92 commented 4 years ago

@Church-, @LogMANOriginal has been inactive for a long time. I'm not sure he will answer.

Church- commented 4 years ago

@em92 Ah that's good to know, thank you.

Hmm, I don't suppose you might know whether I'm looking in the right place?

I had assumed the item array holds all the contents for a given feed item, and that changing the title in that array would therefore change the feed entry title in my RSS reader.

Perhaps I'm wrong, though, as the current behavior seems to be overriding my patch.

Or perhaps I'm just using the find() method incorrectly to select a span by its class.
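To make the assumption about the item array concrete, here is a small sketch of a feed item whose title comes from a scraped label when one exists. buildFeedItem() is a hypothetical helper, and the keys 'uri', 'title' and 'content' are assumed to mirror what the bridge puts into its item array; the real bridge may set more fields or set them elsewhere.

```php
<?php
// Sketch only: buildFeedItem() is a hypothetical helper, and the keys
// 'uri', 'title' and 'content' are assumed to mirror the bridge's item array.

function buildFeedItem(string $postUrl, ?string $threadmarkLabel, string $postText): array
{
    $label = trim((string) $threadmarkLabel);

    // Fall back to the first line of the post when no label was scraped.
    $firstLine = explode("\n", trim($postText), 2)[0];

    return [
        'uri'     => $postUrl,
        'title'   => $label !== '' ? $label : $firstLine,
        'content' => $postText,
    ];
}

// Usage: the label wins when present.
$item = buildFeedItem(
    'https://forum.example/threads/story.1/reader/#post-1',
    'Chapter 1',
    "It began quietly.\nMore text."
);
echo $item['title'], PHP_EOL;
```

If a title built this way still comes out as post text, the label passed in was empty, which in this sketch points at the selector rather than at the item array.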