feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License
212 stars, 34 forks

Update Arstechnica recipe to include image galleries #135

Closed by pR0Ps 5 years ago

pR0Ps commented 5 years ago

Rule Submission

Website: arstechnica.com

The regex is a bit ugly but does the job. Here's what an image gallery looks like in HTML (cleaned up a bit, with placeholder text in square brackets):

<ul>
    <li data-thumb="[tiny image url]" data-src="[fullsize image url]" data-responsive="[list of image urls followed by sizes]" data-sub-html="#caption-[caption id]">
        <figure style="height:[something]px;">
            <div class="image" style="background-image:url('[midsize image url]'); background-color:#000"></div>
            <figcaption id="caption-[caption id]">
                <span class="icon caption-arrow icon-drop-indicator"></span>
                <div class="caption">[some caption]</div>
                <div class="credit"><span class="icon icon-camera"></span>[some person]</div>
            </figcaption>
        </figure>
    </li>
    [many more <li></li>'s]
</ul>

The regex aims to pull out [fullsize image url] and [some caption] and convert them into the following format:

<figure><img src="[fullsize image url]"/><figcaption>[some caption]</figcaption></figure>

The regex explained:

<li.*? data-src="(.*?)".*?>             # match '<li [other attrs] data-src="url" [other attrs]>' and store the URL
\s*<figure.*?>.*?(?:<figcaption         # match the <figure><figcaption> tags
.*?<div class="caption">(.*?)</div>     # match the caption div and store the text inside it
.*?</figcaption>)?\s*</figure>\s*</li>  # match all the closing tags to reduce false positives
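To try the pattern outside of tt-rss, here is a minimal Python sketch. It is not the recipe's actual configuration: the names `GALLERY_ITEM`, `convert_gallery` and `SAMPLE` are made up for illustration, and the sample markup only imitates the gallery structure shown above. The four annotated pieces are joined into a single string because the pattern contains literal spaces, which a verbose/extended regex mode would ignore. Using `re.DOTALL` is an assumption, made because the gallery markup spans several lines.

```python
import re

# The pattern from the post, joined into one string with the annotations dropped.
# re.DOTALL is an assumption: "." must match newlines for the pattern to reach
# the closing tags in multi-line markup.
GALLERY_ITEM = re.compile(
    r'<li.*? data-src="(.*?)".*?>'
    r'\s*<figure.*?>.*?(?:<figcaption'
    r'.*?<div class="caption">(.*?)</div>'
    r'.*?</figcaption>)?\s*</figure>\s*</li>',
    re.DOTALL,
)


def convert_gallery(html: str) -> str:
    """Replace each gallery <li> with a plain <figure> (hypothetical helper)."""

    def to_figure(match: re.Match) -> str:
        url = match.group(1)
        # The caption group is optional, so it is None when no figcaption exists.
        caption = match.group(2) or ""
        return f'<figure><img src="{url}"/><figcaption>{caption}</figcaption></figure>'

    return GALLERY_ITEM.sub(to_figure, html)


# Made-up sample that mirrors the structure described above.
SAMPLE = """<ul>
    <li data-thumb="t1.jpg" data-src="full1.jpg" data-responsive="r1" data-sub-html="#caption-1">
        <figure style="height:100px;">
            <div class="image" style="background-image:url('mid1.jpg'); background-color:#000"></div>
            <figcaption id="caption-1">
                <span class="icon caption-arrow icon-drop-indicator"></span>
                <div class="caption">First caption</div>
                <div class="credit"><span class="icon icon-camera"></span>Someone</div>
            </figcaption>
        </figure>
    </li>
    <li data-thumb="t2.jpg" data-src="full2.jpg">
        <figure style="height:80px;">
            <div class="image"></div>
        </figure>
    </li>
</ul>"""

if __name__ == "__main__":
    print(convert_gallery(SAMPLE))
```

The second item in the sample has no figcaption, which exercises the optional `(?: ... )?` group: the item is still converted, just with an empty caption.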

Notes:

dugite-code commented 5 years ago

Fantastic explanation, thanks for the contribution