fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

How to deal with JavaScript objects/JSON parsing #1008

Closed: mzltest closed this issue 1 year ago

mzltest commented 1 year ago

Let me explain my issue with an example page ( https://www.pixiv.net/novel/show.php?id=7943012 ). Using the FiveFilters build-your-own-config tool, I could get this HTML: https://gist.github.com/mzltest/60038f4326e13d07653a6e43142c3cf0 (in case you are interested: I don't understand Japanese either, I just pulled a random example).

Apparently the data is not in the DOM, so XPath would be useless. However, if you look at the tag <meta name="preload-data" id="meta-preload-data" content= you can find some interesting data. Actually, the article content is in that tag, stored as JSON. So is there a way to parse that JSON? The easiest way (and the most error-prone if you are not careful) is regex, but when I read the documentation I only found the replace-string function, which I think is probably not enough.

This kind of embedded data seems to be getting popular on modern websites. I also found that some websites store the data as a JavaScript object in a <script>window.__INITIAL_STATE__= assignment. Is there a way to parse it?
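To illustrate the second case outside of Full-Text RSS, here is a minimal Python sketch of pulling such an object out of a page. It assumes the assigned value happens to be valid JSON (a real JavaScript object literal with unquoted keys or function values would not parse this way). The function name `extract_initial_state` and the sample markup are hypothetical.

```python
import json
import re


def extract_initial_state(html_text, var_name="window.__INITIAL_STATE__"):
    """Return the value assigned to var_name, or None if it is absent or not JSON."""
    # Find the assignment and skip any whitespace around the "=" sign.
    match = re.search(re.escape(var_name) + r"\s*=\s*", html_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (a trailing ";" or "</script>").
        value, _end = json.JSONDecoder().raw_decode(html_text, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Hypothetical page fragment; real sites use their own field names.
sample = '<script>window.__INITIAL_STATE__={"article":{"title":"T","body":"<p>Hi</p>"}};</script>'
state = extract_initial_state(sample)
```

Letting the JSON decoder find the end of the value avoids having to guess the closing delimiter with a regex.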

j0k3r commented 1 year ago

I think there is no way to deal with it apart from having a custom handler in FTR or graby that grabs the content of the meta tag, parses it, and then interprets it. WDYT @fivefilters?

mzltest commented 1 year ago

Thanks for the reply. After some testing, it seems that in this example regex can do the job of JSON extraction (assuming the JSON is minified and does not include extra spaces): (?<=\<meta\s+name="preload-data"\s+id="meta-preload-data"\s+content=['|"])(.+?)(?=['|"]>) And for reading the article content it can be (?<=\<meta\s+name="preload-data"\s+id="meta-preload-data"\s+content=['|"].+?"content"\:")(.+?)(?=",") (though I think this method is not reliable).
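As a point of comparison with the regex approach, the same attribute can be read with an HTML parser and then decoded as JSON, which does not depend on attribute order or quoting. This is only a sketch in Python, not something Full-Text RSS supports. The class and function names are hypothetical, and the JSON field names in the sample are made up for illustration.

```python
import json
from html.parser import HTMLParser


class PreloadDataParser(HTMLParser):
    """Collects the content attribute of <meta id="meta-preload-data">."""

    def __init__(self):
        super().__init__()
        self.raw = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("id") == "meta-preload-data":
            # HTMLParser has already unescaped entities such as &quot; here.
            self.raw = attr_map.get("content")


def extract_preload_data(html_text):
    """Return the decoded JSON from the preload-data meta tag, or None."""
    parser = PreloadDataParser()
    parser.feed(html_text)
    return json.loads(parser.raw) if parser.raw else None


# Hypothetical sample page; the real structure may differ.
sample = (
    '<html><head><meta name="preload-data" id="meta-preload-data" '
    "content='{\"novel\":{\"1\":{\"title\":\"T\",\"content\":\"Hello\\nworld\"}}}'>"
    "</head></html>"
)
data = extract_preload_data(sample)
```

Because the JSON decoder handles string escapes, sequences such as \n or \" inside the article text come out correctly, which is where the second regex above is most likely to break.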

fivefilters commented 1 year ago

Sorry for the slow reply. It's not possible to do much with JSON in Full-Text RSS at the moment. It might be useful to add support for it at some point, as it's much lighter than the alternative of loading a headless browser and waiting for the JavaScript to execute. The downside, of course, is that it's unlikely that the content can be automatically extracted, and you'll probably need additional selectors for the JSON object.
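To make the idea of "selectors for the JSON object" concrete, one possible shape is a dotted path that walks nested objects and arrays. The following Python sketch is purely hypothetical and is not an existing Full-Text RSS feature. The `select` helper, the path syntax, and the sample field names are all assumptions.

```python
def select(data, path):
    """Follow a dotted path such as "novel.7943012.content" through parsed JSON.

    Returns None as soon as a path segment cannot be resolved.
    """
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            # Numeric segments index into arrays.
            current = current[int(key)]
        else:
            return None
    return current


# Hypothetical parsed JSON; the field names are illustrative only.
doc = {"novel": {"7943012": {"title": "Example", "content": "Body text"}}}
body = select(doc, "novel.7943012.content")
```

A site config would then need one such path per field (title, body, author), since nothing in the JSON itself marks which value is the article text.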

mzltest commented 1 year ago

In a similar project (RSSHub), parsing is quite flexible because the parsing rules are JavaScript scripts ( https://docs.rsshub.app/en/joinus/quick-start.html#submit-new-rss-rule ), and they provide Puppeteer as a last resort.

However, the filters here are a fixed set of commands and can't really do much at the moment, and AFAIK PHP doesn't have good JavaScript support (the V8 extension is only available if you compile it yourself), so there will probably be issues with JavaScript-intensive pages or APIs that require decryption.

Anyway, thanks for the good project that makes web pages more accessible.