Closed by mzltest 1 year ago
I think there is no way to deal with it apart from having a custom handler in FTR or graby that grabs the content of the meta tag, parses it, and then interprets it. WDYT @fivefilters?
Thanks for the reply. After some testing, it seems that in this example a regex can do the job of JSON extraction (assuming the JSON is minified and does not include extra spaces): (?<=\<meta\s+name="preload-data"\s+id="meta-preload-data"\s+content=['|"])(.+?)(?=['|"]>)
And for reading the article content, it can be (?<=\<meta\s+name="preload-data"\s+id="meta-preload-data"\s+content=['|"].+?"content"\:")(.+?)(?=",")
(though I think this method is not reliable)
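For comparison, here is a minimal Python sketch of the "grab the meta content, parse it, then interpret it" idea, which avoids the regex. It is not something Full-Text RSS or graby provides. The helper names (`MetaContentFinder`, `find_key`, `extract_preload_content`) and the sample payload are made up for illustration, and it assumes the article text lives under a key named "content" somewhere in the JSON, as the second regex above suggests.

```python
import html
import json
from html.parser import HTMLParser


class MetaContentFinder(HTMLParser):
    """Collect the content attribute of the first <meta> tag with a given id."""

    def __init__(self, meta_id):
        super().__init__()
        self.meta_id = meta_id
        self.content = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("id") == self.meta_id and self.content is None:
            # HTMLParser has already unescaped entities such as &quot; here.
            self.content = attr_map.get("content")


def find_key(node, key):
    """Depth-first search for the first non-null value stored under `key`."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_preload_content(page_html, meta_id="meta-preload-data", key="content"):
    """Return the value under `key` in the JSON embedded in the meta tag, or None."""
    finder = MetaContentFinder(meta_id)
    finder.feed(page_html)
    if finder.content is None:
        return None
    return find_key(json.loads(finder.content), key)


# Hypothetical payload; the real structure of the page's JSON is an assumption.
payload = {"novel": {"7943012": {"title": "t", "content": 'He said "hi", then left'}}}
sample = '<meta name="preload-data" id="meta-preload-data" content="%s">' % html.escape(
    json.dumps(payload), quote=True
)
print(extract_preload_content(sample))
```

Because the JSON goes through a real parser, escaped quotes and extra whitespace inside the payload do not matter, which is where the regex version is most likely to break.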
Sorry for the slow reply. It's not possible to do much with JSON in Full-Text RSS at the moment. It might be useful to add support for it at some point, as it's much lighter than the alternative of loading a headless browser and waiting for the JavaScript to execute. The downside, of course, is that it's unlikely that the content can be automatically extracted, and you'll probably need additional selectors for the JSON object.
In a similar project (RSSHub), the parsing rules are pretty flexible, as they are JavaScript scripts ( https://docs.rsshub.app/en/joinus/quick-start.html#submit-new-rss-rule ), and they provide puppeteer as a last resort.
However, the filters here are a set of commands and can't really do much at the moment, and AFAIK PHP doesn't have good JavaScript support (the V8 extension is only available if it is compiled in), so it will probably have issues with JavaScript-intensive pages or APIs that require decryption.
Anyway thanks for the good project for making webpages more accessible.
Let me explain my issue with an example: https://www.pixiv.net/novel/show.php?id=7943012 Using the fivefilter build-your-own-config, I could get this HTML: https://gist.github.com/mzltest/60038f4326e13d07653a6e43142c3cf0 (in case you are wondering, I don't understand Japanese either, I just pulled a random example)
Apparently the data is not in the DOM, so XPath would be useless. However, if you look into the tag
<meta name="preload-data" id="meta-preload-data" content=
you can find some interesting data. Actually, the content is in that tag, with the data stored as JSON. So is there a way to parse that JSON? The easiest way (and the more error-prone one if you are not careful) is a regex; however, when I read the documentation I only found the replace-string function, which I think is probably not enough. This kind of embedded data seems to be getting popular on modern websites, and I found that some websites store the data in a
<script>window.__INITIAL_STATE__=
as a JavaScript object. Is there a way to parse it?
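For the window.__INITIAL_STATE__ case, one possible approach (sketched in Python, not a Full-Text RSS feature) is to locate the assignment and let a JSON decoder read exactly one value from that position. This only works under the assumption that the object literal happens to be valid JSON; a literal with unquoted keys, undefined, or function calls would need a real JavaScript parser. The function name `extract_initial_state` and the sample markup are hypothetical.

```python
import json


def extract_initial_state(page_html, marker="window.__INITIAL_STATE__"):
    """Parse the JSON value assigned to `marker`, or return None if that fails."""
    start = page_html.find(marker)
    if start == -1:
        return None
    pos = start + len(marker)
    # Skip the "=" and any surrounding whitespace; raw_decode does not skip it.
    while pos < len(page_html) and page_html[pos] in " \t\r\n=":
        pos += 1
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so a trailing ";" or further script code is ignored.
        value, _end = json.JSONDecoder().raw_decode(page_html, pos)
    except json.JSONDecodeError:
        # Assumption violated: the literal is not valid JSON.
        return None
    return value


# Hypothetical page fragment for illustration.
sample_page = (
    '<script>window.__INITIAL_STATE__ = {"article": {"title": "t", "body": "x"}};'
    "(function(){})();</script>"
)
state = extract_initial_state(sample_page)
print(state["article"]["body"])
```

Reading a single value from the marker onward avoids having to guess where the object ends, which is the hard part when doing this with a regex.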