RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Add JSON extraction embedded in HTML script element #4106

Open hkcomori opened 1 month ago

hkcomori commented 1 month ago

I want to extract JSON embedded in HTML script elements so that it can be processed with a JSON dotted path. To do this, I have added a format that outputs only the bare content. "Barejson" is a name I coined because existing format names could not explain the behavior, so if you have a better idea, I would like to adopt it.

This format can output only one item, so an error occurs if more or fewer than one item is found.
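
The exactly-one-item rule can be illustrated with a minimal Python sketch. This is not the PR's PHP code: `render_bare_json` and the `items` list of dicts are hypothetical names used only for illustration, and validating the content by parsing it is an added assumption rather than something the PR is known to do.

```python
import json


def render_bare_json(items):
    """Return the content of the single item as bare JSON text.

    `items` is a hypothetical list of dicts with a "content" string,
    standing in for feed items; it is not RSS-Bridge's real data model.
    """
    if len(items) != 1:
        # Only one bare document can be emitted, so any other count is an error.
        raise ValueError(f"expected exactly 1 item, found {len(items)}")
    content = items[0]["content"]
    # Assumption: parse and re-serialize so that invalid JSON fails early.
    return json.dumps(json.loads(content))
```

Calling it with a single item returns that item's JSON on its own, with no feed wrapper around it; an empty list or a list of two items raises `ValueError`.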

This is triggered by the following discussion: https://github.com/FreshRSS/FreshRSS/discussions/6406

dvikan commented 1 month ago

Sorry, I don't understand the use case here.

Maybe show an example of the usage?

hkcomori commented 1 month ago

I want to use a JSON dotted path to get information from JSON that is embedded in a script element, such as the following on this page. It contains information about the articles that should appear in the RSS feed. Most of it can be read from the HTML, but some information exists only in the JSON.

<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"workId":"018d6a5c-b9f2-77db-9191-e7cc6fbfdce2", ... }</script>

The JSON must be separated from the HTML because a JSON dotted path cannot be applied to HTML. The feature in this PR extracts the JSON, and the JSON dotted path then processes the result.
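
The two steps described above (pull the JSON out of the script element, then walk it with a dotted path) can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not how RSS-Bridge or FreshRSS implement it: `ScriptJsonExtractor` and `get_by_dotted_path` are hypothetical helpers, and the sample page and its `workId` value are made up.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text of every <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.found = []        # text of each matching <script>
        self._inside = False   # True while inside a matching <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True
            self.found.append("")

    def handle_data(self, data):
        if self._inside:
            self.found[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def get_by_dotted_path(data, path):
    """Walk nested dicts using a dot-separated path such as 'a.b.c'."""
    for key in path.split("."):
        data = data[key]
    return data


# Made-up page; the real page's JSON is much larger and is not reproduced here.
html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"workId": "example-id"}}}'
    "</script></body></html>"
)

parser = ScriptJsonExtractor("__NEXT_DATA__")
parser.feed(html)
parser.close()

data = json.loads(parser.found[0])
work_id = get_by_dotted_path(data, "props.pageProps.workId")
```

The dotted path only makes sense once the script text has been parsed as JSON, which is why the extraction step has to come first.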

hkcomori commented 1 month ago

XPathBridge example:

Enter web page URL: https://comic-walker.com/detail/KC_003160_S?episodeType=latest
Item selector: //script[@id="__NEXT_DATA__"]
Item title selector: "JSON"
Item description selector: ./text()
Use raw item description: true

hkcomori commented 1 month ago

Would it be better to create a bridge that extracts RSS from embedded JSON, instead of adding a format like this for intermediate files?

dvikan commented 1 month ago
  1. Are you aware that a JsonFormat already exists?

  2. Have you tested this PR, and does it do what you need?

hkcomori commented 1 month ago
  1. Are you aware that a JsonFormat already exists?

Of course, I tried JsonFormat first. I expected the following result:

{
    ...
    "content": {
        "key": "value"
    }
}

But in fact, the content was converted to a string, so the raw JSON could not be extracted:

{
    ...
    "content": "{\"key\": \"value\"}"
}

  2. Have you tested this PR, and does it do what you need?

Yes. I confirmed that the result is the raw JSON content and that a JSON dotted path can process it.
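
The difference between the expected and the actual JsonFormat output described above can be reproduced with a small Python sketch. The item below is a made-up stand-in with only a `content` field; it is not real JsonFormat output.

```python
import json

# Shape reported for JsonFormat: "content" holds a JSON *string*, not an object.
item_text = r'{"content": "{\"key\": \"value\"}"}'
item = json.loads(item_text)

# After one decode, "content" is still text, so a dotted path such as
# content.key has nothing to descend into.
content_is_string = isinstance(item["content"], str)

# A second decode is needed to reach the embedded object.
inner = json.loads(item["content"])
value = inner["key"]
```

A consumer that decodes only once, as a dotted-path lookup does, stops at the string, which is the limitation the bare output is meant to avoid.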