edrlab / thorium-reader

A cross platform desktop reading app, based on the Readium Desktop toolkit
https://www.edrlab.org/software/thorium-reader/
BSD 3-Clause "New" or "Revised" License

OpenSearch failure with desLibris feed (OPDS1 XML) #1383

Closed danielweck closed 3 years ago

danielweck commented 3 years ago

https://api.deslibris.ca/api/feed

=>

<link rel="search" type="application/opensearchdescription+xml" title="Search on desLibris" href="https://api.deslibris.ca/opensearch-feed.xml" />

...

https://api.deslibris.ca/opensearch-feed.xml

=>

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>desLibris</ShortName>
  <Description>Search on desLibris</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">https://deslibris.ca/favicon.ico</Image>
  <Url type="application/atom+xml" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>

...

https://api.deslibris.ca/api/feed/search/{searchTerms}

danielweck commented 3 years ago

[Image attachment: image001]

llemeurfr commented 3 years ago

Note: search would work if the Url was using query params (e.g. https://api.deslibris.ca/api/feed/search?q={searchTerms}).

But as OpenSearch allows a url form that does not use query params, Thorium's code must be adapted to this use case.

danielweck commented 3 years ago

Technical notes:

OpenSearch URL Template mechanism: https://github.com/dewitt/opensearch/blob/master/opensearch-1-1-draft-6.md

... is not RFC 6570 URI Template! (different parsing grammar, semantics):

https://tools.ietf.org/html/rfc6570

Feedbooks example:

https://catalog.feedbooks.com/opensearch.xml

=>

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Feedbooks</ShortName>
  <Description>Search on Feedbooks</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">http://www.feedbooks.com/favicon.ico</Image>
  <Url type="text/html" template="https://catalog.feedbooks.com/search.html?query={searchTerms}"/>
  <Url type="application/atom+xml" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>

=>

https://catalog.feedbooks.com/search.atom?query={searchTerms}

versus:

https://catalog.feedbooks.com/catalog/index.json

=>

{
"metadata":{"title":"Feedbooks"},
"links":[
{"type":"application/opds+json","rel":"self","href":"https://catalog.feedbooks.com/catalog/index.json"},
{"type":"application/opds+json","rel":"search","href":"https://catalog.feedbooks.com/search.json{?query}","templated":true}

...

=>

https://catalog.feedbooks.com/search.json{?query}
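
To make the contrast with the OpenSearch syntax concrete, here is a minimal sketch of how the OPDS2 link `search.json{?query}` expands under RFC 6570. It handles only the form-style query operator (`{?a,b}`) and is not a full RFC 6570 implementation. The function names `encodeUnreserved` and `expandFormQuery` are hypothetical and do not come from Thorium.

```typescript
// Hypothetical helpers for illustration only (not Thorium code).

// Percent-encode everything outside the RFC 3986 "unreserved" set.
// encodeURIComponent leaves ! ' ( ) * untouched, so encode those as well.
function encodeUnreserved(value: string): string {
    return encodeURIComponent(value).replace(
        /[!'()*]/g,
        (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase(),
    );
}

// Expands only the form-style query operator, e.g. {?query} or {?query,page}.
// Variables without a value are omitted; if none has a value, the whole
// expression expands to an empty string.
function expandFormQuery(
    template: string,
    vars: Record<string, string | undefined>,
): string {
    return template.replace(/\{\?([^{}]+)\}/g, (_match: string, names: string) => {
        const pairs = names
            .split(",")
            .filter((name) => vars[name] !== undefined)
            .map((name) => `${name}=${encodeUnreserved(vars[name] as string)}`);
        return pairs.length > 0 ? "?" + pairs.join("&") : "";
    });
}
```

Note that in this syntax the variable name (`query`) also becomes the parameter name in the expanded URL, whereas an OpenSearch template spells out the parameter name itself (`?query={searchTerms}`).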

danielweck commented 3 years ago

Current code is naïve (but somewhat reasonable) search+replace:

https://github.com/edrlab/thorium-reader/blob/48198db7a6d3e982f819f31d84e686a56b29e54f/src/renderer/library/components/opds/SearchForm.tsx#L123

https://github.com/edrlab/thorium-reader/blob/48198db7a6d3e982f819f31d84e686a56b29e54f/src/renderer/library/redux/sagas/opds.ts#L26
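
For reference, a search-and-replace over an OpenSearch 1.1 template can be sketched as below. This is an illustration under stated assumptions, not the code at the links above: `fillOpenSearchTemplate` is a hypothetical name, and the sketch only covers `{name}` and optional `{name?}` parameters. The same substitution applies whether the placeholder sits in the path (desLibris) or in the query string (Feedbooks).

```typescript
// Hypothetical helper for illustration only (not Thorium's actual code).
// Fills an OpenSearch 1.1 URL template such as
//   https://api.deslibris.ca/api/feed/search/{searchTerms}
//   https://catalog.feedbooks.com/search.atom?query={searchTerms}
function fillOpenSearchTemplate(
    template: string,
    values: Record<string, string>,
): string {
    // Matches {name} and {name?}; a name may carry a prefix, e.g. {geo:lat}.
    return template.replace(
        /\{([^{}?]+)(\?)?\}/g,
        (match: string, name: string, optional?: string) => {
            const value = values[name];
            if (value !== undefined) {
                // Values are percent-encoded before substitution.
                return encodeURIComponent(value);
            }
            // An optional parameter without a value becomes an empty string;
            // a required one is left in place so the caller can spot it.
            return optional ? "" : match;
        },
    );
}
```

With `{ searchTerms: "robot" }`, the desLibris template yields `https://api.deslibris.ca/api/feed/search/robot` and the Feedbooks one yields `https://catalog.feedbooks.com/search.atom?query=robot`.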

danielweck commented 3 years ago

Related issue: https://github.com/edrlab/thorium-reader/issues/1382

danielweck commented 3 years ago

llemeurfr wrote: "Note: search would work if the Url was using query params"

Are we sure about that? Or is this just conjecture?

danielweck commented 3 years ago

I think it is a timeout issue (Saga race condition). The additional HTTP request for the OpenSearch XML succeeds, but too late.

https://github.com/edrlab/thorium-reader/blob/48198db7a6d3e982f819f31d84e686a56b29e54f/src/renderer/library/redux/sagas/opds.ts#L92-L95

danielweck commented 3 years ago

Actually, not a timeout issue. This fails:

https://github.com/edrlab/thorium-reader/blob/48198db7a6d3e982f819f31d84e686a56b29e54f/src/renderer/library/redux/sagas/opds.ts#L136

with:

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>desLibris</ShortName>
  <Description>Search on desLibris</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">https://deslibris.ca/favicon.ico</Image>
  <Url type="application/atom+xml" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>

...but succeeds with:

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Feedbooks</ShortName>
  <Description>Search on Feedbooks</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">http://www.feedbooks.com/favicon.ico</Image>
  <Url type="text/html" template="https://catalog.feedbooks.com/search.html?query={searchTerms}"/>
  <Url type="application/atom+xml" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>

danielweck commented 3 years ago

I ran an Electron Fiddle ( https://www.electronjs.org/fiddle ) with this renderer process code:

const xmlSrc1 = `<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>desLibris</ShortName>
  <Description>Search on desLibris</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">https://deslibris.ca/favicon.ico</Image>
  <Url type="application/atom+xml" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://api.deslibris.ca/api/feed/search/{searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>`;

const xmlSrc2 = `<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Feedbooks</ShortName>
  <Description>Search on Feedbooks</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <OutputEncoding>UTF-8</OutputEncoding>
  <Image type="image/x-icon" width="16" height="16">http://www.feedbooks.com/favicon.ico</Image>
  <Url type="text/html" template="https://catalog.feedbooks.com/search.html?query={searchTerms}"/>
  <Url type="application/atom+xml" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Url type="application/atom+xml;profile=opds-catalog;kind=acquisition" template="https://catalog.feedbooks.com/search.atom?query={searchTerms}"/>
  <Query role="example" searchTerms="robot" />
</OpenSearchDescription>`;

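// Parse each description document and list the template of every <Url> element.
// JSON.stringify on a NodeList only prints empty objects, so map the nodes
// to their "template" attribute before logging.
const xmlDom1 = (new DOMParser()).parseFromString(xmlSrc1, "application/xml");
console.log(xmlDom1);
const urls1 = Array.from(xmlDom1.documentElement.querySelectorAll("Url"));
console.log(JSON.stringify(urls1.map((url) => url.getAttribute("template")), null, 4));

const xmlDom2 = (new DOMParser()).parseFromString(xmlSrc2, "application/xml");
console.log(xmlDom2);
const urls2 = Array.from(xmlDom2.documentElement.querySelectorAll("Url"));
console.log(JSON.stringify(urls2.map((url) => url.getAttribute("template")), null, 4));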

...and everything works fine.

?!

danielweck commented 3 years ago

Ah, got it! (classic silent XML parsing error with DOMParser)

error on line 1 at column 6: XML declaration allowed only at the start of the document

I think this is a UTF-8 BOM issue, or bad encoding.
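
The error is silent because DOMParser does not throw on malformed XML: it returns a document that contains a `parsererror` element instead. A minimal check can be sketched as follows. `hasXmlParseError` and `XmlDocumentLike` are hypothetical names, and the structural interface is only there so the sketch does not depend on DOM typings.

```typescript
// Hypothetical helper for illustration only (not Thorium code).

// Minimal structural type satisfied by a DOM Document.
interface XmlDocumentLike {
    getElementsByTagName(name: string): { length: number };
}

// Returns true when the parsed document carries a <parsererror> element,
// which is how DOMParser reports malformed XML.
function hasXmlParseError(doc: XmlDocumentLike): boolean {
    return doc.getElementsByTagName("parsererror").length > 0;
}

// Intended usage in a renderer process (assumes a DOM environment):
//   const doc = new DOMParser().parseFromString(xmlText, "application/xml");
//   if (hasXmlParseError(doc)) { /* log or reject instead of querying <Url> */ }
```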

danielweck commented 3 years ago

YES :(

Buffer.from(searchRaw).toString("hex")

=>

3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e0a3c4f70656e5365617263684465736372697074696f6e20786d6

...but the desLibris response has an efbbbf prefix (the UTF-8 byte order mark), whereas the Feedbooks response does not.

https://en.wikipedia.org/wiki/Byte_order_mark
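
One way to guard against this, sketched here as an assumption rather than as the actual fix, is to drop the BOM before handing the text to the XML parser. The helper names below are hypothetical. In raw bytes the UTF-8 BOM is `ef bb bf`; once decoded into a JavaScript string it appears as the single character U+FEFF.

```typescript
// Hypothetical helpers for illustration only (not the actual fix).

const UTF8_BOM = [0xef, 0xbb, 0xbf];

// True when the byte sequence starts with the UTF-8 byte order mark.
function hasUtf8Bom(bytes: Uint8Array): boolean {
    return bytes.length >= 3 && UTF8_BOM.every((b, i) => bytes[i] === b);
}

// Returns a view without the leading BOM, or the input unchanged.
function stripUtf8Bom(bytes: Uint8Array): Uint8Array {
    return hasUtf8Bom(bytes) ? bytes.subarray(3) : bytes;
}

// Same idea for an already-decoded string, where the BOM is U+FEFF.
function stripBomFromString(text: string): string {
    return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}
```

After stripping, the XML declaration is back at offset 0, which is what the parser error above complains about.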

danielweck commented 3 years ago

I'm fixing this now.

danielweck commented 3 years ago

The main feed also has a BOM, but we use xmldom to parse in the main process, not DOMParser (Chromium), so that's fine.

curl -s https://api.deslibris.ca/api/feed | hexdump | head

=>

0000000 ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e

curl -s https://api.deslibris.ca/opensearch-feed.xml | hexdump | head

=>

0000000 ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e

danielweck commented 3 years ago

Compare with Feedbooks:

curl -s https://catalog.feedbooks.com/catalog/index.atom | hexdump | head

=>

0000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31

curl -s https://catalog.feedbooks.com/opensearch.xml | hexdump | head

=>

0000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31

danielweck commented 3 years ago

Will be fixed by https://github.com/edrlab/thorium-reader/pull/1385