edrlab / thorium-reader

A cross platform desktop reading app, based on the Readium Desktop toolkit
https://www.edrlab.org/software/thorium-reader/
BSD 3-Clause "New" or "Revised" License

Crash when importing a book from bnf.fr #743

panaC closed this issue 4 years ago

panaC commented 4 years ago

Feed: bnf.fr (Gallica), "http://gallica.bnf.fr/opds"

Click on "Dernière mise en ligne" (latest uploads):

  readium-desktop:main#services/catalog [START] Download publication http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub +0ms
  readium-desktop:main#services/downloader Error while downloading resource { identifier: '3179cf36-3e7f-4174-a6e5-a9653ec12614',
  srcUrl: 'http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub',
  dstPath:
   '/var/folders/yx/scz0d3rj3h3c597mz07pnl_w0000gp/T/readium-desktop-68735fllu64orVQix.epub.part',
  progress: 0,
  downloadedSize: 0,
  status: 3 } 500 +0ms
  readium-desktop:main#redux/sagas/api Error while downloading resource +846ms

danielweck commented 4 years ago

Source URL: http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub
Fetched filename: Contes_fantastiques_(3e_éd_)_[...]Erckmann-Chatrian_Auteur_bpt6k65669461.epub (1.2 MB)

I can reproduce the fatal error:

Uncaught Exception:
Error [ERR_STREAM_WRITE_AFTER_END]: write after end
    at writeAfterEnd (_stream_writable.js:243:12)
    at WriteStream.Writable.write (_stream_writable.js:291:5)
    at IncomingMessage.response.on (/Volumes/500GB/Code/readium-desktop/dist/main.js:5476:38)
    at IncomingMessage.emit (events.js:187:15)
    at addChunk (_stream_readable.js:279:12)
    at readableAddChunk (_stream_readable.js:264:11)
    at IncomingMessage.Readable.push (_stream_readable.js:219:10)
    at HTTPParser.parserOnBody (_http_common.js:124:22)
    at TLSSocket.socketOnData (_http_client.js:432:20)
    at TLSSocket.emit (events.js:182:13)
    at addChunk (_stream_readable.js:279:12)
    at readableAddChunk (_stream_readable.js:264:11)
    at TLSSocket.Readable.push (_stream_readable.js:219:10)
    at TLSWrap.onread (net.js:636:20)
[2]   readium-desktop:sync ### action type API_REQUEST +1ms
[2]   readium-desktop:main#redux/sagas/api publication importOpdsEntry [ null,
[2]   'eyJNZXRhZGF0YSI6eyJUaXRsZSI6IkNvbnRlcyBmYW50YXN0aXF1ZXMgKDNlIMOpZC4pIC8gcGFyIEVyY2ttYW5uLUNoYXRyaWFuLi4uIiwiSWRlbnRpZmllciI6Imh0dHA6Ly9nYWxsaWNhLmJuZi5mci9hcms6LzEyMTQ4L2JwdDZrNjU2Njk0NjEiLCJNb2RpZmllZCI6IjIwMTktMDMtMDNUMDA6MDA6MDAuMDAwWiIsIlB1Ymxpc2hlciI6W3siTmFtZSI6InB1Ymxpc2hlciJ9XSwiQXV0aG9yIjpbeyJOYW1lIjoiRXJja21hbm4tQ2hhdHJpYW4gIiwiSWRlbnRpZmllciI6Imh0dHBzOi8vZ2FsbGljYS5ibmYuZnIvb3Bkcz9xdWVyeT1kYy5jcmVhdG9yIGFsbCBcIkVyY2ttYW5uLUNoYXRyaWFuIFwiIn1dLCJEZXNjcmlwdGlvbiI6IkNvbnRpZW50IHVuZSB0YWJsZSBkZXMgbWF0acOocmVzQXZlYyBtb2RlIHRleHRlIn0sIkxpbmtzIjpbeyJIcmVmIjoiaHR0cDovL2dhbGxpY2EuYm5mLmZyL2FyazovMTIxNDgvYnB0Nms2NTY2OTQ2MSIsIlR5cGVMaW5rIjoidGV4dC9odG1sIiwiUmVsIjpbImFsdGVybmF0ZSJdLCJUaXRsZSI6IlZvaXIgc3VyIEdhbGxpY2EifSx7IkhyZWYiOiJodHRwOi8vZ2FsbGljYS5ibmYuZnIvYXJrOi8xMjE0OC9icHQ2azY1NjY5NDYxLmVwdWIiLCJUeXBlTGluayI6ImFwcGxpY2F0aW9uL2VwdWIremlwIiwiUmVsIjpbImh0dHA6Ly9vcGRzLXNwZWMub3JnL2FjcXVpc2l0aW9uIl0sIlRpdGxlIjoiT3V2cmlyIGxlIGxpdnJlIn1dLCJJbWFnZXMiOlt7IkhyZWYiOiJodHRwOi8vZ2FsbGljYS5ibmYuZnIvYXJrOi8xMjE0OC9icHQ2azY1NjY5NDYxLnRodW1ibmFpbCIsIlR5cGVMaW5rIjoiaW1hZ2UvanBlZyIsIlJlbCI6WyJodHRwOi8vb3Bkcy1zcGVjLm9yZy9pbWFnZS90aHVtYm5haWwiXX0seyJIcmVmIjoiaHR0cDovL2dhbGxpY2EuYm5mLmZyL2FyazovMTIxNDgvYnB0Nms2NTY2OTQ2MS5oaWdocmVzIiwiVHlwZUxpbmsiOiJpbWFnZS9qcGVnIiwiUmVsIjpbImh0dHA6Ly9vcGRzLXNwZWMub3JnL2ltYWdlIl19XX0=',
[2]   'Contes fantastiques (3e éd.) / par Erckmann-Chatrian...',
[2]   null ] +19ms
[2]   readium-desktop:sync ### action type DOWNLOAD_REQUEST +0ms
[2]   readium-desktop:sync ### action type TOAST_OPEN_REQUEST +1ms
[2]   readium-desktop:main#services/catalog [START] Download publication http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub +0ms
[2]   readium-desktop:main#services/downloader Error while downloading resource { identifier: 'f8319469-996c-4d19-8619-70e992eab631',
[2]   srcUrl: 'http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub',
[2]   dstPath:
[2]    '/var/folders/f4/bs_1cm7565jdq6tzs0n01h_40000gn/T/readium-desktop-166112MHtU6bwhYnm.epub.part',
[2]   progress: 0,
[2]   downloadedSize: 0,
[2]   status: 3 } 500 +0ms
[2]   readium-desktop:main#redux/sagas/api Error while downloading resource +528ms
[2]   readium-desktop:sync ### action type API_ERROR +527ms
[2]   readium-desktop:renderer:bookshelf_ Error to fetch api publication/importOpdsEntry undefined +8s

danielweck commented 4 years ago

We get HTTP status code 500 when downloading from Thorium, but the same URL works from a web browser. So I suspect our HTTP request headers are missing something that the Gallica server expects (e.g. User-Agent, Origin, etc.). See https://github.com/readium/readium-desktop/blob/e65e2a5ce848aae05119148d85592989a553a078/src/main/services/downloader.ts#L93-L103

Note that I think there is also a bug in the if (response.statusCode < 200 || response.statusCode > 299) conditional statement: the return is missing! Without it, execution falls through and the response.on("data") handler still runs, which would explain the "write after end" exception in the stack trace above.
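To make the missing-return problem concrete, here is a minimal sketch of the control flow. It is not the actual downloader.ts code: pipeResponseToFile, makeFakeResponse, and makeRecordingSink are hypothetical names, and it assumes the error branch closes the destination stream. The fake response and sink stand in for the real HTTP response and file stream so the sketch runs without network access.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper (not the real downloader.ts): copies an HTTP response
// body into a writable destination, but only for 2xx responses.
function pipeResponseToFile(response, destination) {
  if (response.statusCode < 200 || response.statusCode > 299) {
    // Assumption: the error path closes the destination stream.
    destination.end();
    // The early return is the fix: without it, the handlers below would be
    // registered anyway and the first "data" chunk would hit an ended stream.
    return false;
  }
  response.on("data", (chunk) => destination.write(chunk));
  response.on("end", () => destination.end());
  return true;
}

// Stand-in for an HTTP response: an event emitter with a status code.
function makeFakeResponse(statusCode) {
  const response = new EventEmitter();
  response.statusCode = statusCode;
  return response;
}

// Stand-in for a file stream: records chunks and rejects writes after end().
function makeRecordingSink() {
  return {
    chunks: [],
    ended: false,
    write(chunk) {
      if (this.ended) throw new Error("write after end");
      this.chunks.push(chunk);
    },
    end() {
      this.ended = true;
    },
  };
}

const failed = makeFakeResponse(500);
const sink = makeRecordingSink();
pipeResponseToFile(failed, sink);
failed.emit("data", "error page body"); // ignored: no "data" listener was attached
console.log(sink.chunks.length, sink.ended); // prints: 0 true
```

If the return statement is deleted from the error branch, the same emit call reaches the sink after end() and throws "write after end", which mirrors the crash reported here.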

danielweck commented 4 years ago

curl -s -L -I -X GET http://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub

HTTP/1.1 301 Moved Permanently
Date: Fri, 27 Sep 2019 22:03:04 GMT
Server: Apache
Location: https://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub
Content-Length: 260
Content-Type: text/html; charset=iso-8859-1

HTTP/2 200 
date: Fri, 27 Sep 2019 22:03:05 GMT
server: Apache
set-cookie: JSESSIONID=1B35E5BA6697AFE3EE2EB8C408A83545; Path=/; Secure; HttpOnly
content-disposition: inline;filename="Contes_fantastiques_(3e_?d_)_[...]Erckmann-Chatrian_Auteur_bpt6k65669461.epub"
content-type: application/epub+zip;charset=UTF-8
content-language: fr-FR
content-length: 1211216
vary: Accept-Encoding,User-Agent

danielweck commented 4 years ago

HTTP2: curl -s -L -v https://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub

*   Trying 194.199.8.11...
* TCP_NODELAY set
* Connected to gallica.bnf.fr (194.199.8.11) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=FR; ST=Ile de France; L=Paris; O=Bibliotheque Nationale de France; CN=*.bnf.fr
*  start date: May 28 00:00:00 2018 GMT
*  expire date: Jun  1 12:00:00 2020 GMT
*  subjectAltName: host "gallica.bnf.fr" matched cert's "*.bnf.fr"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fd20a006600)
> GET /ark:/12148/bpt6k65669461.epub HTTP/2
> Host: gallica.bnf.fr
> User-Agent: curl/7.54.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200 
< date: Fri, 27 Sep 2019 22:04:34 GMT
< server: Apache
< set-cookie: JSESSIONID=B1DF9D367BC95DAD2373C307E98A60F6; Path=/; Secure; HttpOnly
< content-disposition: inline;filename="Contes_fantastiques_(3e_?d_)_[...]Erckmann-Chatrian_Auteur_bpt6k65669461.epub"
< content-type: application/epub+zip;charset=UTF-8
< content-language: fr-FR
< content-length: 1211216
< vary: Accept-Encoding,User-Agent

danielweck commented 4 years ago

Minimal repro test case: https://repl.it/languages/nodejs with:

const request = require("request");
const requestStream = request.get("https://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub", { timeout: 5000 });
requestStream.on("error", (error) => { console.log(error); });
requestStream.on("response", (response) => { console.log(response.statusCode); });

=> 500

danielweck commented 4 years ago

As I suspected, the missing User-Agent HTTP header is the cause. Repro: https://repl.it/languages/nodejs

const request = require("request");
const requestStream = request.get("https://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub", {
  timeout: 5000,
  headers: {
    'User-Agent': 'Thorium'
  }
});
// requestStream.on("error", (error) => { console.log(error); });
requestStream.on("response", (response) => { console.log(response.statusCode); });

=> 200
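
For reference, the same idea can be sketched with Node's built-in https module, which, as far as I know, also sends no User-Agent header unless one is set explicitly. This is only a sketch under assumptions: withUserAgent, fetchStatusCode, and DEFAULT_USER_AGENT are hypothetical names, and "Thorium" is simply the string used in the repro above, not a confirmed production value.

```javascript
const https = require("https");

// Hypothetical default; the repro above only used the literal string "Thorium".
const DEFAULT_USER_AGENT = "Thorium";

// Returns a copy of `options` whose headers include a User-Agent.
// A User-Agent already set by the caller is kept (header names are
// case-insensitive, so "user-agent" counts too).
function withUserAgent(options = {}, userAgent = DEFAULT_USER_AGENT) {
  const headers = { ...(options.headers || {}) };
  const hasUserAgent = Object.keys(headers).some(
    (name) => name.toLowerCase() === "user-agent"
  );
  if (!hasUserAgent) {
    headers["User-Agent"] = userAgent;
  }
  return { ...options, headers };
}

// Hypothetical helper: resolves with the HTTP status code of a GET request.
// It is defined but not called here, so loading this file makes no request.
function fetchStatusCode(url, options = {}) {
  return new Promise((resolve, reject) => {
    const req = https.get(url, withUserAgent(options), (response) => {
      response.resume(); // discard the body; only the status code matters here
      resolve(response.statusCode);
    });
    req.on("error", reject);
  });
}

console.log(withUserAgent({ timeout: 5000 }));
```

Calling fetchStatusCode("https://gallica.bnf.fr/ark:/12148/bpt6k65669461.epub", { timeout: 5000 }).then(console.log) would be the equivalent of the repro above. Merging the header in one helper keeps every download request consistent instead of relying on each call site to remember it.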