Open MrCyjaneK opened 3 years ago
Hi, thanks for the help.
I have not yet managed to find a way to read JWPub files. Even if I change the words in the db they don't change in the app. I noticed that if I change a word it does not change in the text but only in the text search function. I have no clue how it works.
ah well. So I'm back to reverse engineering it again..
Me too. I started again this morning. This time I noticed some things that I didn't notice before.
Now I arrived here:
UPDATE I think I could manage to extract the text. But only the text... I can't find any info regarding punctuation and size.
OH! If you can get the text that would be a huiegrnfdjksxfnwliekfncl (random letters of joy) help for me!
I'm soo happy that somebody figured out how to use it :D <3
Well yes, but I keep believing that it is not the correct way to do it
I'm actually out of luck and I have no idea what to do now, so I just hope that you will figure out some method for it.
I think everything is in the 'Content' field of the 'Document' table, but it is somehow encrypted and we cannot read it. That leaves two options: 1) extract the text from the search table and make up the uppercase and punctuation; 2) scrape the text from the site.
And I would say that the first is not really the best. Even the second is not the best but I think it's the only way we have.
How would you do the 1st thing? I believe that it is in fact stored in that table, and it is the correct way to extract the content, and with the leftover bytes we could possibly figure out which stand for images/punctuation/etc.
Starting from the 'TextUnitIndices' column, I take all the words whose value starts with '80' (hex). Then, in the 'PositionalList' column, the first word of the document is the one that starts with '80', the next is the one with '81', and so on. After 'FF' it starts again with '00 81', '01 81', etc. After 'FF 81' I think comes '00 82' (but I have not tried that yet). In the 'PositionalListIndex' column, the first value (e.g. '85') indicates that the word occurs 5 times, so I can decrement that value until it reaches '80' and then stop checking that word for the current document (maybe I remove the '80' at the end of the cycle). Finally, in the 'TextUnitIndices' column I remove the '80' and, for all rows, decrease the next value by 1 (e.g. '81' -> '80', '85' -> '84', etc.). Then I start the loop again for the next document.
Ooookay! I'll try to write a parser for that! Thanks for help :D
You're welcome. Let me know if you find a way to get the punctuation
I will!
Words like 'god', which are probably on every page except for the table of contents, have a 'TextUnitIndices' row length equal to 18 (in a publication with 18 topics + table of contents, which I believe is different from the rest).
I'm actually curious what those 'lss', 'lsr', 'sqr', etc. are, maybe formatting?
yay! I think we have a bit of it here :partying_face:
Sounds right, at least at the beginning
Reached 'FF' starts again with '00 81 ', '01 81', etc... After 'FF 81' I think there is '00 82' (but I have not yet tried).
right. That's why it's wrong
So - I'm out of luck today - it still produces wrong output, I'll start from scratch in the morning
UH.. do you have some example implementation..?
Nope, I'll start making one tomorrow
I said tomorrow, here it is 1:10 am so technically it is the next day. Anyway I believe it does not work, it returns repeated words with and without accent
Sounds right, at least at the beginning
Here it returned the correct thing for the first page, and then it just skipped the words that were already used. I believe that it is the correct way to go.
After 'FF 81'
I was wrong, for some reason it stops at '7F 81' and restarts from '00 82'
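The sequence observed here ('80' to 'FF', then '00 81' to '7F 81', then '00 82') is consistent with a little-endian base-128 integer in which the high bit marks the last byte of a value. The Go sketch below decodes bytes under that assumption. It is inferred only from the byte patterns reported in this thread, not from any format documentation, and `decodeVarint` and `decodeAll` are hypothetical names.

```go
package main

import "errors"

// errTruncated is returned when the input ends before a terminating byte
// (one with the high bit set) has been seen.
var errTruncated = errors.New("truncated value")

// decodeVarint reads one integer from the start of buf and returns the value
// and the number of bytes consumed.
//
// ASSUMPTION (inferred from the thread, not from a spec): every byte carries
// 7 payload bits, least-significant group first, and the byte with the high
// bit (0x80) set is the LAST byte of the value.
func decodeVarint(buf []byte) (value int, n int, err error) {
	shift := 0
	for i, b := range buf {
		value |= int(b&0x7F) << shift
		if b&0x80 != 0 {
			return value, i + 1, nil
		}
		shift += 7
	}
	return 0, 0, errTruncated
}

// decodeAll decodes a whole blob into a list of integers by calling
// decodeVarint repeatedly until the input is exhausted.
func decodeAll(buf []byte) ([]int, error) {
	var out []int
	for len(buf) > 0 {
		v, n, err := decodeVarint(buf)
		if err != nil {
			return nil, err
		}
		out = append(out, v)
		buf = buf[n:]
	}
	return out, nil
}
```

Under this reading '80' is 0, 'FF' is 127, '00 81' is 128, '7F 81' is 255 and '00 82' is 256, which matches the wrap-around seen above. A count byte such as '85' in 'PositionalListIndex' would decode to 5.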
I did it! Or at least the first and the last words seem in place
YAYYYYYY
You are a genius!
I'm looking forward to seeing this code :D, you saved my project :D Thank youuu
Here it is: JWPubExtractor.swift
Thanks!
So I've implemented an initial version of JWPUB parser (that doesn't work...), based on your JWPubExtractor.swift
https://github.com/MrCyjaneK/jwapi/blob/jwpub/libjw/jwpub.go
Running this file: https://github.com/MrCyjaneK/jwapi/blob/jwpub/utils/jwpub-test/parse.go should parse the fg_E publication
But I guess that I've made some mistake when I was rewriting JWPubExtractor.swift in golang
that's what I'm getting from the same publication as you, but in english
I've never used Golang but the code looks ok. Can you check if 'curDocIndex' increases correctly and if the values are removed from 'PositionalList'?
I've spotted the issue already.. turns out https://github.com/MrCyjaneK/jwapi/blob/jwpub/libjw/jwpub.go#L190 this function was wrong...
Now I'm getting first docID correctly:
fullText[docID: 0 ]: study edition october 2021 study articles for december 6 2021 january 2 2022 © 2021 watch tower bible and tract society of pennsylvania this publication is not for sale it is provided as part of a worldwide bible educational work supported by voluntary donations to make a donation please visit donate jw org unless otherwise indicated scripture quotations are from the modern-language new world translation of the holy scriptures cover picture although noah preached faithfully for many years no one joined him in the ark except for his immediate family even so noah was successful in obeying god see study article 43 paragraph 11 image cvr label image lsr label image lss label image sqr label image sqs label image pnr label image pns label
After that the code panics, but I think I'll be able to fix it
Now it's acting like a weirdo, lol: it does the first docID correctly and removes items correctly, then it won't extract text, but it will still remove items like it should.
Update from the content field: 1) It's not a file, it's HTML text compressed somehow 2) If I move a field from one row to another it works 3) If I duplicate the content (copy - paste at the end) it takes only the first copy
I duplicated the content and removed the last 16 bytes from the first part and the first 16 bytes from the second part. I broke it but still got a good result:
Hm, I can look at the mobile app source and check if it imports some compression library.
But binwalk didn't return anything interesting so I guess that it's just something they made up
I don't know. It's really strange. The size of the blob is less than what it should be. And if it's HTML, why are the beginning and the end different for each field? Another strange thing is that the content of one publication doesn't work in another (or at least not in the one I tried). If it's something they made up we can't go further
Looks like we may get a copyright strike for what we have done...
So we can't use JWPub files (also because we can't read them). What about using EPUB or HTML?
I'm already using epub, but they dropped support for it
https://www.jw.org/en/library/books/enjoy-life-forever/ for example this publication it's PDF/JWPUB only
And I feel it deep inside that wol.jw.org will be replaced with JW Library app sooner than we expect... And there is no way I'm installing anything that is not open source on any of my devices. SOooo: using jwpub is the only thing we can do...
https://pastebin.com/2HrrF43D I FINALLY got it working ( I think lol! ). I just ended with copy-pasting swift code and fixing compile errors ;-P
So what we can do now is
Content
I think 'Content' is encrypted with AES, maybe AES-256
:duck: so no luck for us, but if it would be aes'd then the content shouldn't change, it should just break
It's made of pieces of 16 bytes. You can duplicate these pieces but if you change one byte it doesn't work
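A small Go helper can make experiments on the 16-byte pieces easier to repeat. The sketch below only splits a blob into blocks of `aes.BlockSize` (16 bytes) and counts blocks that repeat. Whether 'Content' is really AES, and in which mode, is not established in this thread. `splitBlocks` and `repeatedBlocks` are hypothetical names, and counting repeats is just a generic heuristic for spotting block-wise independence.

```go
package main

import "crypto/aes"

// splitBlocks cuts blob into aes.BlockSize (16-byte) pieces.
// ok is false when the length is not a whole number of blocks, which would
// argue against the "pieces of 16 bytes" observation for that blob.
func splitBlocks(blob []byte) (blocks [][]byte, ok bool) {
	if len(blob)%aes.BlockSize != 0 {
		return nil, false
	}
	for i := 0; i < len(blob); i += aes.BlockSize {
		blocks = append(blocks, blob[i:i+aes.BlockSize])
	}
	return blocks, true
}

// repeatedBlocks counts blocks that are byte-for-byte equal to an earlier
// block. ASSUMPTION: this is only a heuristic; repeats would hint that
// blocks are processed independently, but they prove nothing on their own.
func repeatedBlocks(blocks [][]byte) int {
	seen := make(map[string]bool)
	repeats := 0
	for _, b := range blocks {
		key := string(b)
		if seen[key] {
			repeats++
		}
		seen[key] = true
	}
	return repeats
}
```

Running `splitBlocks` on the raw 'Content' bytes of a few rows would at least confirm whether every blob length is a multiple of 16 before trying anything else.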
https://github.com/Miaosi001/JW-Library-macOS/blob/3f39b4a386ba00f52432607a745d7f0b9dcb9db1/JWLibrary/Utility/JWPubManager.swift#L45-51
Here you can get latest catalog version: https://app.jw-cdn.org/catalogs/publications/v4/manifest.json
and with that version go to
https://app.jw-cdn.org/catalogs/publications/v4/ + version + /catalog.db.gz
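The two URLs above can be put together as in the Go sketch below. The function and constant names are hypothetical. How the version is stored inside manifest.json is not stated in this thread, so the sketch takes the version as a plain string and leaves reading the manifest to the caller. The `gunzip` helper only assumes that catalog.db.gz is an ordinary gzip stream, as the file extension suggests.

```go
package main

import (
	"compress/gzip"
	"io"
)

// catalogBase is the common prefix of both URLs quoted in the thread.
const catalogBase = "https://app.jw-cdn.org/catalogs/publications/v4/"

// manifestURL is where the latest catalog version can be looked up.
const manifestURL = catalogBase + "manifest.json"

// catalogURL builds the download URL for a given catalog version:
// base + version + "/catalog.db.gz".
func catalogURL(version string) string {
	return catalogBase + version + "/catalog.db.gz"
}

// gunzip decompresses a gzip stream, for example the response body of the
// catalog download. ASSUMPTION: the file is plain gzip, as the .gz suffix
// suggests; the thread does not confirm this.
func gunzip(r io.Reader) ([]byte, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

The caller would fetch `manifestURL`, pull the version out of the response, pass it to `catalogURL`, and feed the downloaded body to `gunzip` to get the database bytes.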
Also, I've seen JWPUB manager.. have you figured out how to read them? (I'm not into swift, and I don't even own an :apple: device)..
It took me a couple of days with no results to get content out of the jwpub. https://github.com/MrCyjaneK/jwapi/issues/1