darioragusa / JW-Library-macOS

JW Library for macOS
MIT License

Idea #1

Open MrCyjaneK opened 3 years ago

MrCyjaneK commented 3 years ago

https://github.com/Miaosi001/JW-Library-macOS/blob/3f39b4a386ba00f52432607a745d7f0b9dcb9db1/JWLibrary/Utility/JWPubManager.swift#L45-51

Here you can get latest catalog version: https://app.jw-cdn.org/catalogs/publications/v4/manifest.json

and with that version go to https://app.jw-cdn.org/catalogs/publications/v4/ + version + /catalog.db.gz
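The two addresses above can be put together with a small helper. This is only a sketch in Go: the function names are made up for illustration, and because the thread does not say which field of manifest.json holds the version, the version is taken as a plain parameter here.

```go
package main

import "fmt"

// catalogBase is the base address quoted in the comment above.
const catalogBase = "https://app.jw-cdn.org/catalogs/publications/v4/"

// manifestURL returns the address of the manifest that carries the
// latest catalog version.
func manifestURL() string {
	return catalogBase + "manifest.json"
}

// catalogURL builds the download address for one catalog version,
// following the "base + version + /catalog.db.gz" pattern.
func catalogURL(version string) string {
	return catalogBase + version + "/catalog.db.gz"
}

func main() {
	// "VERSION" is a placeholder, not a real catalog version.
	fmt.Println(manifestURL())
	fmt.Println(catalogURL("VERSION"))
}
```

The downloaded file is gzip-compressed, so it would still need to be decompressed before it can be opened as a database.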

Also, I've seen JWPUB manager.. have you figured out how to read them? (I'm not into swift, and I don't even own an :apple: device)..

I spent a couple of days trying to get content out of the jwpub, with no results. https://github.com/MrCyjaneK/jwapi/issues/1

darioragusa commented 3 years ago

Hi, thanks for the help.

I have not yet managed to find a way to read JWPub files. Even if I change the words in the db they don't change in the app. I noticed that if I change a word it does not change in the text but only in the text search function. I have no clue how it works.

MrCyjaneK commented 3 years ago

ah well. So I'm back to reverse engineering it again..

darioragusa commented 3 years ago

Me too. I started again this morning. This time I noticed some things that I didn't notice before.

Now I arrived here: image

UPDATE I think I could manage to extract the text. But only the text... I can't find any info regarding punctuation and size.

MrCyjaneK commented 3 years ago

OH! If you can get the text that would be a huiegrnfdjksxfnwliekfncl (random letters of joy) help for me!

MrCyjaneK commented 3 years ago

I'm soo happy that somebody figured out how to use it :D <3

darioragusa commented 3 years ago

Well yes, but I keep believing that it is not the correct way to do it

MrCyjaneK commented 3 years ago

I'm actually out of luck and I have no idea what to do now. So I just hope that you will figure out some method for it.

darioragusa commented 3 years ago

I think everything is in the 'Content' field of the 'Document' table, but it is somehow encrypted and we cannot read it. That leaves two options: 1) Extract the text from the search table and make up the capitalization and punctuation; 2) Scrape the text from the site.

And I would say that the first is not really the best. Even the second is not the best but I think it's the only way we have.

MrCyjaneK commented 3 years ago

How would you do the 1st thing? I believe that it is in fact stored in that table, and that it is the correct way to extract the content, and with the leftover bytes we could possibly figure out which ones stand for images/punctuation/etc

darioragusa commented 3 years ago

Starting from the 'TextUnitIndices' column, I take all the words whose value starts with '80' (hex). Then, in the 'PositionalList' column, the first word is the one that starts with '80', the next is the one with '81', and so on. After 'FF' it starts again with '00 81', '01 81', etc. After 'FF 81' I think there is '00 82' (but I have not tried yet). In the 'PositionalListIndex' column the first value (e.g. '85') indicates that the word is present 5 times, so I could keep subtracting from that value until it reaches '80' and then stop checking that word for the current document (maybe I remove the '80' at the end of the cycle). Finally, in the 'TextUnitIndices' column I remove the '80' and, for all the rows, decrease the next value by 1 (e.g. '81' -> '80', '85' -> '84', etc.). Then I start the loop again for the next document.
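The core of the procedure above is placing each word at the positions recorded for it inside one document. The Go sketch below shows only that step, assuming the positions have already been decoded into integers; `rebuild` and `occurrences` are hypothetical names, and reading a count byte as "value minus 0x80" is an interpretation of the '85' means 5 times remark, not something confirmed in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// occurrences interprets a count byte as described above:
// 0x85 means the word is present 5 times.
// Assumption: the count is simply the byte value minus 0x80.
func occurrences(b byte) int {
	return int(b) - 0x80
}

// rebuild places every word at each position recorded for it
// within a single document and joins the result with spaces.
// Negative positions are ignored.
func rebuild(positions map[string][]int) string {
	size := 0
	for _, ps := range positions {
		for _, p := range ps {
			if p >= size {
				size = p + 1
			}
		}
	}
	slots := make([]string, size)
	for word, ps := range positions {
		for _, p := range ps {
			if p >= 0 {
				slots[p] = word
			}
		}
	}
	// Skip positions that no word claimed.
	out := make([]string, 0, size)
	for _, s := range slots {
		if s != "" {
			out = append(out, s)
		}
	}
	return strings.Join(out, " ")
}

func main() {
	doc := map[string][]int{
		"study":   {0},
		"edition": {1},
		"the":     {2, 4},
		"of":      {3},
	}
	fmt.Println(rebuild(doc)) // prints "study edition the of the"
	fmt.Println(occurrences(0x85))
}
```

As in the thread, this recovers only lowercase words in order; punctuation and capitalization are not stored in these columns.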

MrCyjaneK commented 3 years ago

Ooookay! I'll try to write a parser for that! Thanks for help :D

darioragusa commented 3 years ago

You're welcome. Let me know if you find a way to get the punctuation

MrCyjaneK commented 3 years ago

I will!

MrCyjaneK commented 3 years ago

image

Words like "god", which are probably on every page except the table of contents, have a TextUnitIndices row of length 18 (in a publication with 18 topics plus a table of contents, which I believe is different from the rest).

I'm actually curious what those lss, lsr, sqr etc.. are, maybe formatting?

MrCyjaneK commented 3 years ago

image

yay! I think we have a bit of it here :partying_face:

MrCyjaneK commented 3 years ago

image

Sounds right, at least at the beginning

MrCyjaneK commented 3 years ago

After 'FF' it starts again with '00 81', '01 81', etc. After 'FF 81' I think there is '00 82' (but I have not tried yet).

right. That's why it's wrong

MrCyjaneK commented 3 years ago

So - I'm out of luck today - It still produces wrong output, I'll start from scratch in the morning

MrCyjaneK commented 3 years ago

UH.. do you have some example implementation..?

darioragusa commented 3 years ago

Nope, I'll start making one tomorrow

darioragusa commented 3 years ago

I said tomorrow, but here it is 1:10 am so technically it is the next day. Anyway, I believe it does not work: it returns repeated words, with and without accents

MrCyjaneK commented 3 years ago

image

Sounds right, at least at the beginning

Here it returned the correct thing for the first page, and then it just skipped the words that had already been used. I believe that it is the correct way to go.

darioragusa commented 3 years ago

After 'FF 81'

I was wrong, for some reason it stops at '7F 81' and restarts from '00 82'
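The corrected sequence ('80' to 'FF', then '00 81' to '7F 81', then '00 82') is consistent with a variable-length integer in which a byte with the high bit set ends a value and the 7-bit groups are stored lowest first. The Go sketch below decodes positions under that assumption; `decodePositions` is a hypothetical name, only the one-byte and two-byte forms are mentioned in the thread, and anything longer is an extrapolation.

```go
package main

import (
	"errors"
	"fmt"
)

// decodePositions reads a byte stream in which every value ends with a
// byte whose high bit is set. Bytes without the high bit contribute
// earlier (lower) 7-bit groups of the same value.
// Assumption: this matches the '80'..'FF', '00 81'..'7F 81', '00 82'
// sequence described above.
func decodePositions(data []byte) ([]int, error) {
	var out []int
	value, shift := 0, 0
	pending := false
	for _, b := range data {
		value |= int(b&0x7F) << shift
		if b&0x80 != 0 {
			// High bit set: the value is complete.
			out = append(out, value)
			value, shift, pending = 0, 0, false
		} else {
			// High bit clear: more bytes belong to this value.
			shift += 7
			pending = true
		}
	}
	if pending {
		return out, errors.New("truncated value at end of data")
	}
	return out, nil
}

func main() {
	values, err := decodePositions([]byte{0x80, 0xFF, 0x00, 0x81, 0x7F, 0x81, 0x00, 0x82})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(values) // prints [0 127 128 255 256]
}
```

Under this reading, 'FF' is position 127, '00 81' is 128, '7F 81' is 255 and '00 82' is 256, which would explain why the counter never reaches 'FF 81'.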

darioragusa commented 3 years ago

I did it! Or at least the first and the last words seem to be in place

darioragusa commented 3 years ago

https://user-images.githubusercontent.com/46404000/129208516-009c5009-aebe-4d98-8cc9-1f92f4f8a7a8.mp4

🥳🥳🥳

MrCyjaneK commented 3 years ago

YAYYYYYY

MrCyjaneK commented 3 years ago

You are a genius!

MrCyjaneK commented 3 years ago

I'm looking forward to seeing this code :D, you saved my project :D Thank youuu

darioragusa commented 3 years ago

Here it is: JWPubExtractor.swift

MrCyjaneK commented 3 years ago

Thanks!

MrCyjaneK commented 3 years ago

So I've implemented an initial version of JWPUB parser (that doesn't work...), based on your JWPubExtractor.swift

https://github.com/MrCyjaneK/jwapi/blob/jwpub/libjw/jwpub.go

Running this file: https://github.com/MrCyjaneK/jwapi/blob/jwpub/utils/jwpub-test/parse.go should parse the fg_E publication

image

But I guess that I made a mistake when I was rewriting JWPubExtractor.swift in golang

MrCyjaneK commented 3 years ago

image

That's what I'm getting from the same publication as you, but in English

darioragusa commented 3 years ago

I've never used Golang but the code looks ok. Can you check if 'curDocIndex' increases correctly and if the values are removed from 'PositionalList'?

MrCyjaneK commented 3 years ago

I've spotted the issue already.. turns out https://github.com/MrCyjaneK/jwapi/blob/jwpub/libjw/jwpub.go#L190 this function was wrong...

Now I'm getting the first docID correctly:

fullText[docID: 0 ]: study edition october 2021 study articles for december 6 2021 january 2 2022 © 2021 watch tower bible and tract society of pennsylvania this publication is not for sale it is provided as part of a worldwide bible educational work supported by voluntary donations to make a donation please visit donate jw org unless otherwise indicated scripture quotations are from the modern-language new world translation of the holy scriptures cover picture although noah preached faithfully for many years no one joined him in the ark except for his immediate family even so noah was successful in obeying god see study article 43 paragraph 11 image cvr label image lsr label image lss label image sqr label image sqs label image pnr label image pns label

After that the code panics, but I think I'll be able to fix it

MrCyjaneK commented 3 years ago

Now it's acting like a weirdo, lol: it does the first docID correctly and removes items correctly, then it stops extracting text but keeps removing items like it should.

darioragusa commented 3 years ago

Update from the content field: 1) It's not a file, it's HTML text compressed in some way; 2) If I move a field from one row to another it works; 3) If I duplicate the content (copy and paste at the end) it only uses the first copy.

I duplicated the content and removed the last 16 bytes from the first part and the first 16 bytes from the second part. I broke it but still got a good result: image

MrCyjaneK commented 3 years ago

Hm, I can look at the mobile app source and check if it imports some compression library.

MrCyjaneK commented 3 years ago

But binwalk didn't return anything interesting so I guess that it's just something they made up

darioragusa commented 3 years ago

I don't know. It's really strange. The size of the blob is smaller than it should be. And if it's HTML, why are the beginning and the end different for each field? Another strange thing is that the content of one publication doesn't work in another (or at least in the one I tried). If it's something they made up, we can't go any further

MrCyjaneK commented 3 years ago

image

Looks like we may get a copyright strike for what we have done...

source

darioragusa commented 3 years ago

image

darioragusa commented 3 years ago

So we can't use JWPub files (also because we can't read them). What about using EPUB or HTML?

MrCyjaneK commented 3 years ago

I'm already using epub, but they dropped support for it

MrCyjaneK commented 3 years ago

https://www.jw.org/en/library/books/enjoy-life-forever/ for example, this publication is PDF/JWPUB only

MrCyjaneK commented 3 years ago

And I feel it deep inside that wol.jw.org will be replaced with the JW Library app sooner than we expect... And there is no way I'm installing anything that is not open source on any of my devices. SOooo: using jwpub is the only thing we can do...

MrCyjaneK commented 3 years ago

https://pastebin.com/2HrrF43D I FINALLY got it working ( I think lol! ). I just ended up copy-pasting the Swift code and fixing compile errors ;-P

MrCyjaneK commented 3 years ago

So what we can do now is

darioragusa commented 3 years ago

I think Content is encrypted with AES, maybe AES-256

MrCyjaneK commented 3 years ago

:duck: so no luck for us, but if it were AES'd then the content shouldn't change, it should just break

darioragusa commented 3 years ago

It's made of pieces of 16 bytes. You can duplicate these pieces, but if you change one byte it doesn't work
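A quick way to look at a blob in those 16-byte pieces is to split it at the AES block size, which is also 16 bytes. The Go sketch below does only that and decrypts nothing; `splitBlocks` is a hypothetical helper, and whether Content really is AES remains the guess made above.

```go
package main

import (
	"crypto/aes"
	"fmt"
)

// splitBlocks cuts a blob into aes.BlockSize (16-byte) pieces.
// The second result reports whether the length is an exact multiple
// of 16, as would be expected from the output of a 16-byte block cipher.
func splitBlocks(blob []byte) ([][]byte, bool) {
	var blocks [][]byte
	for len(blob) >= aes.BlockSize {
		blocks = append(blocks, blob[:aes.BlockSize])
		blob = blob[aes.BlockSize:]
	}
	// Any leftover bytes mean the blob is not block-aligned.
	return blocks, len(blob) == 0
}

func main() {
	// Placeholder bytes, not real Content data.
	blob := make([]byte, 48)
	blocks, aligned := splitBlocks(blob)
	fmt.Println(len(blocks), aligned) // prints "3 true"
}
```

Checking whether every Content value in a publication is block-aligned would be a cheap first test of the AES guess before trying anything heavier.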