felipegiacomozzi / the-trove-downloader

Downloads files from The Trove Rpg website

Some info on what it looks like is happening with last two tickets (object reference and my ticket) #21

Open turnerjoy opened 3 years ago

turnerjoy commented 3 years ago

It looks like they are now using some kind of linking or "commands", probably to reduce the amount of duplication.

For example, in this folder:

https://thetrove.is/Books//%20Collections//Collaborative%20%26%20Peer%20%26%20Gm-less%20%26%20Shifting%20GM/

The Archipelago folder link appears to go to:

https://thetrove.is/Books//%20Collections//Collaborative%20%26%20Peer%20%26%20Gm-less%20%26%20Shifting%20GM/Archipelago/

But it actually goes to https://thetrove.is/Books/Archipelago/

There also seem to be some straight-up bad links, like:

in https://thetrove.is/Tabletop%20Games/BoardGames/Azhanti%20High%20Lightning/

AzhantiHighLightning.jpg goes to a 404

beowulf88 commented 3 years ago

Yes, I noticed the same thing with the D&D/Magazines folder, which goes to Home/Magazines instead of Home/Books/D&D/Magazines. So I guess I just have to skip these folders to avoid an error, if I can get the skip folder option to work.

NexusEye commented 3 years ago

Having the same problems. A possible quick fix off the top of my head is to check whether the current page's URL is a substring of the destination page's URL. So if the current URL is https://thetrove.is/Books/SomeGame/ and the destination page is https://thetrove.is/Books/SomeGame/Subfolder/, it returns true and continues, while if the link sends the program to https://thetrove.is/Books/SomeOtherGame/, it returns false and skips it. The main issue is that the links don't send you to the page directly: the link may say https://thetrove.is/Books/SomeGame/SomeOtherGame/, but that page redirects you to https://thetrove.is/Books/SomeOtherGame/. If I knew enough about web scraping to know how to solve that, I wouldn't be here.
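The check described above could be sketched roughly as follows. This is a Python illustration only, not the downloader's code; the function name `is_inside` is made up for the example, and it compares URL paths on a folder boundary rather than doing a raw substring test.

```python
from urllib.parse import urlsplit


def is_inside(current_url: str, destination_url: str) -> bool:
    """Return True if destination_url lies at or below current_url's folder."""
    current = urlsplit(current_url)
    destination = urlsplit(destination_url)
    # A different scheme or host can never be a subfolder of the current page.
    if (current.scheme, current.netloc) != (destination.scheme, destination.netloc):
        return False
    # Compare on a trailing-slash boundary so /Books/SomeGame/ does not
    # accidentally match /Books/SomeGameTwo/.
    base = current.path if current.path.endswith("/") else current.path + "/"
    return destination.path.startswith(base)
```

As noted above, this only helps if it is applied to the URL the page actually resolves to, since the link text itself still looks like a subfolder.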

felipegiacomozzi commented 3 years ago

Hello @turnerjoy, @beowulf88 and @NexusEye.

Sorry for the delay, had internet issues here this week. I hopefully fixed this with the new release: https://github.com/felipegiacomozzi/the-trove-downloader/releases

The problem wasn't really that the redirect URL was in a different path; there is actually a redirect page that is loaded first:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>TheTrove</title>
<META HTTP-EQUIV="REFRESH" CONTENT="1; URL=https://thetrove.is/Books/Archipelago/">
</head>
<body>
</body>
</html>

You can't really see it in the browser because the redirect is really quick. That is why the program crashed: it couldn't interpret this redirect page. I added support for that, and now it extracts the URL and loads it as the next page.
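For anyone curious what extracting the URL from such a page can look like, here is a minimal Python sketch using only the standard library. It is an illustration, not the code from the release; `MetaRefreshParser` and `extract_meta_refresh_url` are names invented for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaRefreshParser(HTMLParser):
    """Collects the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.redirect_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so META / HTTP-EQUIV match.
        if tag != "meta" or self.redirect_url is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # CONTENT looks like "1; URL=https://..."; drop the delay, keep the target.
        _, _, target = attributes.get("content", "").partition(";")
        key, _, url = target.strip().partition("=")
        if key.strip().lower() == "url" and url.strip():
            self.redirect_url = url.strip().strip("'\"")


def extract_meta_refresh_url(html: str) -> Optional[str]:
    """Return the meta-refresh target of a page, or None if it has none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.redirect_url
```

Fed the redirect page shown above, this should return https://thetrove.is/Books/Archipelago/; for a normal listing page it returns `None`.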

Thanks for the report guys, that really helped me to find this problem.

kdiii1 commented 3 years ago

I've encountered a bit of a problem with this where the program keeps redirecting infinitely. I've attached the log file the-trove-downloader.log.

felipegiacomozzi commented 3 years ago

There is a lot of cyclic redirection, so I had to remove the redirect navigation: https://github.com/felipegiacomozzi/the-trove-downloader/releases/tag/1.0.15

In this release it will ignore redirects.
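For reference, an alternative to ignoring redirects entirely would be to follow them while remembering which URLs have already been seen. The Python sketch below is not what release 1.0.15 does; `follow_redirects`, `next_redirect`, and `max_hops` are hypothetical names, and `next_redirect` stands in for "load the page and return its redirect target, or `None` if it is a normal page".

```python
from typing import Callable, Optional


def follow_redirects(
    start_url: str,
    next_redirect: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> Optional[str]:
    """Follow a redirect chain; return the final URL, or None on a cycle or an overly long chain."""
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        target = next_redirect(url)
        if target is None:
            return url  # not a redirect page: this is the real destination
        if target in seen:
            return None  # cyclic redirection: give up on this link
        seen.add(target)
        url = target
    return None  # chain longer than max_hops: treat it like a cycle
```

A caller would skip any link for which this returns `None`, which keeps the crawler from looping on cyclic redirects while still reaching folders that redirect only once or twice.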