fmd-project-team / FMD


Can't download some images [HentaiFox] #658

Closed: Slasar41 closed this issue 4 years ago

Slasar41 commented 4 years ago

Currently the HentaiFox module grabs the images by rewriting each thumbnail URL into the full-image URL, e.g. /XXt.jpg into /XX.jpg. I just noticed that for some thumbnails, even though the thumbnail is XXt.jpg, removing the t gives 404 Not Found, and FMD fails to download the image. So I tried opening the full image manually, and it turns out to be a .png! Example link: https://hentaifox.com/gallery/63312/

SDXC commented 4 years ago

Maybe it's better to get the links one by one like this example: https://github.com/fmd-project-team/FMD/commit/08b0814f896179bb00d1e939ee478ebda9463bfb#diff-aad3ab1e472d073131b0a7712ef5fc11R101

This is a two-step procedure: first it goes through all the pages and collects the image URLs, then it downloads the images.

In the GetPageNumber function you parse the "a/href" attributes instead of the img tags, and in the GetImageURL function you extract the actual image URL. After that, FMD downloads the images normally.

Slasar41 commented 4 years ago

Thanks for the example. Alright, I'll try it.

Slasar41 commented 4 years ago

@SDXC As I understand it, _M.GetPageNumber() in the NineManga module grabs all the links from the dropdown menu on that page, but that doesn't apply to HentaiFox, because I don't see any such links (or I just can't find them). Could you give me another example?

SDXC commented 4 years ago

@Slasar41 Here's just one way to possibly get it done:

  1. Find the page count. I don't know whether the site counts actual pages or image files. This matters for double pages, so test this approach with both single and double pages. In your example you should be able to find the page count here: <span class="i_text pages">Pages: 22</span>

  2. Get the links to the reader pages, from which you later extract the image URLs. This is an example of the first link: <div class="gallery_thumb"> <div class="g_thumb"> <a href="/g/63312/1/"> In the HTML (before you click "View All") you will notice that only 10 pages are listed. To get the other 12 pages you would need to make an XMLHttpRequest (XHR.Get()) or simply increment the page number in the href attribute. So it would be best to take the page count from step 1, extract the gallery id from the URL, and then build the URLs yourself with MaybeFillHost(module.RootURL, '/g/' .. galleryId .. '/' .. currentPage)

  3. Drop these links into the PageContainerLinks array, as in the NineManga example.

  4. Go through the links, extract the actual image link from each one, and put it in the PageLinks array. That should do it.

Important: GetImageURL is automatically called for each entry in PageContainerLinks. The parameter workid is automatically incremented as well.
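Steps 1 to 3 can be tried outside FMD first. The following is only a minimal sketch in plain Lua: the two input strings stand in for text that a real module would read from the fetched page, `root_url` is an assumed value for module.RootURL, and a plain table stands in for PageContainerLinks. None of the FMD functions are called here.

```lua
-- Stand-in inputs; in a real module these would come from the fetched page.
local pages_text = 'Pages: 22'        -- text of the "i_text pages" span
local first_href = '/g/63312/1/'      -- href of the first reader link
local root_url = 'https://hentaifox.com'  -- assumed value of module.RootURL

-- Step 1: page count. Step 2: gallery id taken from the first reader link.
local page_count = tonumber(pages_text:match('(%d+)'))
local gallery_id = first_href:match('/g/(%d+)/')

-- Step 3: build one reader-page URL per page (plain table instead of PageContainerLinks).
local page_container_links = {}
for page = 1, page_count do
  page_container_links[#page_container_links + 1] =
    root_url .. '/g/' .. gallery_id .. '/' .. page
end

print(#page_container_links)     -- 22
print(page_container_links[1])   -- https://hentaifox.com/g/63312/1
print(page_container_links[22])  -- https://hentaifox.com/g/63312/22
```

Building the URLs from the page count avoids depending on the 10 links that are present before "View All" is clicked.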

Slasar41 commented 4 years ago

I still don't know how to enter the reader pages properly. Here's what I've written so far:

function getpagenumber()
  task.pagelinks.clear()
  if http.get(MaybeFillHost(module.rooturl, url)) then
    local x=TXQuery.Create(http.Document)
    local galleryId = x.xpathstring('//a[@class="g_button"]/@href/substring-before(.,"/1/")')
    task.pagenumber = tonumber(x.xpathstring('//span[@class="i_text pages"]/substring-after(.,": ")'))
    x.xpathstringall(MaybeFillHost(module.RootURL .. galleryId .. '/' .. IncStr(url)), task.PageContainerLinks)
  else
    return false
  end
  return true
end

function getimageurl()
  local u = MaybeFillHost(module.RootURL, task.PageContainerLinks[workid])
  if http.Get(u) then
    task.PageLinks[workid] = TXQuery.Create(http.document).XPathString('//div[@class="full_image")]//img/@src')
    return true
  end
  return false
end

SDXC commented 4 years ago

Here is the working part:

function getpagenumber()
  if http.get(MaybeFillHost(module.rooturl, url)) then
    local x=TXQuery.Create(http.Document)
    local galleryId = x.xpathstring('//a[@class="g_button"]/@href'):match('/.-/(%d+)/')
    task.pagenumber = tonumber(x.xpathstring('//span[@class="i_text pages" and contains(., "Pages")]/substring-after(.,": ")'))
    for i = 1, task.pagenumber do
      task.PageContainerLinks.Add(MaybeFillHost(module.RootURL, '/g/' .. galleryId .. '/' .. i))
    end
  else
    return false
  end
  return true
end

function getimageurl()
  local u = MaybeFillHost(module.RootURL, task.PageContainerLinks[workid])
  if http.Get(u) then
    task.PageLinks[workid] = TXQuery.Create(http.document).XPathString('//div[@class="full_image"]//img/@src')
    return true
  end
  return false
end

Now let's look at the details:

  1. local galleryId = x.xpathstring('//a[@class="g_button"]/@href'):match('/.-/(%d+)/')
     It is better to extract only the id, without any other parts of the URL. This matters if the URL format changes or if the id is needed for other things.

  2. task.pagenumber = tonumber(x.xpathstring('//span[@class="i_text pages" and contains(., "Pages")]/substring-after(.,": ")'))
     Your version matches two nodes, one of which doesn't contain the page number. To make sure we don't pick the wrong node, we also check that the content contains the word "Pages".

  3. for i = 1, task.pagenumber do
       task.PageContainerLinks.Add(MaybeFillHost(module.RootURL, '/g/' .. galleryId .. '/' .. i))
     end
     GetPageNumber() only runs once. When you generate a list of near-identical entries that differ only in one incremented part, it is best to use a for loop and its automatically incremented loop variable i.

  4. Your script also had a small mistake in the MaybeFillHost() call. You wrote MaybeFillHost(module.RootURL .. '/g/' .. but the correct syntax is MaybeFillHost(module.RootURL, '/g/' .. Use , instead of .. after RootURL, so that the host and the path are passed as two separate arguments.

  5. Your XPath in the GetImageURL function also had a small typo: a stray ) inside the div predicate.

  6. Don't forget to register the GetImageURL() function in AddWebsiteModule: m.ongetimageurl = 'getimageurl'
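To make point 4 concrete, here is a small sketch of why the comma matters. `fill_host` is a hypothetical stand-in written only for this illustration; it is not FMD's MaybeFillHost, and how the real function reacts to a missing argument is not shown in this thread. The sketch only demonstrates how many arguments arrive in each case. `root_url` is an assumed value for module.RootURL.

```lua
-- Hypothetical stand-in for MaybeFillHost, written only for this illustration.
local function fill_host(root, path)
  if path == nil then
    return nil, 'path argument missing'
  end
  -- Join root and path with exactly one slash between them.
  return (root:gsub('/+$', '')) .. '/' .. (path:gsub('^/+', ''))
end

local root_url = 'https://hentaifox.com'  -- assumed value of module.RootURL
local gallery_id, page = '63312', 1

-- Comma: two arguments, so host and path stay separate.
local good = fill_host(root_url, '/g/' .. gallery_id .. '/' .. page)
print(good)  -- https://hentaifox.com/g/63312/1

-- Concatenation: the strings are joined before the call, so only one argument arrives.
local bad, err = fill_host(root_url .. '/g/' .. gallery_id .. '/' .. page)
print(bad, err)
```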

I don't know whether you changed anything else, so my draft is attached as well. You can fix your Lua script and make a pull request as usual ;-) HentaiFox.txt

Slasar41 commented 4 years ago

Great, thank you! It really helped me learn more about Lua. I still have one question: what does /.-/(%d+)/ mean? It looks random to me. The same goes for :gsub('(/%d+)[tT]%.', '%1.'); I only know that it removes characters. Also, should I split the module into two website modules following the template, or just keep it as is?

SDXC commented 4 years ago

Maybe you know regular expressions? Lua has something similar called patterns.

Let's construct the pattern by example.

If you have the string /g/43543/1/ you can use the pattern /.-/(%d+)/ to extract the galleryId:

  1. The / characters simply match the same character in the string.

  2. .- matches "any" character, repeated as few times as possible, until the pattern element after it matches for the first time. So /.-/ matches /g/

  3. %d matches any single digit. %d+ matches one or more consecutive digits. Following the example, the complete pattern therefore matches /g/43543/.

  4. Now, how do you get a specific part of the match so you can use it? By wrapping that part in ( and ). If the pattern matches part of your string, match returns the captured value. In our case that is 43543.
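The four points above can be checked in any plain Lua interpreter. This is a small sketch using only the standard string library; the greedy variant with `.*` is added here purely for contrast and is not part of the module.

```lua
local href = '/g/43543/1/'

-- Lazy `.-` stops at the first place where the rest of the pattern can match.
local lazy_id = href:match('/.-/(%d+)/')
print(lazy_id)    -- 43543

-- Greedy `.*` takes as much as possible, so the capture lands on the page number.
local greedy_id = href:match('/.*/(%d+)/')
print(greedy_id)  -- 1

-- Without parentheses, match returns the whole matched text.
local whole = href:match('/.-/%d+/')
print(whole)      -- /g/43543/

-- match returns a string; convert it explicitly if a number is needed.
print(tonumber(lazy_id) + 1)  -- 43544
```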

gsub() is a mix of everyday functions like "match", "replace" and "substring". To answer simply: the first parameter in your gsub example is the search pattern, and the second parameter is the text that each match is replaced with. When using patterns, you can capture part of the match (the part inside the parentheses, (/%d+)) and reuse it in the replacement string. To refer to the captured part, use %1. If your pattern has more than one pair of parentheses, use %2, %3 and so on, one per pair. Captures are numbered from left to right in the search pattern. With parentheses nested inside parentheses it gets a little trickier though ;-)
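A short sketch of gsub with captures, again in plain Lua. The thumbnail URL is hypothetical and only serves to exercise the pattern from the question; the date string is an unrelated example added to show %1, %2 and %3 together.

```lua
-- Hypothetical thumbnail URL, used only to exercise the pattern.
local thumb = 'https://example.com/002/63312/5t.jpg'

-- (/%d+) captures the slash and the page number; [tT]%. matches the t and the dot.
-- %1 puts the captured part back, so only the t is dropped.
-- gsub returns the new string and the number of substitutions made.
local full, count = thumb:gsub('(/%d+)[tT]%.', '%1.')
print(full)   -- https://example.com/002/63312/5.jpg
print(count)  -- 1

-- Several captures: %1, %2 and %3 refer to the parentheses from left to right.
local swapped = ('2020-01-15'):gsub('(%d+)%-(%d+)%-(%d+)', '%3/%2/%1')
print(swapped)  -- 15/01/2020
```

Note that this rewrite keeps the file extension unchanged, which is why it cannot help when the full image uses a different extension than its thumbnail, as described at the top of this issue.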

Also regarding the HentaiFox module: it doesn't really matter that much. What matters first is that the module works. Following the new module structure is not a must, but I hope it will help later on.

Slasar41 commented 4 years ago

Great explanation! So it's like regular expressions; I really didn't know about them until now. I need to learn more about this. :3 Alright, I'll make a PR later.