Anahkiasen / flatten

A package to flatten any website to plain HTML

Homepage cache is wrong #25

Closed iGeckoDev closed 9 years ago

iGeckoDev commented 10 years ago

For some reason, the cached homepage does not display the homepage data; instead, it displays the data of another page. All other pages are cached properly. Any idea what might be causing this?

iGeckoDev commented 10 years ago

I have been doing some more testing. When I cache an individual page it works properly. The problem occurs when caching multiple pages. The content of the first page to be cached is always replaced by the content of the last page to be cached. The bug only occurs when using 'artisan flatten:build'.

TheTekton commented 9 years ago

[Image: root_cached_twice]

I think there are two things going on here:

First, and this doesn't seem to be a big issue, the index gets queued twice (at least in some cases), although I would prefer it only be queued once. I think this happens because the root URL can appear in your content in two different forms (e.g. relative and absolute). The root provided by --root= or taken from the app config doesn't seem to be meant to end with '/', while the root is preset in the queue array as '/' in the Crawler class. As a quick hack, I added a check to the Crawler->queueLink() method (highlighted in red in the attached image).
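
To make the double-queueing concrete, here is a minimal sketch in Python rather than the package's PHP. It is not the actual Crawler->queueLink() code; `normalize_link` and `queue_link` are hypothetical names, and the sketch assumes links have already been filtered to the same site. The idea is to reduce both the absolute and the relative form of the root to a single key before checking the queue:

```python
from urllib.parse import urlsplit


def normalize_link(link):
    """Reduce an absolute or relative same-site link to a canonical path.

    Hypothetical helper: keeps only the path and drops any trailing slash,
    so 'http://example.test', 'http://example.test/' and '/' all become '/'.
    """
    path = urlsplit(link).path
    return "/" + path.strip("/")


def queue_link(queue, link):
    """Append the link to the queue only if its canonical form is new."""
    key = normalize_link(link)
    if key not in queue:
        queue.append(key)


# The queue starts with the root preset as '/', as described above.
queue = ["/"]
links = [
    "http://example.test",         # absolute root, no trailing slash
    "http://example.test/",        # absolute root, trailing slash
    "/",                           # relative root
    "/about",
    "http://example.test/about/",  # absolute form of an already queued page
]
for link in links:
    queue_link(queue, link)
```

With this normalization the root is queued once no matter how it is written in the page content.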

Second, I think the root is being replaced with the last page cached because the key stays the same. I haven't run through the code for the salt yet, but I'm assuming that if it's not set in the config, it will cause this problem. I'll have more time this evening to figure that out. By now, I'm guessing you either gave up or found a workaround, but I'm sure someone else will be as confused as we were.

TheTekton commented 9 years ago

From what I can tell, when using artisan flatten:build, the hash is always 'GET-/' for every page. I don't see anywhere in the BuildCommand code where the hash is actually set. I think it needs to be set to the request type and page URL for every page crawled. My flatten.json looks like this after crawling a few pages: {"cached":["GET-\/","GET-\/","GET-\/"]
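
A small model of why a constant hash produces the reported symptom, written in Python as a sketch rather than taken from the package. The `build` function and the sample pages are hypothetical; the only detail borrowed from the thread is the 'GET-/' key:

```python
def build(pages, current_hash="GET-/"):
    """Store every crawled page under a hash that is never updated.

    Models the suspected bug: the key is fixed at 'GET-/' for the whole run,
    so each page overwrites the previous one in storage.
    """
    storage = {}
    cached = []
    for _path, html in pages:
        storage[current_hash] = html
        cached.append(current_hash)
    return storage, cached


pages = [
    ("/", "<h1>Home</h1>"),
    ("/about", "<h1>About</h1>"),
    ("/contact", "<h1>Contact</h1>"),
]
storage, cached = build(pages)
```

After the run, `cached` holds the same key three times, matching the flatten.json excerpt, and the single stored entry contains the last page crawled. That is consistent with the homepage showing another page's content.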

TheTekton commented 9 years ago

[Images: cachehandler php, crawler php]

Adding a CacheHandler->setHash() method and modifying the Crawler->getPage($url) method to use it together with Flatten->computeHash() works for me.
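
The shape of that fix can be sketched as follows. This is a Python model, not the patched PHP: the class and function names mirror the ones mentioned above, but their bodies, the `store` method, and the 'METHOD-path' hash format are assumptions made for illustration.

```python
class CacheHandler:
    """Toy cache handler that stores content under its current hash."""

    def __init__(self, initial_hash="GET-/"):
        self.hash = initial_hash
        self.storage = {}

    def set_hash(self, new_hash):
        # Counterpart of the added setHash(): lets the crawler re-key the handler.
        self.hash = new_hash

    def store(self, content):
        self.storage[self.hash] = content


def compute_hash(method, path):
    """Assumed hash format: request type plus page path, e.g. 'GET-/about'."""
    return method.upper() + "-" + path


def get_page(handler, path, html):
    """Set the hash for this page before storing it, as in the modified getPage()."""
    handler.set_hash(compute_hash("GET", path))
    handler.store(html)


handler = CacheHandler()
for path, html in [("/", "<h1>Home</h1>"), ("/about", "<h1>About</h1>")]:
    get_page(handler, path, html)
```

Because the hash is recomputed for each crawled page, every page ends up under its own key and the homepage entry is no longer overwritten.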

Anahkiasen commented 9 years ago

Will reopen if this is still happening on 1.0