Letractively / simplesitemapcreator

Automatically exported from code.google.com/p/simplesitemapcreator
BSD 3-Clause "New" or "Revised" License

A Question #1

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Is this project alive? If yes, I would be happy to test it.
Sujay

Original issue reported on code.google.com by insights...@gmail.com on 3 Aug 2012 at 2:32

GoogleCodeExporter commented 8 years ago
Hi Sujay, sorry I missed your comment. For some reason Google Code will email 
me when you reply but not when you add a new issue.

This project isn't dead, but it's a low priority at the moment. I would like 
to pick up where I left off, as I thought it was quite promising.

Original comment by hippy2094 on 19 Aug 2012 at 7:19

GoogleCodeExporter commented 8 years ago
Nopes, not a problem. :)

Keep me informed if you work on it in future and make a download 
available. A reply in this section would be enough. ;)

Original comment by insights...@gmail.com on 19 Aug 2012 at 8:55

GoogleCodeExporter commented 8 years ago
Well, I couldn't resist ;) An alpha for Windows is available on the Downloads 
page. I shouldn't try to crawl half the internet!

Original comment by hippy2094 on 20 Aug 2012 at 2:02

GoogleCodeExporter commented 8 years ago
I couldn't find anything alpha-quality about it; it worked perfectly here. I 
posted about it in my forum.
http://insightsintechnologyforum.tk/Thread-Simple-Sitemap-Creator-Alpha?pid=24

Original comment by insights...@gmail.com on 20 Aug 2012 at 8:02

GoogleCodeExporter commented 8 years ago
I can see from your screenshots that things aren't 100% correct; those entries 
starting "/www.blogger.com" are a bit odd. 

I also completely forgot to process links that don't have plain text 
:D There is also an apparent bug in the tag parsing: your first line 
includes the style attribute from the A tag instead of dropping it and placing 
the image tag in there.

Original comment by hippy2094 on 21 Aug 2012 at 8:45
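
A minimal sketch of the two parsing problems described above: links that contain no plain text, and attributes leaking into the label. This is not the project's actual parser. It is a Python illustration that assumes the label should fall back to the image's alt text, and then to the href itself. `LinkTextParser` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, label) pairs; fall back to an image's alt text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, label) pairs
        self._href = None    # href of the <a> currently open, if any
        self._parts = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Only the href is kept; style, class and so on are dropped.
            self._href = attrs["href"]
            self._parts = []
        elif tag == "img" and self._href is not None:
            # No plain text in the link: use the alt text as the label.
            self._parts.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace left over from nested tags.
            label = " ".join(" ".join(self._parts).split())
            self.links.append((self._href, label or self._href))
            self._href = None


def extract_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

With this approach an image-only link such as the addthis button would be labelled "Bookmark and Share" rather than picking up its style attribute.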

GoogleCodeExporter commented 8 years ago
You must be right about that. I know very little about all this. All I 
understood is that the program is pretty stable but coding may have errors, as 
you know better. Something was suspicious for me too, it picked up many more 
links from my blog than other sitemap tools do. For your use I have uploaded 
the sitemap it created.
http://www.datafilehost.com/download-11654091.html

Original comment by insights...@gmail.com on 21 Aug 2012 at 2:04

GoogleCodeExporter commented 8 years ago
The main issues I see from crawling your site are:

1) It's following https links on different domains
2) It's not stripping anchors from links, so it treats each anchored link as a new link
3) It needs to ignore JavaScript and invalid links.

Thanks for your testing efforts, this has given me something to go on :)

Original comment by hippy2094 on 21 Aug 2012 at 3:05
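
The three issues listed above can be sketched as one filtering step. This is a Python illustration built on the standard `urllib.parse` module, not the application's own code. The function name `normalise_links` and the example.com URLs are assumptions made for the example.

```python
from urllib.parse import urljoin, urldefrag, urlsplit


def normalise_links(page_url, hrefs):
    """Return the same-host http(s) links found on page_url, without fragments."""
    site_host = urlsplit(page_url).netloc.lower()
    kept = []
    seen = set()
    for href in hrefs:
        href = href.strip()
        # 3) skip empty, fragment-only and javascript: pseudo-links
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        absolute = urljoin(page_url, href)
        # 2) strip the fragment so page.html#top and page.html count as one link
        absolute, _fragment = urldefrag(absolute)
        parts = urlsplit(absolute)
        # 1) and 3) keep only http(s) links on the host being crawled
        if parts.scheme not in ("http", "https") or parts.netloc.lower() != site_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

Comparing hosts exactly is a deliberate simplification; a real crawler might also want to treat `example.com` and `www.example.com` as the same site.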

GoogleCodeExporter commented 8 years ago
Great !!
Wish to see an 'improved' version ;)

Original comment by insights...@gmail.com on 21 Aug 2012 at 3:13

GoogleCodeExporter commented 8 years ago
Just a little update (I haven't released a new exe yet): numbers 1 and 2 from 
my list appear to be fixed, it's just number 3 to go, which is the fun one :/

Original comment by hippy2094 on 30 Aug 2012 at 7:30

GoogleCodeExporter commented 8 years ago
Good to know that :-) Take care, don't work too hard ;-)

Original comment by insights...@gmail.com on 31 Aug 2012 at 7:47

GoogleCodeExporter commented 8 years ago
0.1.3 released, should be a major improvement :)

Original comment by hippy2094 on 13 Sep 2012 at 4:12

GoogleCodeExporter commented 8 years ago
Sorry for the late response... Will test this today...:)

Original comment by insights...@gmail.com on 22 Sep 2012 at 8:02

GoogleCodeExporter commented 8 years ago
Hi, the following are the sitemaps of my site created with Simple Sitemap 
Creator [HTML file] and www.xml-sitemaps.com [text file].

Original comment by insights...@gmail.com on 23 Sep 2012 at 9:20

Attachments:

GoogleCodeExporter commented 8 years ago
Wow, I really didn't think it had been this long since I last looked at this 
application!

I have updated this app; it now includes a Google Sitemaps compatible XML 
output as well, which seems much more stable than the HTML output, which still 
needs work.

Original comment by hippy2094 on 26 Jan 2013 at 8:04
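
For readers unfamiliar with the XML format mentioned above, here is a minimal Python sketch of a sitemaps.org `<urlset>` document, which is the format Google accepts. It only emits the required `<loc>` element. The function name `build_sitemap` is hypothetical, and nothing here is taken from the application's source.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialise a list of absolute URLs as a sitemaps.org <urlset> document."""
    # Register the namespace as the default so tags are written without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        loc = ET.SubElement(entry, "{%s}loc" % SITEMAP_NS)
        loc.text = url  # ElementTree escapes &, < and > when serialising
    return ET.tostring(urlset, encoding="unicode")
```

Letting the XML library do the escaping is the main design point: URLs with query strings contain `&`, which must appear as `&amp;` in the output.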

GoogleCodeExporter commented 8 years ago
Will try it when I get time. :)

Original comment by insights...@gmail.com on 28 Jan 2013 at 1:09

GoogleCodeExporter commented 8 years ago
Actually, ignore the 0.1.4 release. I've completely changed the HTML parsing 
routine, which seems a lot better - 0.1.5 will be out in a day or two :)

Original comment by hippy2094 on 29 Jan 2013 at 1:37

GoogleCodeExporter commented 8 years ago
0.1.5 is out, hopefully the HTML parsing is much more reliable now.

Original comment by hippy2094 on 30 Jan 2013 at 7:18

GoogleCodeExporter commented 8 years ago
Ok, I will test it on the weekend. :)

Original comment by insights...@gmail.com on 30 Jan 2013 at 7:20

GoogleCodeExporter commented 8 years ago
It now works better. :) I found about two unnecessary links there.

Original comment by insights...@gmail.com on 2 Feb 2013 at 10:37

Attachments:

GoogleCodeExporter commented 8 years ago
When I crawl the same site, the addthis link that appears in your result doesn't 
seem to appear in mine. Was that a malformed link that has since been fixed?

The # link is an easy one to fix. The source was submitted this morning, and 
I'll sort out a new release after work :)

I'm starting to think we are finally getting past the alpha stage. I would 
like to take this opportunity to thank you for your input - perhaps I can find 
room in the About dialog for a special mention :)

Original comment by hippy2094 on 4 Feb 2013 at 7:50

GoogleCodeExporter commented 8 years ago
Hi,
I also think that the alpha stage is over.
Thanks for your kind comments, but so far I haven't done anything apart from 
running a scan and attaching the log :P
Anyway, in my latest scan of the same website, I didn't find that addthis 
link either, although I didn't remove any link.
Today I looked closely at the latest log and found a few problems. In the created 
sitemap, it gathers the title of a link in some cases and fails in 
others.

Original comment by insights...@gmail.com on 4 Feb 2013 at 8:25

Attachments:

GoogleCodeExporter commented 8 years ago
I did actually notice this myself. I will look into it and see what I can find.

Original comment by hippy2094 on 4 Feb 2013 at 9:13

GoogleCodeExporter commented 8 years ago
Righty, that's fixed. I apologise for the excessive hits I've caused to that 
domain; it didn't happen on any of my sites. I will be releasing 0.1.6 later 
today :)

Original comment by hippy2094 on 5 Feb 2013 at 11:10

GoogleCodeExporter commented 8 years ago
Hi,
That's not a problem. Waiting to test the upcoming version! :)

Original comment by insights...@gmail.com on 5 Feb 2013 at 11:13

GoogleCodeExporter commented 8 years ago
Slight delay, but it's ready for download :)

Original comment by hippy2094 on 6 Feb 2013 at 1:37

GoogleCodeExporter commented 8 years ago
Hi,
The sitemap of techoffer.in worked perfectly (see attachment). But while creating 
the sitemap of insightsintechnology.com it behaved strangely. It started to hunt 
for thousands of links, and found even more when I tried to terminate the 
application. I had to close it from Task Manager. Can you please check?

Original comment by insights...@gmail.com on 7 Feb 2013 at 6:19
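
The thread does not say what caused the runaway crawl, so the following is only a generic Python sketch of the usual safeguards: visit each URL once, cap the number of pages, and check a cancellation flag on every iteration so the user can stop the crawl without Task Manager. `crawl`, `get_links`, `max_pages` and `stop_event` are hypothetical names; `get_links` stands in for whatever fetches a page and returns its links.

```python
from collections import deque
import threading


def crawl(start, get_links, max_pages=500, stop_event=None):
    """Breadth-first crawl that visits each URL once and can be cancelled."""
    stop_event = stop_event or threading.Event()
    seen = {start}          # every URL ever queued, to avoid revisiting
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages and not stop_event.is_set():
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In a GUI application the crawl would run on a worker thread and the Stop button would call `stop_event.set()`.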

GoogleCodeExporter commented 8 years ago
Oh my, that's not good :D I'll look into it tomorrow

Original comment by hippy2094 on 7 Feb 2013 at 10:29

GoogleCodeExporter commented 8 years ago
426 links retrieved - that sounds a bit better. Can you browse this attachment 
and tell me if you see anything wrong?

Original comment by hippy2094 on 8 Feb 2013 at 11:50

Attachments:

GoogleCodeExporter commented 8 years ago
Hi, I find the following links abnormal.

http://www.insightsintechnology.com//addthis.com/bookmark.php?v=300
file:/C%3A/Program%20Files/Files%20Terminator%20Free/Help/Help_It.html
faq:%20http%3A//www.f-secure.com/en/web/labs_global/removal/easy-clean/faq

And also I am not sure if the following links should be present.

http://www.insightsintechnology.com/search/label/Image%20Managers
http://www.insightsintechnology.com/search/label/Advertisements
http://www.insightsintechnology.com/search/label/PDF
http://www.insightsintechnology.com/search/label/Online%20Service
http://www.insightsintechnology.com/search/label/SoundCloud
http://www.insightsintechnology.com/search/label/File%20Hash
http://www.insightsintechnology.com/search/label/Encrypt
http://www.insightsintechnology.com/search/label/Shred

Original comment by insights...@gmail.com on 8 Feb 2013 at 4:38

GoogleCodeExporter commented 8 years ago
Taken from 
http://www.insightsintechnology.com/2013/02/aomei-dynamic-disk-manager-pro-giveaway.html 
(lines 207/208)

<a class="addthis_button" 
href="http://www.insightsintechnology.com//addthis.com/bookmark.php?v=300" 
addthis:url='http://www.insightsintechnology.com/2013/02/aomei-dynamic-disk-manager-pro-giveaway.html' 
addthis:title='AOMEI Dynamic Disk Manager Pro – 10 Licenses Giveaway '><img 
src="//cache.addthis.com/cachefly/static/btn/v2/lg-share-en.gif" width="125" 
height="16" alt="Bookmark and Share" style="border:0"/></a>

The parser is getting the correct text from the href :)

Why are you unsure about those links being present? They appear to be valid 
links.

Original comment by hippy2094 on 8 Feb 2013 at 4:59

GoogleCodeExporter commented 8 years ago
http://www.insightsintechnology.com/2012/04/clean-active-malware-infections-with-f.html#more-191 
line 207:

<p>See the product homepage of Easy Clean for more information about it. There 
is also a <a 
href="faq:%20http%3A//www.f-secure.com/en/web/labs_global/removal/easy-clean/faq" >FAQ</a> 
on the software that you might like to view.</p>

I'm still searching for the Program Files one

Original comment by hippy2094 on 8 Feb 2013 at 5:04

GoogleCodeExporter commented 8 years ago
http://www.insightsintechnology.com/2012/05/files-terminator-free-can-securely.html#more-155 
lines 266/267:

<a 
href="file:/C%3A/Program%20Files/Files%20Terminator%20Free/Help/Help_It.html" 
target="_blank">Italian language</a>

This one and the previous one should have been filtered out for not starting with 
the domain; the addthis one is technically a valid link. I'm not sure it's the 
job of a sitemap builder to worry about 404s?

Original comment by hippy2094 on 8 Feb 2013 at 5:09
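
To see why the three reported links behave differently, it helps to split them into their parts. The Python sketch below uses the standard `urllib.parse.urlsplit`; the helper `on_site` is a hypothetical name, and checking the parsed scheme and host is an assumed alternative to a "starts with the domain" string test, not the application's actual filter.

```python
from urllib.parse import urlsplit

# The three links reported as abnormal earlier in this thread.
reported = [
    "http://www.insightsintechnology.com//addthis.com/bookmark.php?v=300",
    "file:/C%3A/Program%20Files/Files%20Terminator%20Free/Help/Help_It.html",
    "faq:%20http%3A//www.f-secure.com/en/web/labs_global/removal/easy-clean/faq",
]


def on_site(url, host="www.insightsintechnology.com"):
    """True only for http(s) URLs whose host is the site being crawled."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.netloc == host


results = [on_site(u) for u in reported]  # [True, False, False]

# In the addthis link, "//addthis.com/..." comes after the host,
# so it is part of the path and the link stays on the same site.
addthis_path = urlsplit(reported[0]).path  # "//addthis.com/bookmark.php"
```

The `file:` and `faq:` links are rejected on scheme alone, while the addthis link is same-site and can only be found broken by actually requesting it.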

GoogleCodeExporter commented 8 years ago
Actually, I was unsure because there should be more such links: it is 
indexing some search labels, and there are many such labels which it didn't 
index. Also, Google never indexes those.

Regarding that 404 link: why is it getting only this one? There are 
dozens of 404s which are converted to 301s with the help of a plugin. :)

Original comment by insights...@gmail.com on 9 Feb 2013 at 7:07

GoogleCodeExporter commented 8 years ago
Ah, it doesn't follow redirects. Is there something in robots.txt that stops 
Google following those? Which actually brings up another important question: is 
this program a bot that should follow the rules of robots.txt?

Original comment by hippy2094 on 10 Feb 2013 at 9:29

GoogleCodeExporter commented 8 years ago
I have checked, but there are no such rules in robots.txt :)

Original comment by insights...@gmail.com on 10 Feb 2013 at 6:40

GoogleCodeExporter commented 8 years ago
I guess it's this line in your source 

<meta name="robots" content="noindex,follow"/> 

"The spider will not look at this page but will crawl through the rest of the 
pages on your website." - http://www.metatags.info/meta_name_robots

This line doesn't seem to appear on non-search pages.

Original comment by hippy2094 on 11 Feb 2013 at 9:26
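
If a sitemap builder wanted to honour that meta tag, one possible approach is to read the directives while parsing each page and leave `noindex` pages out of the sitemap while still following their links. The Python sketch below is an assumption about how that could look, not a description of the application. `RobotsMetaParser`, `robots_directives` and `should_list` are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def should_list(html):
    """False when the page asks not to be indexed."""
    return "noindex" not in robots_directives(html)
```

For the line quoted above, `robots_directives` yields `{"noindex", "follow"}`, so the page would be crawled for links but not listed.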

GoogleCodeExporter commented 8 years ago
Ok, I will have to check it :)

Original comment by insights...@gmail.com on 11 Feb 2013 at 10:28

GoogleCodeExporter commented 8 years ago
Okey dokey :) That still leaves the question of whether or not it should be 
following the rules of robots.txt; I'm thinking yes.

Original comment by hippy2094 on 11 Feb 2013 at 10:43
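
As an illustration of what following robots.txt could involve, Python's standard library ships a parser for it. This sketch assumes the robots.txt text has already been downloaded. The user agent string "SimpleSitemapCreator", the helper name `make_checker` and the example rules are all made up for the example; as noted above, the tester's real robots.txt contains no such rules.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt, user_agent="SimpleSitemapCreator"):
    """Return a function that says whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
example_rules = "User-agent: *\nDisallow: /search/\n"
allowed = make_checker(example_rules)
```

With these example rules, `allowed("http://example.com/search/label/PDF")` is `False` and `allowed("http://example.com/2013/02/post.html")` is `True`.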

GoogleCodeExporter commented 8 years ago
New version released :)

Original comment by hippy2094 on 13 Feb 2013 at 6:27

GoogleCodeExporter commented 8 years ago
Will test it on the weekend :D

Original comment by insights...@gmail.com on 13 Feb 2013 at 6:34

GoogleCodeExporter commented 8 years ago
Just so you know, there is now a wiki page of known issues :) 
http://code.google.com/p/simplesitemapcreator/wiki/currentissues

Original comment by hippy2094 on 17 Feb 2013 at 3:16

GoogleCodeExporter commented 8 years ago
Hi,
Sorry for the late response. I just ran a quick test and found a few problems 
with my site techoffer.in.
If you look carefully at the attachment, it is not able to get the name
Page 1 of ----
in any of the cases.
There is no problem with the other cases, "page 2 of", "page 3 of", etc.

Original comment by insights...@gmail.com on 21 Feb 2013 at 6:55

Attachments:

GoogleCodeExporter commented 8 years ago
Very strange, I will look into it. In the meantime, I have another new project 
I just released for Windows and OSX that you might be interested in: 
http://code.google.com/p/page2png/ let me know what you think :)

Original comment by hippy2094 on 21 Feb 2013 at 9:03

GoogleCodeExporter commented 8 years ago
Hi, Thanks for that.
I have posted an issue in your new project. Nice one!

Original comment by insights...@gmail.com on 21 Feb 2013 at 9:34