amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Generate SiteMapTool.jar from the SiteMapTester #44

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Currently, in our sitemaps package we have the following file: 
SiteMapTester.java

The file name is misleading, as we shouldn't have a test file in our regular 
src tree (as Lewis has previously mentioned).

After examining the file, I think I understand why it exists: it takes an 
online sitemap and parses it recursively, printing all of the sitemap URLs as 
a list. If the sitemap is an index, it also parses all of the sitemaps it 
references and prints every URL entry to the console.
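
The recursive walk described above could be sketched roughly as follows. This is a self-contained illustration, not the code in SiteMapTester.java and not the crawler-commons API: the `ParsedSiteMap` and `Loader` types and the `collectUrls` method are hypothetical names standing in for "fetch and parse one sitemap", and the example URLs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SiteMapWalkSketch {

    /** Hypothetical parse result: an index lists child sitemaps, a leaf lists page URLs. */
    static final class ParsedSiteMap {
        final boolean isIndex;
        final List<String> entries;

        ParsedSiteMap(boolean isIndex, List<String> entries) {
            this.isIndex = isIndex;
            this.entries = entries;
        }
    }

    /** Hypothetical stand-in for "fetch this sitemap URL and parse it". */
    interface Loader {
        ParsedSiteMap load(String sitemapUrl);
    }

    /** Returns every page URL reachable from the root sitemap, in visiting order. */
    static List<String> collectUrls(String rootSitemapUrl, Loader loader) {
        List<String> pageUrls = new ArrayList<>();
        walk(rootSitemapUrl, loader, new HashSet<>(), pageUrls);
        return pageUrls;
    }

    private static void walk(String sitemapUrl, Loader loader,
                             Set<String> visited, List<String> pageUrls) {
        // Skip sitemaps we have already seen, so a cyclic index cannot loop forever.
        if (!visited.add(sitemapUrl)) {
            return;
        }
        ParsedSiteMap parsed = loader.load(sitemapUrl);
        if (parsed == null) {
            return; // Assumption: a sitemap that cannot be loaded is silently skipped.
        }
        if (parsed.isIndex) {
            // An index only points at other sitemaps, so recurse into each one.
            for (String childSitemapUrl : parsed.entries) {
                walk(childSitemapUrl, loader, visited, pageUrls);
            }
        } else {
            // A regular sitemap contributes its URL entries directly.
            pageUrls.addAll(parsed.entries);
        }
    }

    public static void main(String[] args) {
        // In-memory fake data instead of real network access.
        Map<String, ParsedSiteMap> fake = new HashMap<>();
        fake.put("https://example.com/sitemap_index.xml", new ParsedSiteMap(true,
                Arrays.asList("https://example.com/sitemap-a.xml",
                              "https://example.com/sitemap-b.xml")));
        fake.put("https://example.com/sitemap-a.xml", new ParsedSiteMap(false,
                Arrays.asList("https://example.com/page1", "https://example.com/page2")));
        fake.put("https://example.com/sitemap-b.xml", new ParsedSiteMap(false,
                Arrays.asList("https://example.com/page3")));

        for (String url : collectUrls("https://example.com/sitemap_index.xml", fake::get)) {
            System.out.println(url);
        }
    }
}
```

In a real tool the `Loader` would presumably wrap the library's sitemap parser plus an HTTP fetch; the visited set is an extra safeguard that the original description does not mention.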

I actually like this sitemap parsing utility, because it provides something 
that our library doesn't support natively.

My most common scenario is parsing sitemaps recursively to get the full list 
of URLs. This was my original requirement when I stumbled upon this library: I 
have a PHP script site, and I wanted a list of all of my URLs...

We should have this functionality (parsing a sitemap recursively while 
retrieving a list of URLs) as a separate jar tool.

I'd suggest SiteMapTool.

It would be cleanest if this was a separate artifact from the build - e.g. we 
create a crawler-commons jar, and a crawler-commons-tools.jar, where the latter 
is an uber jar (includes all dependencies) so you can just run it from the 
command line.

We should also rename the original Java file accordingly.

Original issue reported on code.google.com by avrah...@gmail.com on 4 Jul 2014 at 9:36

GoogleCodeExporter commented 8 years ago

Original comment by kkrugler...@transpac.com on 8 Jul 2014 at 4:31

GoogleCodeExporter commented 8 years ago
I added this class as a simple way of checking what we were getting for a given 
sitemap URL. I'm OK with renaming it to SiteMapTool, but I don't think we need 
to provide the recursive parsing, as this can easily be built on top of CC. I'd 
also rather avoid multiplying the jars we produce, especially for such a small 
piece of functionality.
What about building this functionality outside CC and hosting it as a separate 
project, e.g. on GitHub?

Original comment by digitalpebble on 9 Jul 2014 at 8:07

GoogleCodeExporter commented 8 years ago
I still think we need this functionality in-house.

I am playing with sitemap parsing, and I return to this tool and use it each 
time.

There is a place for a GitHub project using netty, or whatever Ken suggested, 
for heavy-duty sitemap parsing: one that uses our library as the third-party 
sitemap parser and something else for the heavy-duty networking.

But I still think we need it in-house.

Maybe we should put it in the "test" folder?

What is so bad about another jar?

Original comment by avrah...@gmail.com on 11 Jul 2014 at 2:48

GoogleCodeExporter commented 8 years ago
"What is so bad about another jar?"

Having a separate JAR file for such a small piece of functionality does not 
make sense, and we want to avoid multiplying them, as I explained above.

Original comment by digitalpebble on 14 Jul 2014 at 1:10

GoogleCodeExporter commented 8 years ago
OK.

This is what we will do for this issue:

Just rename the file to use the "Tool" suffix instead of "Test".

Please note that this issue will be taken care of after the submission of 
issue 39.

Original comment by avrah...@gmail.com on 18 Jul 2014 at 7:08

GoogleCodeExporter commented 8 years ago
Please note that this issue will be taken care of after the submission of 
issue 43, not issue 39 as I wrote in the last comment.

Original comment by avrah...@gmail.com on 18 Jul 2014 at 7:20

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Jul 2014 at 8:05