asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

FileNotFoundException: .m2\repository\edu\uci\ics\crawler4j\4.0\crawler4j-4.0.jar!\tld-names.zip #329

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. When running behind a proxy, it tries to download the
"https://publicsuffix.org/list/effective_tld_names.dat" file and fails.
2. It then tries to get the "tld-names.zip" embedded in the jar file.
3. It fails to load:
file:\C:\Users\XXXX\.m2\repository\edu\uci\ics\crawler4j\4.0\crawler4j-4.0.jar!\tld-names.zip
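
The download-then-fallback order in steps 1 and 2 can be sketched as below. This is only an illustration, not crawler4j's actual code: the class name `TldSourceFallback`, the method `openFirstAvailable`, and both in-memory sources are hypothetical stand-ins, and the proxy failure is simulated so the example runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of crawler4j.
public class TldSourceFallback {

    /** Tries each source in order and returns the first stream that opens. */
    static InputStream openFirstAvailable(List<Callable<InputStream>> sources) throws IOException {
        IOException failure = new IOException("no TLD source could be opened");
        for (Callable<InputStream> source : sources) {
            try {
                InputStream in = source.call();
                if (in != null) {
                    return in;
                }
            } catch (Exception e) {
                // Remember why this source failed, then move on to the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }

    /** Simulates the report: the download fails, the bundled copy succeeds. */
    static String demo() throws IOException {
        List<Callable<InputStream>> sources = new ArrayList<>();
        // Stand-in for the publicsuffix.org download failing behind a proxy.
        sources.add(() -> { throw new IOException("simulated proxy failure"); });
        // Stand-in for the copy bundled in the jar (here just bytes in memory).
        sources.add(() -> new ByteArrayInputStream("com\norg\n".getBytes(StandardCharsets.UTF_8)));

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openFirstAvailable(sources), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("first entry: " + demo());
    }
}
```

If every source fails, the thrown exception carries each individual failure as a suppressed exception, which makes a proxy problem easier to tell apart from a packaging problem.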

What is the expected output? What do you see instead?
- It should read "tld-names.txt" from tld-names.zip.
- Instead it fails with:

17:16:25 ERROR [main] - [TLDList]- Couldn't find tld-names.txt
java.io.FileNotFoundException: file:\C:\Users\XXXX\.m2\repository\edu\uci\ics\crawler4j\4.0\crawler4j-4.0.jar!\tld-names.zip (The filename, directory name, or volume label syntax is incorrect)
    at java.io.RandomAccessFile.open(Native Method) ~[na:1.7.0_72]
    at java.io.RandomAccessFile.<init>(Unknown Source) ~[na:1.7.0_72]
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:211) ~[commons-compress-1.5.jar:1.5]
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:190) ~[commons-compress-1.5.jar:1.5]
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:162) ~[commons-compress-1.5.jar:1.5]
    at edu.uci.ics.crawler4j.url.TLDList.<init>(TLDList.java:41) [crawler4j-4.0.jar:na]
    at edu.uci.ics.crawler4j.url.TLDList.<clinit>(TLDList.java:27) [crawler4j-4.0.jar:na]
    at edu.uci.ics.crawler4j.url.WebURL.setURL(WebURL.java:103) [crawler4j-4.0.jar:na]
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:355) [crawler4j-4.0.jar:na]
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:311) [crawler4j-4.0.jar:na]
    at com.flado.rent.crawler.basic.BasicCrawlController.main(BasicCrawlController.java:98) [classes/:na]
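
The trace shows a file-based `ZipFile` being opened on a location inside a jar. The sketch below illustrates why that cannot work: a resource packed in a jar is addressed by a `jar:` URL, not by a path on disk, but it can still be read as a stream. The class `JarResourceDemo` and its throwaway jar are hypothetical and built only for this demonstration; nothing here is taken from the crawler4j sources.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical demo class, not part of crawler4j.
public class JarResourceDemo {

    /** Builds a throwaway jar holding one 3-byte entry, so no real crawler4j jar is needed. */
    static Path buildDemoJar() throws IOException {
        Path jar = Files.createTempFile("demo", ".jar");
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new JarEntry("tld-names.zip"));
            out.write(new byte[] {1, 2, 3});
            out.closeEntry();
        }
        return jar;
    }

    /** Looks the entry up through a class loader and compares file access with stream access. */
    static String inspect(Path jar, String name) throws IOException {
        try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
            URL url = loader.getResource(name);
            // The URL path still contains the "!" separator, so it is not a real file path.
            boolean existsAsFile = new File(url.getPath()).exists();
            int bytes = 0;
            try (InputStream in = loader.getResourceAsStream(name)) {
                while (in.read() != -1) {
                    bytes++;
                }
            }
            return url.getProtocol() + " existsAsFile=" + existsAsFile + " bytes=" + bytes;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(inspect(buildDemoJar(), "tld-names.zip"));
    }
}
```

The same lookup works unchanged when the resource sits in a plain directory on the classpath, which is why stream-based access tends to be the safer default for bundled files.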

What version of the product are you using?
4.0

Please provide any additional information below.
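The code in TLDList.java must be changed to read the zip as a classpath stream
instead of opening it as a file, for example:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Reads one zip entry line by line, skipping blank lines and "//" comments.
// The reader is not closed here: closing it would also close the shared zip stream.
public static void extractEntry(final ZipEntry entry, InputStream is) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("//")) {
        continue;
      }
      logger.info("{}", line);
    }
}

// Load the zip through the class loader, so it is found even when packed inside the jar.
InputStream is = ClassLoader.getSystemResourceAsStream(TLD_NAMES_ZIP_FILENAME);
if (is == null) {
    throw new FileNotFoundException(TLD_NAMES_ZIP_FILENAME + " not found on the classpath");
}
try (ZipInputStream zis = new ZipInputStream(is)) {
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
      logger.info("entry: {}, {}", entry.getName(), entry.getSize());
      extractEntry(entry, zis);
    }
}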
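
Going one step beyond logging each line, the same stream-based approach can collect the names into a set. The following is a minimal sketch under stated assumptions: the class `TldZipReader`, the method `readTlds`, and the in-memory demo zip are hypothetical and only illustrate the idea. It is not the fix that was eventually committed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of crawler4j.
public class TldZipReader {

    /** Reads the named entry from a zip stream into a set, skipping blanks and "//" comments. */
    static Set<String> readTlds(InputStream zipStream, String entryName) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entry.getName().equals(entryName)) {
                    continue;
                }
                Set<String> tlds = new HashSet<>();
                // The reader stops at the end of the current entry; zis is closed by the outer try.
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("//")) {
                        continue;
                    }
                    tlds.add(line);
                }
                return tlds;
            }
        }
        throw new FileNotFoundException(entryName + " not found in zip stream");
    }

    /** Builds a tiny zip in memory so the example runs without any external file. */
    static byte[] demoZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("tld-names.txt"));
            out.write("// comment line\ncom\n\norg\n".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Set<String> tlds = readTlds(new ByteArrayInputStream(demoZip()), "tld-names.txt");
        System.out.println(tlds.size() + " names loaded");
    }
}
```

In real use the `ByteArrayInputStream` would be replaced by the stream returned from the class loader for the bundled zip.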

Original issue reported on code.google.com by florin.a...@gmail.com on 29 Dec 2014 at 6:19

GoogleCodeExporter commented 9 years ago
Thank you,

I will check it out as soon as I can (might take a day or two) and submit a new 
patch.

Original comment by avrah...@gmail.com on 29 Dec 2014 at 8:17

GoogleCodeExporter commented 9 years ago
Hi
I am facing the same issue. Are there any updates on it?

Thank You
Parth Patel

Original comment by optimus....@gmail.com on 13 Jan 2015 at 6:24

GoogleCodeExporter commented 9 years ago
Hey

Is the problem solved? My project deadline is near. Please do something. 

Waiting for a prompt reply.

Parth Patel

Original comment by optimus....@gmail.com on 19 Jan 2015 at 11:30

GoogleCodeExporter commented 9 years ago
Fixed in Revision: b393d121fbbf

Simplified the code

Original comment by avrah...@gmail.com on 19 Jan 2015 at 4:11

GoogleCodeExporter commented 9 years ago
Hey
Thank you ... but where can I download the latest jar from?

Parth Patel

Original comment by optimus....@gmail.com on 19 Jan 2015 at 5:15

GoogleCodeExporter commented 9 years ago
Dear colleagues, I am trying to test crawler4j, but I get the same error
described above. I am using the Maven dependency from the Maven Central
repository, and it still does not work. Is there an updated version in Maven
Central? I'm using this configuration:
https://code.google.com/p/crawler4j/wiki/MavenConfig

Best regards.

Original comment by yhidalg...@gmail.com on 31 Jan 2015 at 10:49

GoogleCodeExporter commented 9 years ago
Sorry, known bug...

Until we fix it in the Maven repo, you can use the jar file (instead of
Maven) from here:
https://groups.google.com/forum/#!topic/crawler4j/M-b3nl6eUA0

Original comment by avrah...@gmail.com on 2 Feb 2015 at 9:49

GoogleCodeExporter commented 9 years ago
Thanks for your reply. I will try to work with the jar file.

Original comment by yhidalg...@gmail.com on 5 Feb 2015 at 7:11