crawler-commons / crawler-commons

A set of reusable Java components that implement functionality common to any web crawler
Apache License 2.0

Possible Defect: BufferedReader(new InputStreamReader(effective_tld_data_stream)); #2

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
{code}
java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code 
point was found in the input��krehamn
{code}

At the moment we have to pass -Dfile.encoding=UTF8 explicitly as a JVM startup 
parameter. This must be documented.

The problem currently occurs in our Bamboo environment, where we have not 
managed to set this parameter for the JUnit test run.

See the last line of this excerpt:
{code}
EffectiveTldFinder
... 
    public boolean initialize(InputStream effective_tld_data_stream) {
        domains = new HashMap<String, EffectiveTLD>();
        try {
            if (null == effective_tld_data_stream && null != this.getClass().getResource(ETLD_DATA)) {
                effective_tld_data_stream = this.getClass().getResourceAsStream(ETLD_DATA);
            }
            BufferedReader input = new BufferedReader(new InputStreamReader(effective_tld_data_stream));
{code}

It reads "/effective_tld_names.dat" using the platform default charset instead of an explicitly specified one.
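
A minimal sketch of the difference, not the project's actual fix: it decodes the same bytes once as US-ASCII (standing in for a JVM whose default charset cannot represent the data) and once through an InputStreamReader that is given UTF-8 explicitly. The class name, the helper method, and the sample entry "åkrehamn" are assumptions made for illustration, and the sketch assumes the data file is UTF-8 encoded. It uses StandardCharsets and UncheckedIOException, so it needs Java 8 or later; on older JVMs Charset.forName("UTF-8") would take the place of StandardCharsets.UTF_8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of crawler-commons.
class ExplicitCharsetSketch {

    // Reads the first line of the stream, always decoding it as UTF-8,
    // regardless of the JVM's file.encoding setting.
    static String firstLineUtf8(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample entry; "\u00e5" is the letter a-with-ring.
        String entry = "\u00e5krehamn";
        byte[] utf8 = entry.getBytes(StandardCharsets.UTF_8);

        // Stand-in for a non-UTF-8 default charset: the two bytes that
        // encode the first letter each decode to U+FFFD.
        String misread = new String(utf8, StandardCharsets.US_ASCII);
        System.out.println("misread starts with U+FFFD: "
                + (misread.charAt(0) == '\uFFFD'));

        // With the charset given explicitly, the entry round-trips intact.
        String decoded = firstLineUtf8(new ByteArrayInputStream(utf8));
        System.out.println("decoded equals entry: " + decoded.equals(entry));
    }
}
```

With the charset passed to InputStreamReader, the result no longer depends on -Dfile.encoding or on how the build server launches the JVM.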

Original issue reported on code.google.com by fefe...@outsideiq.com on 10 Jun 2011 at 3:43

GoogleCodeExporter commented 9 years ago
I fixed it by explicitly configuring the Maven Surefire plugin:

                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.5</version>
                <configuration>
                    <forkMode>always</forkMode>
                    <argLine>-Dfile.encoding=UTF8 -Xmx256m ${argLine}</argLine>
                </configuration>
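
One way to check whether such a setting actually reaches the forked test JVM is to print the charset the JVM reports. This is only a sketch; the class name is hypothetical and is not part of the project.

```java
import java.nio.charset.Charset;

// Hypothetical helper for inspecting the effective default charset.
class DefaultCharsetCheck {

    // Name of the charset the JVM uses when no charset is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The raw system property, as set by -Dfile.encoding or the platform.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // The charset that readers without an explicit charset will use.
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

Running it once with and once without -Dfile.encoding=UTF8 shows whether the option changes the reported default on a given machine.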

Original comment by fefe...@outsideiq.com on 10 Jun 2011 at 5:10

GoogleCodeExporter commented 9 years ago
In addition to the encoding issue (which I reported in a separate email 
post):

When I import the project into Eclipse (Juno release with the M2E plugin) as 
an existing Maven project, it automatically enables "enable project specific 
configuration" and sets "Java version: 1.5", although my default setting is 
Java 1.6. I fixed that by changing the Maven POM.

Please add this to the Maven POM (and ignore my previous suggestion):
==============================================================

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <compilerVersion>1.6</compilerVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>

Thanks,
Fuad Efendi

Original comment by fuad.efe...@tokenizer.ca on 6 Oct 2012 at 2:07

GoogleCodeExporter commented 9 years ago
Note that build.properties contains 
build.encoding=ISO-8859-1

However, the Maven build runs fine with UTF-8 in my environment. Thanks.

Original comment by fuad.efe...@tokenizer.ca on 6 Oct 2012 at 2:14

GoogleCodeExporter commented 9 years ago
Note that the build is done with Ant plus the Maven Ant tasks, i.e. 'ant test'. 
The pom.xml is used merely for managing dependencies and publishing the 
artefacts. You can use 'ant eclipse' to generate the .classpath, .project and 
.settings files automatically.

Original comment by digitalpebble on 5 Nov 2012 at 3:39

GoogleCodeExporter commented 9 years ago
As Julien noted, the build is done using ant (with Maven ant tasks). If you 
want to be able to build the project via Maven, then please file a separate 
issue with a specific patch for updating the pom.xml file. Also note that 
keeping the pom.xml up-to-date will be something outside the scope of the 
project maintainers.

Original comment by kkrugler...@transpac.com on 6 Nov 2012 at 4:40

GoogleCodeExporter commented 9 years ago
"keeping the pom.xml up-to-date" sounds unclear to me. I think we do need to 
keep the POM up to date, especially if we publish it in many different places. 
Do we publish the Ant build script and its properties file in the Maven 
repository?

Thanks

Original comment by fuad.efe...@tokenizer.ca on 6 Nov 2012 at 2:30

GoogleCodeExporter commented 9 years ago
FYI, "ant test" didn't fail in my patched environment. I would like to 
understand why we need Ant. Is it for historical reasons?

Original comment by fuad.efe...@tokenizer.ca on 6 Nov 2012 at 2:38

GoogleCodeExporter commented 9 years ago
FYI: "ant test" didn't fail in a _freshly_ checked-out SVN version either.

Original comment by fuad.efe...@tokenizer.ca on 6 Nov 2012 at 2:46