Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

CsvSplitter and complex CSV input #177

Closed: jetnet closed this issue 8 years ago

jetnet commented 8 years ago

Hi Norconex team,

could you please take a look at the following issue? Input CSV (exported from an Excel file):

id;title;text
1;"Epoch & Unix Timestamp Conversion Tools ""Time converter""";"Convert epoch to human readable date and vice versa
Epoch dates for the start and end of the year/month/day
Convert seconds to days, hours and minutes"
2;"Converters - ""Epoch converter; Batch converter""";"What is epoch time?
The Unix epoch (or Unix time or POSIX time or Unix timestamp) is the number of seconds that have elapsed since January 1, 1970 (midnight UTC/GMT),
not counting leap seconds (in ISO 8601: 1970-01-01T00:00:00Z). Literally speaking the epoch is Unix time 0 (midnight 1/1/1970),
but 'epoch' is often used as a synonym for 'Unix time'. Many Unix systems store epoch dates as a signed 32-bit integer,
which might cause problems on January 19, 2038 (known as the Year 2038 problem or Y2038)."

Importer configuration (2.3.0-snapshot):

      <importer>
          <preParseHandlers>
            <splitter class="com.norconex.importer.handler.splitter.impl.CsvSplitter"
                        separatorCharacter=";"
                        quoteCharacter="&quot;"
                        useFirstRowAsFields="true"
                        linesToSkip="0"
                        referenceColumn="id"
                        contentColumns="text" >
            </splitter>
        </preParseHandlers>
      </importer>

processing:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://localhost:8080/test.csv
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://localhost:8080/test.csv
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://localhost:8080/test.csv
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-1
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-1
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-2
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-2
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-3
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-3
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-4
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-4
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-5
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-5
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-6
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-6
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-7
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-7
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-8
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-8
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://localhost:8080/test.csv!row-9
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://localhost:8080/test.csv!row-9

metadata from one of the resulting documents:

column3="Convert epoch to human readable date and vice versa
column2="Epoch & Unix Timestamp Conversion Tools ""Time converter"""
column1=1
document.embedded.reference=row-2
document.embedded.parent.reference=http\://localhost\:8080/test.csv
document.reference=http\://localhost\:8080/test.csv\!row-2

So, now the questions are coming :) 1) Why does the CSV splitter not recognize the quoted fields and multi-line values? 2) Why does it not recognize the column names from the first row (useFirstRowAsFields=true)?
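
For reference, this is how the sample is expected to parse with those separator, quote, and header settings under any RFC 4180-style parser. Below is a minimal standalone sketch using Apache Commons CSV, purely for illustration; it is not a claim about how CsvSplitter is implemented, and the class name and inlined sample data are just for the demo:

// Illustration only (not CsvSplitter's implementation): parsing the sample
// with Apache Commons CSV using the same separator/quote/header settings.
import java.io.Reader;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvQuotingDemo {
    public static void main(String[] args) throws Exception {
        // First record of the sample above: semicolon-separated, double-quoted,
        // with an escaped quote ("") and embedded line breaks.
        String csv = "id;title;text\n"
                + "1;\"Epoch & Unix Timestamp Conversion Tools \"\"Time converter\"\"\";"
                + "\"Convert epoch to human readable date and vice versa\n"
                + "Epoch dates for the start and end of the year/month/day\n"
                + "Convert seconds to days, hours and minutes\"\n";

        CSVFormat format = CSVFormat.DEFAULT
                .withDelimiter(';')          // separatorCharacter=";"
                .withQuote('"')              // quoteCharacter="&quot;"
                .withFirstRecordAsHeader();  // useFirstRowAsFields="true"

        try (Reader in = new StringReader(csv);
             CSVParser parser = new CSVParser(in, format)) {
            for (CSVRecord record : parser) {
                // Expected: a single record whose "text" value keeps its line breaks
                System.out.println("id    = " + record.get("id"));
                System.out.println("title = " + record.get("title"));
                System.out.println("text  = " + record.get("text"));
            }
        }
    }
}

With those settings, a parser should return one record per data row, with the doubled quotes collapsed to literal quotes and the multi-line text value kept intact.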

Thank you!

essiembre commented 8 years ago

I tried reproducing with your exact config above and your exact CSV file content, and it works just fine for me. I get two documents and this as the metadata for the first document (first row of data):

[id]
1

[text]
Convert epoch to human readable date and vice versa
Epoch dates for the start and end of the year/month/day
Convert seconds to days, hours and minutes

[title]
Epoch & Unix Timestamp Conversion Tools "Time converter"

[document.embedded.reference]
1

[document.embedded.parent.reference]
http://example.com/blah.csv

[document.reference]
http://example.com/blah.csv!1

[Content-Encoding]
ISO-8859-1

[document.contentFamily]
spreadsheet

[Content-Type]
text/plain; charset=ISO-8859-1

[X-Parsed-By]
org.apache.tika.parser.DefaultParser

[document.contentType]
text/csv

[document.embedded.parent.root.reference]
http://example.com/blah.csv

For the content I get this:

Convert epoch to human readable date and vice versa
Epoch dates for the start and end of the year/month/day
Convert seconds to days, hours and minutes

Maybe if you paste your full config I may see something wrong with it.

jetnet commented 8 years ago

hi Pascal,

hm... I still get 9 docs from the CSV. My config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="HTTPCollector">

  <progressDir>./examples-output/minimum-intranet/progress</progressDir>
  <logsDir>./examples-output/minimum-intranet/logs</logsDir>

  <crawlerDefaults>
    <userAgent>UA</userAgent>
    <workDir>./examples-output/minimum-intranet</workDir>
    <numThreads>1</numThreads>
    <maxDepth>10</maxDepth>
    <sitemap ignore="true" />
    <delay default="0" />
    <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude" >
          jpg,gif,png,ico,css,js
        </filter>
    </referenceFilters>
    <robotsTxt ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>
    <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum-intranet/crawledFiles</directory>
    </committer>
  </crawlerDefaults>

  <crawlers>
    <crawler id="test">
      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <cookiesDisabled>false</cookiesDisabled>
      </httpClientFactory>
      <startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="true">
        <url>http://localhost:8080/test.csv</url>
      </startURLs>
      <maxDocuments>100</maxDocuments>
      <importer>
          <preParseHandlers>
            <splitter class="com.norconex.importer.handler.splitter.impl.CsvSplitter"
                        separatorCharacter=";"
                        quoteCharacter="&quot;"
                        useFirstRowAsFields="true"
                        linesToSkip="0"
                        referenceColumn="id"
                        contentColumns="text" >
            </splitter>
        </preParseHandlers>
      </importer>
    </crawler>
  </crawlers>
</httpcollector>

essiembre commented 8 years ago

I was able to reproduce with your full config. Your file is being split properly, but several additional files are created, and those are the ones you should eliminate.

Nested splitting: One set of files gets created because the CsvSplitter tries to split every file that has content. This means the first two split records are being split again, and the output is obviously bad since they were not meant to be split. To get around this, you have to tell the splitter to only process the CSV files.

Saving of parent file: The original .csv file is also being saved since you do not specify anywhere that you do not want to keep it. You can easily get rid of it by rejecting files ending with .csv.

The two "fixes" can be applied like this:

      <importer>
          <preParseHandlers>
            <splitter class="com.norconex.importer.handler.splitter.impl.CsvSplitter"
                        separatorCharacter=";"
                        quoteCharacter="&quot;"
                        useFirstRowAsFields="true"
                        linesToSkip="0"
                        referenceColumn="id"
                        contentColumns="text" 
                        >
                <!-- This ensures only .csv files get split (their split children are excluded) -->
                <restrictTo field="document.reference">.*\.csv$</restrictTo>
            </splitter>

            <!-- Do not save parent .csv file, reject it -->
            <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                  onMatch="exclude" field="document.reference" >.*\.csv$</filter>
        </preParseHandlers>
      </importer>

The above should make it work for you. Please confirm.

jetnet commented 8 years ago

Hi Pascal,

thank you for the explanation, that makes sense :) The parameters did not help with the Norconex libs from last month, but after updating to the latest release "2015-11-06" it works just perfectly.

Thank you for the great support!

essiembre commented 8 years ago

You are very welcome! Glad it works for you.