Closed: jetnet closed this issue 8 years ago
I tried reproducing with your exact config above and your exact csv file content, and it works just fine for me. I get two documents and this as the metadata for the first document (first row of data):
[id]
1
[text]
Convert epoch to human readable date and vice versa
Epoch dates for the start and end of the year/month/day
Convert seconds to days, hours and minutes
[title]
Epoch & Unix Timestamp Conversion Tools "Time converter"
[document.embedded.reference]
1
[document.embedded.parent.reference]
http://example.com/blah.csv
[document.reference]
http://example.com/blah.csv!1
[Content-Encoding]
ISO-8859-1
[document.contentFamily]
spreadsheet
[Content-Type]
text/plain; charset=ISO-8859-1
[X-Parsed-By]
org.apache.tika.parser.DefaultParser
[document.contentType]
text/csv
[document.embedded.parent.root.reference]
http://example.com/blah.csv
For the content I get this:
Convert epoch to human readable date and vice versa
Epoch dates for the start and end of the year/month/day
Convert seconds to days, hours and minutes
If you paste your full config, I may be able to spot something wrong with it.
hi Pascal,
hm... I still get 9 docs from the CSV, my config:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="HTTPCollector">
<progressDir>./examples-output/minimum-intranet/progress</progressDir>
<logsDir>./examples-output/minimum-intranet/logs</logsDir>
<crawlerDefaults>
<userAgent>UA</userAgent>
<workDir>./examples-output/minimum-intranet</workDir>
<numThreads>1</numThreads>
<maxDepth>10</maxDepth>
<sitemap ignore="true" />
<delay default="0" />
<referenceFilters>
<filter
class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
onMatch="exclude" >
jpg,gif,png,ico,css,js
</filter>
</referenceFilters>
<robotsTxt ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./examples-output/minimum-intranet/crawledFiles</directory>
</committer>
</crawlerDefaults>
<crawlers>
<crawler id="test">
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
<trustAllSSLCertificates>true</trustAllSSLCertificates>
<cookiesDisabled>false</cookiesDisabled>
</httpClientFactory>
<startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="true">
<url>http://localhost:8080/test.csv</url>
</startURLs>
<maxDocuments>100</maxDocuments>
<importer>
<preParseHandlers>
<splitter class="com.norconex.importer.handler.splitter.impl.CsvSplitter"
separatorCharacter=";"
quoteCharacter="&quot;"
useFirstRowAsFields="true"
linesToSkip="0"
referenceColumn="id"
contentColumns="text" >
</splitter>
</preParseHandlers>
</importer>
</crawler>
</crawlers>
</httpcollector>
I was able to reproduce with your full config. Your file is being split properly, but several additional files are created, and those are the ones you should eliminate.
Nested splitting: One set of extra files is created because the CsvSplitter tries to split every file that has content. This means the first two split records are split again, and the output is bad since they were never meant to be split. To get around this, you have to tell the splitter to handle only the CSV files.
Saving of parent file: The original .csv file is also saved, because nothing in your config says you do not want to keep it. You can easily get rid of it by rejecting references ending with .csv.
The two "fixes" can be applied like this:
<importer>
<preParseHandlers>
<splitter class="com.norconex.importer.handler.splitter.impl.CsvSplitter"
separatorCharacter=";"
quoteCharacter="&quot;"
useFirstRowAsFields="true"
linesToSkip="0"
referenceColumn="id"
contentColumns="text"
>
<!-- This ensures only .csv files are split (their split children are excluded) -->
<restrictTo field="document.reference">.*\.csv$</restrictTo>
</splitter>
<!-- Do not save parent .csv file, reject it -->
<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
onMatch="exclude" field="document.reference" >.*\.csv$</filter>
</preParseHandlers>
</importer>
The above should make it work for you. Please confirm.
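For anyone wondering why the same pattern can both restrict the splitter and reject the parent, here is a minimal sketch in Python, not Norconex code. It only checks the regular expression from the config against the two kinds of reference shown in the metadata earlier in this thread: the parent URL and a child reference with the "!1" suffix. The function name is_parent_csv is made up for illustration, and using a whole-string match is an assumption about how the importer applies the pattern.

```python
import re

# Pattern copied from the restrictTo and filter elements in the config above.
CSV_REFERENCE = re.compile(r".*\.csv$")


def is_parent_csv(reference: str) -> bool:
    """Return True when the reference ends in .csv (hypothetical helper).

    Assumption: the pattern is applied to the whole reference string.
    """
    return CSV_REFERENCE.fullmatch(reference) is not None


# References as they appear in the metadata quoted earlier in the thread.
parent = "http://example.com/blah.csv"
child = "http://example.com/blah.csv!1"

print(is_parent_csv(parent))  # the parent file: split it, then reject it
print(is_parent_csv(child))   # a split row: left alone and kept
```

Because the child references carry a "!" suffix after ".csv", they never match, so they are neither split again nor rejected.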
Hi Pascal,
thank you for the explanation, that makes sense :) The parameters did not help with the Norconex libs from last month, but after updating to the latest release "2015-11-06" it works perfectly.
Thank you for the great support!
You are very welcome! Glad it works for you.
Hi Norconex team,
could you please take a look at the following issue:
Input CSV (exported from an Excel file):
Importer configuration (2.3.0-snapshot):
Processing:
Metadata from a result document:
So, now the questions are coming :)
1) Why does the CSV splitter not recognize the quoted fields and multi-line values?
2) Why does it not recognize the column names from the first row (useFirstRowAsFields=true)?
Thank you!
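As a point of comparison for the two questions, here is a rough sketch of what a quote-aware parse with ";" as separator and the first row as field names should produce. It uses Python's standard csv module, not the Norconex CsvSplitter. The sample input is hypothetical: the original file is not shown in this thread, so the first row is reconstructed from the metadata posted in the reply and the second row is invented.

```python
import csv
import io

# Hypothetical input: the real file is not shown in the thread. Row 1 is
# rebuilt from the metadata quoted in the reply; row 2 is made up.
SAMPLE = (
    'id;text;title\n'
    '1;"Convert epoch to human readable date and vice versa\n'
    'Epoch dates for the start and end of the year/month/day\n'
    'Convert seconds to days, hours and minutes";'
    '"Epoch & Unix Timestamp Conversion Tools ""Time converter"""\n'
    '2;"Second row";"Another title"\n'
)


def split_rows(data: str) -> list[dict[str, str]]:
    """Parse semicolon-separated, double-quoted CSV; first row gives field names."""
    reader = csv.DictReader(
        io.StringIO(data, newline=""), delimiter=";", quotechar='"'
    )
    return list(reader)


rows = split_rows(SAMPLE)
print(len(rows))         # one record per data row, not per physical line
print(rows[0]["text"])   # the multi-line value stays in a single field
print(rows[0]["title"])  # doubled quotes collapse to single quotes
```

With input shaped like this, the multi-line "text" value stays inside one record, which matches the two documents reported in the reply.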