Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Is there any processing method to exclude <rPh> Tag from sharedStrings.xml in Crawled xlsx File #74

Open ki-suzuki opened 11 months ago

ki-suzuki commented 11 months ago

The sharedStrings.xml file inside the xlsx file I am crawling contains the following.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="8" uniqueCount="8"><si><t>月日</t><rPh sb="0" eb="2"><t>ガッピ</t></rPh><phoneticPr fontId="2"/></si><si><t>会社名</t><rPh sb="0" eb="3"><t>カイシャメイ</t></rPh><phoneticPr fontId="2"/></si><si><t>金額</t><rPh sb="0" eb="2"><t>キンガク</t></rPh><phoneticPr fontId="2"/></si><si><t>支払日</t><rPh sb="0" eb="3"><t>シハライビ</t></rPh><phoneticPr fontId="2"/></si><si><t>締日</t><rPh sb="0" eb="2"><t>シメビ</t></rPh><phoneticPr fontId="2"/></si><si><t>S社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si><si><t>A社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si><si><t>B社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si></sst>

What I ultimately want is the content without the <rPh> tags. In other words, I want to remove elements such as <rPh sb="0" eb="2"><t>キンガク</t></rPh>. Is there a processing method that can do this? I would greatly appreciate any help.

sakanaosama commented 10 months ago

Once a document has been imported and its content extracted, all tags have been removed. To exclude specific content based on these tags, you need to work at the 'preParseHandlers' level of the Importer configuration, where the tags are still present because extraction has not happened yet. More information about this configuration is in the documentation at https://opensource.norconex.com/importer/v2/configuration#tbl-transformer. You can do this with the 'ReduceConsecutivesTransformer' or with a custom script using the 'ScriptTransformer'.
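
For illustration only, here is a minimal Python sketch of the text substitution that a pre-parse handler or custom script would need to perform. It is not Norconex code and does not use the Importer API. The function name `strip_phonetic_runs` and the pattern are hypothetical, and the sketch assumes the sharedStrings.xml text is already available as a string (an xlsx file is a zip archive, so the XML is not directly visible in the raw file bytes). The sample is one `<si>` entry taken from the XML posted above.

```python
import re

# Hypothetical pattern: matches one whole <rPh ...>...</rPh> element.
# The non-greedy ".*?" stops at the first closing tag, which is enough
# here because <rPh> elements are not nested inside each other.
RPH_PATTERN = re.compile(r"<rPh\b[^>]*>.*?</rPh>", re.DOTALL)


def strip_phonetic_runs(shared_strings_xml: str) -> str:
    """Return the XML text with every <rPh> element removed."""
    return RPH_PATTERN.sub("", shared_strings_xml)


# One shared-string entry copied from the sample in this thread.
sample = (
    '<si><t>金額</t><rPh sb="0" eb="2"><t>キンガク</t></rPh>'
    '<phoneticPr fontId="2"/></si>'
)

cleaned = strip_phonetic_runs(sample)
print(cleaned)  # <si><t>金額</t><phoneticPr fontId="2"/></si>
```

The same regular expression could in principle be reused in whatever scripting language the handler supports, provided the handler really does receive the XML text rather than the zipped xlsx bytes.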