Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Contents under zip are marked as deleted on the second run, even if the contents are present in that zip. #24

Closed jayjamba closed 4 years ago

jayjamba commented 6 years ago

Hi, I split the XML files generated for a zip file using `<splitContentTypes>application/zip</splitContentTypes>`. However, when I ran the program again with the zip file already present, it produced deletion XMLs for the contents inside the zip. Can we prevent this behavior somehow?

Eg: sample.zip --> contains a.txt, b.doc, c.pdf

1st run : it produced 4 xmls, 1) for sample.zip, (add) 2) for sample.zip!a.txt (add) 3) for sample.zip!b.doc (add) 4) for sample.zip!c.pdf (add)

2nd run : it produced 3 xmls, 1) for sample.zip!a.txt (del) 2) for sample.zip!b.doc (del) 3) for sample.zip!c.pdf (del)

I am not sure why it produced deletion XMLs for the contents of the zip, which were not actually deleted; the zip was still residing in the same folder.

essiembre commented 6 years ago

A few questions to help identify the issue. On the second run, is the zip file marked as unmodified? Are the deleted entries being added later in the same crawl? What version of the Filesystem Collector are you using? Can you share a copy of your config?

jayjamba commented 6 years ago

Hi, I am using the latest version of the Filesystem Collector, i.e. 2.8.0. On the second run, 3 XMLs are produced whose file names are prefixed with 'del', as I mentioned in my first post, referencing sample.zip!a.txt, sample.zip!b.doc, and sample.zip!c.pdf. Below are the logs for your reference:

INFO  - REJECTED_NOTFOUND          -         REJECTED_NOTFOUND: smb://localhost/examples/files/sample.zip!a.txt
INFO  - REJECTED_NOTFOUND          -         REJECTED_NOTFOUND: smb://localhost/examples/files/sample.zip!b.doc
INFO  - XMLFileCommitter           - Creating XML File: d:\fileoutput-1\crawledFiles\del-2018-02-12T11-14-25-737_1.xml
INFO  - XMLFileCommitter           - Creating XML File: d:\fileoutput-1\crawledFiles\del-2018-02-12T11-14-25-737_2.xml
INFO  - DOCUMENT_COMMITTED_REMOVE  - DOCUMENT_COMMITTED_REMOVE: smb://localhost/examples/files/sample.zip!a.txt
INFO  - DOCUMENT_COMMITTED_REMOVE  - DOCUMENT_COMMITTED_REMOVE: smb://localhost/examples/files/sample.zip!b.doc
INFO  - REJECTED_NOTFOUND          -         REJECTED_NOTFOUND: smb://localhost/examples/files/sample.zip!c.pdf
INFO  - XMLFileCommitter           - Creating XML File: d:\fileoutput-1\crawledFiles\del-2018-02-12T11-14-25-737_3.xml
INFO  - DOCUMENT_COMMITTED_REMOVE  - DOCUMENT_COMMITTED_REMOVE: smb://localhost/examples/files/sample.zip!c.pdf

Regarding the zip file: yes, it is marked as unmodified and rejected. Below are the logs for your reference:

INFO  - DOCUMENT_METADATA_FETCHED  - DOCUMENT_METADATA_FETCHED: smb://localhost/examples/files/sample.zip
INFO  - REJECTED_UNMODIFIED        -       REJECTED_UNMODIFIED: smb://localhost/examples/files/sample.zip

Below is a copy of my config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<fscollector id="Text Files">

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>
    <crawler id="Sample Crawler">

      <workDir>${workdir}</workDir>

      <startPaths>
        <path>${path}</path>
      </startPaths>

      <numThreads>2</numThreads>

      <keepDownloads>false</keepDownloads>

      <importer>
            <postParseHandlers>
                <!-- below tagger would add <meta name="title"> element -->
                <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger" toField="title" titleMaxLength="200" detectHeading="true" />

                <!-- below tagger would add <meta name="document.language"> element, defaulting to "en" if no language is detected.
                It's not necessary to use this tagger for office files because a language tag is already produced for those files. -->
                <tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger" fallbackLanguage="en" />
            </postParseHandlers>

            <!-- Below will generate separate document for each document embedded inside zip -->
            <documentParserFactory>
                <embedded>
                    <splitContentTypes>application/zip</splitContentTypes>
                    <splitContentTypes>application/gzip</splitContentTypes>
                </embedded>
            </documentParserFactory>
     </importer>

     <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
            <directory>${outputdir}</directory>
            <pretty>true</pretty>
            <docsPerFile>1</docsPerFile>
            <compress>false</compress>
            <splitAddDelete>true</splitAddDelete>
     </committer>

      <crawlDataStoreFactory class="com.norconex.collector.core.data.store.impl.jdbc.BasicJDBCCrawlDataStoreFactory"/>

      <optionsProvider class="com.norconex.collector.fs.option.impl.GenericFilesystemOptionsProvider">
           <authDomain>${domainName}</authDomain>
           <authUsername>${username}</authUsername>
           <authPassword>${password}</authPassword>
      </optionsProvider>
    </crawler>
  </crawlers>

</fscollector>

essiembre commented 6 years ago

When the zip is considered unmodified, it is not parsed again and its embedded children are considered "orphans". By default, orphan files are processed again. I'll mark this as a bug, since the collector should know that if the parent file has not changed, the children remain unchanged as well.

Until that is fixed, there are a few workarounds. You can prevent "orphans" from being re-processed by adding this to your crawler config:

  <orphansStrategy>IGNORE</orphansStrategy>

You can also disable checksum creation so that your files are always re-processed:

<metadataChecksummer disabled="true" />
<documentChecksummer disabled="true" />

jayjamba commented 6 years ago

Will there be any other impact if I add <orphansStrategy>IGNORE</orphansStrategy> to my config? I am not sure what "orphans" means in the context of the Filesystem Collector. In the case of the HTTP Collector, it is clear: dead URLs that no longer refer to any resource are considered orphans. But I am not sure what it means for the Filesystem Collector.

essiembre commented 6 years ago

In the case of the FS Collector, when it scans the directories for files, a file that is no longer listed is considered an orphan. So if you use IGNORE, you will likely miss deletions. If you do not have any, fine. Otherwise, this is probably not ideal and you should use the second solution.

You only need to disable the metadataChecksummer, which relies on the modified date. This will force each file to be re-parsed, and it will use the documentChecksummer instead to figure out whether the file changed. At that point, the children will have been processed naturally and you should not get the deletions.

Not ideal if you want to avoid parsing unchanged files again, but it will otherwise work around the issue.

jayjamba commented 6 years ago

Providing just <metadataChecksummer disabled="true" /> still creates XMLs for the files residing in the zip on the second run, even if the zip is not modified :(.

essiembre commented 6 years ago

Yes, that is the expected behavior for the workaround, until a fix is made for your original issue.

jayjamba commented 6 years ago

Unfortunately I will have to wait for the fix, because XMLs being created again on the second run is not acceptable in our case.

essiembre commented 6 years ago

You may want to have a look at the documentChecksummer as well. Since that one runs after parsing, you will not save the parsing, but it should prevent a document from being written to an XML if it is unmodified. Make sure you clear your "workdir" before you try.

jayjamba commented 6 years ago

I did have a look at it and wrote my own DocumentCheckSummer class, in which I always generate a checksum for the document even if the disabled flag is true, but it still writes the XML for the files residing in the zip, even if they are not modified. Could you please elaborate on what exactly I need to do in it?

essiembre commented 6 years ago

With my first advice with the disabled flag being true, no checksummer will be triggered so your files will always appear new/modified. This is a workaround to not have them marked for deletion.

My second advice is to only have the metadata checksummer disabled, but keep the document checksummer enabled (it is by default). That way it will process/parse your zip again, with its embedded documents, but will conclude they have not changed and they will not be sent to your committer.
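As a rough sketch of that second suggestion, the fragment below shows only the metadata checksummer disabled inside the crawler from the config posted earlier in this thread. Placing the element directly under `<crawler>` is an assumption; the element itself is the one given above, and the document checksummer is simply not declared so that it stays at its default.

```xml
<crawler id="Sample Crawler">

  <!-- Other crawler settings (startPaths, importer, committer, ...) stay as they are. -->

  <!-- Disable only the metadata checksummer, so the zip is parsed again on every run.
       Placement directly under <crawler> is an assumption. -->
  <metadataChecksummer disabled="true" />

  <!-- No <documentChecksummer> element here: it is enabled by default and is what
       decides, after parsing, whether a document is sent to the committer. -->

</crawler>
```

The trade-off is the one described above: every run pays the cost of parsing the zip again, in exchange for its embedded documents no longer being treated as orphans.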

jayjamba commented 6 years ago

I followed your second advice by disabling only the metadata checksummer with <metadataChecksummer disabled="true" />. However, as I mentioned earlier, it still creates XMLs for the files residing in the zip on the second run, even if the zip is not modified :(.

So I was not sure what I should do in the document checksummer, if I override it, so that those documents are not sent to the committer.

essiembre commented 6 years ago

I was able to spend more time on this and it turns out to be a bug. The good news is a fix is already available in the latest snapshot.

I tested your scenario with:

<metadataChecksummer disabled="true" />

And it works well with that fix. Please confirm.

I am not marking it as resolved yet since the original issue is not yet resolved (the wrongful deletions when metadataChecksummer is enabled).

jayjamba commented 6 years ago

Hi, I tested it with the latest snapshot, 2.8.1-SNAPSHOT, and it works like a charm :). Any plans to release this snapshot soon?

essiembre commented 6 years ago

There is no defined date for releases. I personally would like to have the fix for your original issue in before a formal release.

jayjamba commented 6 years ago

Hi, can we have a release (2.8.1) for this issue?

essiembre commented 4 years ago

Released.