Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

File name having hash character ('#') in it is not crawled #47

Closed jayjamba closed 4 years ago

jayjamba commented 5 years ago

Hi, I put a simple text file with a hash (#) in its name, such as #1.txt, in a shared folder. When I try to crawl it using the path smb://localhost/shared/test, it is not crawled. When I crawl #1.txt using an absolute file path like "d:/shared/test", it is crawled fine.

Also, when I rename it to 1.txt, the file is discovered and crawled fine, with and without smb.

Is there an issue with the # character in file names and the smb protocol?

jayjamba commented 5 years ago

Hi,

To add to this issue: I tried to access this file using VFS, and the final System.out.println call prints true.

StaticUserAuthenticator auth = new StaticUserAuthenticator("localhost", "myusername", "mypassword");
FileSystemOptions opts = new FileSystemOptions();
DefaultFileSystemConfigBuilder.getInstance().setUserAuthenticator(opts, auth);
FileObject fo = VFS.getManager().resolveFile("smb://localhost/shared/test/#1.txt", opts);
System.out.println(fo.exists());

essiembre commented 5 years ago

It turns out the full path was obtained as a Java URL, and the URL class treats the pound sign as the start of a fragment, so the path loses the pound sign and everything after it. I made a fix, and it is in the latest snapshot. Please give it a try and confirm.
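For readers who want to see the behavior in isolation, here is a minimal sketch using only java.net.URL. The class name and the file: URL are hypothetical and chosen for illustration; the collector's actual code path is not shown in this thread.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch: java.net.URL treats '#' as the start of a fragment ("ref"),
// so the path it reports stops before the '#'.
final class UrlFragmentDemo {

    // Returns the path component as java.net.URL sees it.
    static String pathOf(String spec) {
        try {
            return new URL(spec).getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the fragment (the text after '#'), or null if there is none.
    static String refOf(String spec) {
        try {
            return new URL(spec).getRef();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String spec = "file:///shared/test/#1.txt"; // hypothetical location
        System.out.println("path = " + pathOf(spec));
        System.out.println("ref  = " + refOf(spec));
    }
}
```

The file name ends up in the fragment rather than in the path, which would explain why a crawler that relies on the path alone never sees #1.txt.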

jayjamba commented 5 years ago

It's picking up the file with the latest snapshot, but one strange thing I observed is that it's trimming the last character of the file content. For example, if the file contains "hello world", my output XML file just gives "hello worl".

jayjamba commented 5 years ago

1.txt The last-character problem is present in version 2.8.0 too; I just noticed it now. I have attached the sample file that is being crawled.

essiembre commented 5 years ago

I was able to reproduce. Will fix.

essiembre commented 5 years ago

A new snapshot was just created that fixes the character truncation. Please give it a try and confirm.

jayjamba commented 5 years ago

Hi, I tried the latest snapshot, but this time it returned unreadable characters in the content tag. I have attached the output XML. The input file is the same one I attached last time, with 'hello world' as its content. crawledFiles.zip

essiembre commented 5 years ago

Looks like a character encoding detection issue. In your case it gets detected as IBM500, and since that's probably not the right charset, the content is read incorrectly.

You can test that on this site: http://string-functions.com/encodedecode.aspx Type "hello world" with encode UTF-8 and decode IBM500. You should get the same funky characters.
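The same experiment can be sketched locally in Java instead of on a website. The class and method names below are hypothetical. The sketch also assumes the runtime ships the IBM500 charset, which is optional in Java, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: reproduce a charset mismatch by decoding UTF-8 bytes
// with a different charset than the one used to encode them.
final class CharsetMismatchDemo {

    // Encodes text as UTF-8, then decodes those bytes with another charset.
    static String misdecode(String text, String decodeCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(decodeCharset));
    }

    public static void main(String[] args) {
        // IBM500 is an optional charset; not every Java runtime provides it.
        if (!Charset.isSupported("IBM500")) {
            System.out.println("IBM500 is not available in this runtime");
            return;
        }
        System.out.println(misdecode("hello world", "IBM500"));
    }
}
```

If IBM500 is available, the printed text should no longer read "hello world", which matches the symptom of a wrongly detected encoding.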

It may happen because there is not enough text in your test file to properly detect the encoding. Do you have this issue with larger files as well?

You can try to set the character encoding yourself as a pre-parse handler and see if it picks that up instead. You can do so with the ConstantTagger, like this:

  <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="replace" >
      <constant name="document.contentEncoding">UTF-8</constant>
  </tagger>

If the above does not work, you can try disabling parsing altogether if you are really dealing with simple unformatted text. You can also have a look at the CharsetTransformer in case it helps. Hopefully the first suggestion will work just fine.

jayjamba commented 5 years ago

Yeah, maybe that was a file issue. Anyway, it's now getting the last character as well. Any plans to release the fixes for these issues: 1) file names containing #, 2) last character of the file getting trimmed, 3) issue #24, 4) AbstractCollector#getState() in the latest collector-core snapshot?

truezjz commented 5 years ago

Hi Pascal,

I found an issue related to this one, so I am posting it here.

If a file name contains #, it is crawled fine. However, when I read the file from the committer, the # is converted to %23. For example, the original file is C:\Users\Administrator\Documents\works\Design_docs.#db_tables6.3.sql.1.10, and the reference becomes: file://C:/Users/Administrator/Documents/works/Design_docs/.%23db_tables6.3.sql.1.10

This causes a file-not-found error. I cannot simply replace %23 with #, because I cannot tell whether the original file name really contained %23.
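To illustrate the ambiguity concern, here is a minimal sketch of how standard percent-encoding in java.net.URI handles it. It is an assumption, not something confirmed in this thread, that the committed reference follows this encoding. The class name, method names, and Unix-style paths are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: with standard URI percent-encoding, a literal '%' is itself
// encoded as %25, so '#' and a literal "%23" yield different references.
final class FileReferenceRoundTrip {

    // Builds a file: reference; the multi-argument URI constructor
    // percent-encodes characters that are illegal in a path, including '#'.
    static String toReference(String absolutePath) {
        try {
            return new URI("file", null, absolutePath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Recovers the path; getPath() decodes percent-escapes exactly once.
    static String toPath(String reference) {
        return URI.create(reference).getPath();
    }

    public static void main(String[] args) {
        String withHash = "/docs/.#db_tables.sql";      // hypothetical name
        String withLiteral = "/docs/.%23db_tables.sql"; // hypothetical name
        System.out.println(toReference(withHash));
        System.out.println(toReference(withLiteral));
        System.out.println(toPath(toReference(withHash)).equals(withHash));
        System.out.println(toPath(toReference(withLiteral)).equals(withLiteral));
    }
}
```

Under that assumption, decoding the reference once with a URI-aware API is safer than a blind string replacement of %23. If the reference was produced some other way, this round trip may not hold.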

essiembre commented 5 years ago

A new snapshot version was just made, which will store local paths without encoding the #. Please give it a try and confirm.

essiembre commented 4 years ago

Fixes in this ticket are now in the official release.