Closed jayjamba closed 4 years ago
Hi,
To add to this issue, I tried to access this file using VFS, and the last System.out.println prints true:
StaticUserAuthenticator auth = new StaticUserAuthenticator("localhost", "myusername", "mypassword");
FileSystemOptions opts = new FileSystemOptions();
DefaultFileSystemConfigBuilder.getInstance().setUserAuthenticator(opts, auth);
FileObject fo = VFS.getManager().resolveFile("smb://localhost/shared/test/#1.txt", opts);
System.out.println(fo.exists());
It turns out the full path was obtained as a Java URL, and the URL class strips the pound sign and everything after it. I made a fix, and it is in the latest snapshot. Please give it a try and confirm.
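For readers who want to see the behaviour described above, here is a minimal sketch using only the JDK. It is not the actual fix made in the crawler. The class and method names are made up for illustration, and a file: URL stands in for the smb: one because java.net.URL has no built-in handler for smb. An unescaped # starts the fragment (the "ref"), so the path ends just before it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of the crawler.
public class HashInUrlDemo {

    // Builds a java.net.URL from the given string.
    static URL toUrl(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Path component as java.net.URL reports it.
    static String urlPath(String spec) {
        return toUrl(spec).getPath();
    }

    // Fragment ("ref") component as java.net.URL reports it, or null.
    static String urlRef(String spec) {
        return toUrl(spec).getRef();
    }

    public static void main(String[] args) {
        String raw = "file:///shared/test/#1.txt";
        System.out.println(urlPath(raw)); // /shared/test/
        System.out.println(urlRef(raw));  // 1.txt

        // Escaping the '#' as %23 keeps the file name inside the path.
        String escaped = "file:///shared/test/%231.txt";
        System.out.println(urlPath(escaped));              // /shared/test/%231.txt
        System.out.println(URI.create(escaped).getPath()); // /shared/test/#1.txt
    }
}
```

Note that URL.getPath() leaves %23 as-is, while URI.getPath() decodes it back to #.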
It's picking up the file with the latest snapshot, but one strange thing I observed is that it's trimming the last character of the file. For example, if the file content is "hello world", then my output XML file just gives "hello worl".
The last-character problem is probably there in version 2.8.0 too; I just noticed it now. I have attached the sample file that is being crawled: 1.txt
I was able to reproduce. Will fix.
A new snapshot was just created that fixes the character truncation. Please give it a try and confirm.
Hi, I tried with the latest snapshot, but this time it returned unreadable characters in the content tag. I have attached the output XML. The input file is the same one I attached last time, with 'hello world' as its content. crawledFiles.zip
Looks like a character encoding detection issue. In your case it gets detected as IBM500, and since that's probably not the right charset, the content is decoded incorrectly.
You can test that on this site: http://string-functions.com/encodedecode.aspx Type "hello world" with encode UTF-8 and decode IBM500. You should get the same funky characters.
It may happen because there is not enough text in your test file to properly detect the encoding. Do you have this issue with larger files as well?
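The same experiment can be sketched locally in Java instead of using the website. This is only an illustration of decoding UTF-8 bytes with the wrong charset, not what the importer does internally. The class and method names are made up, and it assumes the JDK in use ships the IBM500 charset (it lives in the extended charsets, so the sketch checks for it first).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the importer.
public class WrongCharsetDemo {

    // Encodes the text as UTF-8, then decodes the same bytes with another charset.
    static String misdecode(String text, String decodeCharset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, Charset.forName(decodeCharset));
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("IBM500")) {
            System.out.println("IBM500 is not available on this JVM");
            return;
        }
        String garbled = misdecode("hello world", "IBM500");
        // Print code points in hex, since many of the results are not printable.
        StringBuilder hex = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            hex.append(String.format("%04x ", (int) c));
        }
        System.out.println(hex.toString().trim());
        System.out.println(garbled.equals("hello world")); // false
    }
}
```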
You can try to set the character encoding yourself as a pre-parse handler and see if it picks that instead. You can do so with the ConstantTagger, like this:
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="replace" >
<constant name="document.contentEncoding">UTF-8</constant>
</tagger>
If the above does not work, you can try disabling parsing altogether if you are really dealing with simple unformatted text. You can also have a look at the CharsetTransformer, if that helps. Hopefully the first suggestion will work just fine.
Yeah, maybe that was a file issue. Anyway, it's now getting the last character as well. Are there any plans to release the fixes for these issues? 1) file name containing # 2) last character of file getting trimmed 3) issue #24 4) AbstractCollector#getState() in the latest collector-core snapshot
Hi Parscal,
I found an issue that is related to this one, so I am posting it here.
If a file name contains #, it crawls fine. However, when I read the file from the committer, the # is converted to %23. For example, the original file is C:\Users\Administrator\Documents\works\Design_docs.#db_tables6.3.sql.1.10, and the reference becomes: file://C:/Users/Administrator/Documents/works/Design_docs/.%23db_tables6.3.sql.1.10
This causes a file-not-found error. I cannot just replace the %23 with #, since I cannot determine whether the original file really has %23 in its name.
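The ambiguity described above can be sketched with the JDK alone. This is an illustration, not how the crawler builds its references. The class, method names, and sample paths are made up. If only # is escaped, a name containing # and a name containing a literal %23 end up with the same reference. If % is escaped too (as %25), decoding is unambiguous. The sketch relies on the multi-argument java.net.URI constructor, which quotes both characters in the path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of the crawler.
public class HashReferenceDemo {

    // Naive encoding: only '#' is escaped, a literal '%' is left alone.
    static String encodeHashOnly(String path) {
        return path.replace("#", "%23");
    }

    // Encoding through the 4-argument URI constructor (scheme, host, path, fragment),
    // which quotes '#' as %23 and '%' as %25 in the path.
    static String encodeFully(String path) {
        try {
            return new URI("file", null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes an escaped path back to the original file path.
    static String decode(String rawPath) {
        return URI.create("file:" + rawPath).getPath();
    }

    public static void main(String[] args) {
        String withHash = "/docs/.#db.sql";
        String withLiteralPercent = "/docs/.%23db.sql";

        // Both names collapse to the same reference: the original is unrecoverable.
        System.out.println(encodeHashOnly(withHash));           // /docs/.%23db.sql
        System.out.println(encodeHashOnly(withLiteralPercent)); // /docs/.%23db.sql

        // With '%' escaped as well, the two references differ and round-trip.
        System.out.println(encodeFully(withHash));                   // /docs/.%23db.sql
        System.out.println(encodeFully(withLiteralPercent));         // /docs/.%2523db.sql
        System.out.println(decode(encodeFully(withLiteralPercent))); // /docs/.%23db.sql
    }
}
```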
A new snapshot version was just made, which will store local paths without encoding the #. Please give it a try and confirm.
Fixes in this ticket are now in the official release.
Hi, I have a simple text file with a hash (#) in its name, like #1.txt. When I try to crawl it using the path smb://localhost/shared/test, it is not crawled. When I try to crawl #1.txt using an absolute file path like "d:/shared/test", it is crawled fine. Also, when I rename it to 1.txt, the file is discovered and crawled fine with and without smb.
Is there any issue with the # character in a file name and the smb protocol?