Vectors created on Windows are different from the vectors created on Linux (pitt.search.semanticvectors.LSA)

GoogleCodeExporter commented 9 years ago

Here are the steps to reproduce the problem:

1. Take a corpus.

2. Build the Lucene index of the corpus on a Windows machine 
(org.apache.lucene.demo.IndexFiles).
3. Build the .bin vectors using semanticvectors (pitt.search.semanticvectors.LSA).
4. Convert the vectors to text form (for a more human-readable version).

5. Build the Lucene index of the same corpus on a Linux machine 
(org.apache.lucene.demo.IndexFiles).
6. Build the .bin vectors using semanticvectors (pitt.search.semanticvectors.LSA).
7. Convert the vectors to text form (for a more human-readable version).

The output should be the same on both operating systems, but it is not: 
the vectors built on Windows are completely different from the vectors 
built on Linux.

I have tried:
semanticvectors 3.0 and 4.0.
Lucene 3.0.3, 3.6.x, and 4.3.1.

The key issue is that the Lucene index depends on the document order (a 
different order creates a different index), and the document order on Windows 
differs from the document order on Linux (maybe due to the file system organisation).

Now, even though the Lucene index is different, the semanticvectors output 
should be the same, because the semantics of the corpus are the same.

semanticvectors should not take into consideration the document order 
information inside the Lucene index.
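
A short derivation may make this expectation more precise. It assumes (this is not confirmed anywhere in this thread) that the LSA class computes a truncated SVD of a term-document matrix $A$ with one row per term and one column per document. The permutation matrix $P$ is introduced here only for illustration.

```latex
% Reordering the documents permutes the columns of A:
A' = A P, \qquad P^{\top} P = P P^{\top} = I .

% If A = U \Sigma V^{\top} is an SVD of A, then
A' = U \Sigma V^{\top} P = U \Sigma \,(P^{\top} V)^{\top},

% and P^{\top} V still has orthonormal columns, so this is an SVD of A'.
% Equivalently, the term-term Gram matrix is unchanged:
A' A'^{\top} = A P P^{\top} A^{\top} = A A^{\top}.
```

Under that assumption, the singular values and the term-side factor $U$ do not depend on the document order in exact arithmetic. They are unique only up to the sign of each singular vector (and up to a rotation within the subspace of any repeated singular value), so a numerical SVD run on a reordered index can legitimately return vectors that look different when compared entry by entry, even though similarities between term vectors should agree.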

The Lucene method that determines the order is indexDocs in the 
org.apache.lucene.demo.IndexFiles class.
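
As a minimal sketch of how the traversal order could be made independent of the operating system: the Javadoc for java.io.File.list() gives no guarantee about the order of the returned names, so sorting them before recursing yields the same document order everywhere. The class SortedCorpusWalker and the method listFilesSorted are hypothetical names used only for this illustration. They are not part of Lucene or semanticvectors, and the sketch deliberately makes no Lucene calls.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: walks a corpus directory in a platform-independent order. */
public class SortedCorpusWalker {

    /** Returns all regular files under root, visiting directory entries in sorted name order. */
    public static List<File> listFilesSorted(File root) {
        List<File> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(File file, List<File> out) {
        if (file.isDirectory()) {
            // File.list() does not guarantee any particular order of names.
            String[] names = file.list();
            if (names == null) {
                return; // unreadable directory or I/O error
            }
            // String's natural ordering is the same on every platform.
            Arrays.sort(names);
            for (String name : names) {
                collect(new File(file, name), out);
            }
        } else if (file.isFile()) {
            out.add(file);
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : listFilesSorted(root)) {
            System.out.println(f.getPath());
        }
    }
}
```

Feeding the files to the indexer in the order returned by listFilesSorted should give the same document order on Windows and Linux, which would make it possible to check whether the remaining vector differences come from the document order alone.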

Original issue reported on code.google.com by massimo....@gmail.com on 4 Oct 2013 at 9:49

GoogleCodeExporter commented 9 years ago

I've failed to reproduce this - too many problems with classpaths on my Windows 
box :(

Is it actively hampering development, or is it mainly good to know that this 
happens?

Original comment by dwidd...@gmail.com on 14 Oct 2013 at 6:20

Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

I think the issue is quite crucial, because we obtain different vectors 
just by changing the document indexing order.

You can also reproduce the problem using a single OS.

Look for the org.apache.lucene.demo.IndexFiles class, then for the 
indexDocs(IndexWriter writer, File file) method.
In this method, just add something like this to force a certain indexing order:

    if (file.isDirectory()) {
        String[] files = file.list();
        // sort the names so the documents are indexed in natural order
        Arrays.sort(files);
        ...
    }

Then run it on your Linux OS and you will obtain different vector.bin 
files for the same corpus.
I hope this is clear; if you have any questions, just let me know.
Massimo

Original comment by massimo....@gmail.com on 15 Oct 2013 at 9:44

GoogleCodeExporter commented 9 years ago

I've marked this a low priority, I'm afraid. While it's possible to create 
different indexes on different OS's or by enforcing different orderings, it's 
not clear that this is a blocking issue. As far as I know, we don't have user 
groups who are forced to try and build a single index on a heterogeneous OS 
platform.

So yes, you can build different vectors - but I don't think anyone is forced 
to. Am I mistaken on this?

Original comment by dwidd...@gmail.com on 28 Oct 2013 at 5:23

dileepajayakody / semanticvectors

Vectors created on Windows are different from the vectors created on Linux (pitt.search.semanticvectors.LSA) #71