amiarafat / jatetoolkit

Automatically exported from code.google.com/p/jatetoolkit

Multithreading for GlobalIndex #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This is a question/feature request. In my testing, it seems like the main 
bottleneck is the building of the GlobalIndex, as opposed to using the 
FeatureBuilder classes for doing the counts. However, while there is a 
multithreaded version of the FeatureBuilder classes, there is none for the 
GlobalIndex builders. Are there plans to implement parallel versions of these 
builders? I am not very experienced with Java, but I might try to implement 
them if it is feasible to do so.

Matt

Original issue reported on code.google.com by matthew....@gmail.com on 12 Sep 2014 at 1:15

GoogleCodeExporter commented 9 years ago
Hi Matthew, I am really sorry for replying so late.
You have a valid point, and I will look into this for the next version.

However, the current issue with this project is that I have almost no time to 
dedicate to JATE regularly because of work commitments. I can only work on 
this in my spare time, so I really cannot guarantee when this will be done. But 
yes, I will definitely look into this.

Original comment by ziqizhan...@googlemail.com on 11 Dec 2014 at 11:22

GoogleCodeExporter commented 9 years ago
Thanks Ziqi. I was actually able to implement a multithreaded GlobalIndexMem 
using ConcurrentHashMap and modifications of the GlobalIndexBuilderMem class. I 
am still facing a bottleneck with disk I/O, so I tend to build the NP lists 
sequentially and then distribute the index building from those lists. For 
future versions, it might make sense to read the documents into memory to allow 
for parallel reading (although, of course, the application becomes 
significantly more RAM intensive).
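
For illustration only, and not the implementation described above: a minimal sketch of the general approach, assuming the NP lists (taken here to be per-document noun-phrase lists) have already been built sequentially and only the index update is spread across threads. The class ConcurrentIndexSketch and its methods are hypothetical names and do not correspond to JATE's GlobalIndexMem or GlobalIndexBuilderMem API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a memory-backed index; not part of JATE.
final class ConcurrentIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> termToDocs = new ConcurrentHashMap<>();

    // Safe to call from several threads: computeIfAbsent is atomic per key,
    // and the value sets are themselves concurrent sets.
    void add(int docId, List<String> nounPhrases) {
        for (String np : nounPhrases) {
            termToDocs.computeIfAbsent(np, k -> ConcurrentHashMap.newKeySet()).add(docId);
        }
    }

    Set<Integer> docsFor(String term) {
        return termToDocs.getOrDefault(term, Collections.emptySet());
    }

    // Assumes npLists was produced beforehand (e.g. sequentially, since disk
    // I/O is the bottleneck); the list position is used as the document id.
    static ConcurrentIndexSketch build(List<List<String>> npLists, int threads) {
        ConcurrentIndexSketch index = new ConcurrentIndexSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < npLists.size(); i++) {
            final int docId = i;
            pool.execute(() -> index.add(docId, npLists.get(docId)));
        }
        pool.shutdown();
        try {
            // Wait for all submitted tasks so the index is complete on return.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("index building timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("index building interrupted", e);
        }
        return index;
    }
}
```

Calling ConcurrentIndexSketch.build(npLists, 4) would index the lists on four worker threads. One task per document keeps the sketch short; batching several documents per task would likely reduce scheduling overhead on a large corpus.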

Another area where parallelism is super useful is the variant updater, because 
there are so many combinations. A multithreaded version of that functionality 
was easier to implement, since it's all in-memory data structures.

My code is currently kind of a mess, but I can share it when I get some time.

Matt

Original comment by matthew....@gmail.com on 11 Dec 2014 at 2:42