apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Explore GPU acceleration [LUCENE-7745] #8796

Open asfimport opened 7 years ago

asfimport commented 7 years ago

There are parts of Lucene that could potentially be sped up if computations were offloaded from the CPU to the GPU(s). With commodity GPUs carrying as much as 12GB of high-bandwidth RAM, we might be able to leverage GPUs to speed up parts of Lucene (indexing, search).

The first thing that comes to mind is spatial filtering, which is traditionally known to be a good candidate for GPU-based speedup (especially when complex polygons are involved). In the past, Mike McCandless has mentioned that "both initial indexing and merging are CPU/IO intensive, but they are very amenable to soaking up the hardware's concurrency."

I'm opening this issue as an exploratory task, suitable for a GSoC project. I volunteer to mentor any GSoC student willing to work on this over the summer.


Migrated from LUCENE-7745 by Ishan Chattopadhyaya (@chatman), 2 votes, updated May 11 2020 Attachments: gpu-benchmarks.png, TermDisjunctionQuery.java

asfimport commented 7 years ago

vikash (migrated from JIRA)

Hi, I am willing to work on this.

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Here are some ideas on things to start out with:

  1. Copy over and index lots of points and corresponding docids to the GPU as an offline, one time operation. Then, given a query point, return top-n nearest indexed points.
  2. Copy over and index lots of points and corresponding docids to the GPU as an offline, one time operation. Then, given a polygon (complex shape), return all points that lie inside the polygon.

In both cases, compare performance against existing Lucene spatial search. One would need to choose the most suitable algorithms for doing these as efficiently as possible. Any GPGPU API (OpenCL, CUDA) can be used for the initial exploration.
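As an illustration (not part of the original discussion), idea 1 can be sketched CPU-side in Python. The names and the brute-force distance pass here are assumptions, not Lucene APIs; the point is the data layout a GPU kernel would want:

```python
import heapq
import math

def index_points(points, docids):
    # One-time "upload": flat, parallel arrays, the layout a GPU kernel would scan
    return list(points), list(docids)

def top_n_nearest(indexed, query, n):
    # The distance pass is embarrassingly parallel (one GPU thread per point);
    # the top-n selection would map to a device-side reduction or sort
    points, docids = indexed
    qx, qy = query
    dists = [math.hypot(x - qx, y - qy) for x, y in points]
    return heapq.nsmallest(n, zip(dists, docids))

idx = index_points([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)], [10, 11, 12])
nearest = top_n_nearest(idx, (0.0, 0.0), 2)  # [(distance, docid), ...]
```

Idea 2 (polygon containment) would replace the distance map with a per-point point-in-polygon test, which is likewise independent per point.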

@dsmiley, @DaddyWri, @nknize, @mikemccand, given your depth and expertise in this area, do you have any suggestions? Any other area of Lucene that comes to mind which should be easiest to start with, in terms of exploring GPU based parallelization?

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Another experiment that, I think, is worth trying out: benchmarking an aggregation over a DocValues field and comparing the corresponding performance when executed on the GPU.

For reference, PostgreSQL seems to have seen speedups in some areas (especially aggregations over column-oriented fields): https://www.slideshare.net/kaigai/gpgpu-accelerates-postgresql

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I have a question for us all: (a) could whatever comes of this actually be contributed to Lucene itself, given the likelihood of requiring native O.S. bindings (let's presume in spatial-extras, as that seems to be the only module that can have an external dependency), and (b) does that matter for GSoC or for the expectations of the contributor? If (a) is a "no", we need to be honest up front with the contributor. I know in the past Solr has been denied off-heap filters that would have required an un-pure Java approach. A native binding would be another degree of un-purity :-)

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe even the basic hit scoring that e.g. BooleanScorer does for disjunctions of high-frequency terms would be amenable to GPU acceleration? Today BooleanScorer processes a whole window of hits at once, doing fairly simple math (the Similarity methods) on each.
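To make the "fairly simple math" concrete, here is a hedged CPU-side sketch using the textbook BM25 formula (Lucene's actual Similarity implementations differ in details such as norm encoding; all names here are illustrative). Every hit in the window is scored independently, which is what makes the window GPU-friendly:

```python
import math

def bm25_batch(tfs, doc_lens, df, num_docs, avg_len, k1=1.2, b=0.75):
    # Score a whole window of hits at once with textbook BM25.
    # Each element is independent, so on a GPU this is one thread per hit.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    return [
        idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
        for tf, dl in zip(tfs, doc_lens)
    ]

# One term scored over a window of three hits
scores = bm25_batch(tfs=[3, 1, 7], doc_lens=[120, 80, 300],
                    df=50, num_docs=10_000, avg_len=150.0)
```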

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

@dsmiley, that is a very important question. AFAIK, there is no Apache-compatible GPGPU framework. Both OpenCL and CUDA are likely incompatible with the Apache license (I am not fully sure). I see that jCuda is MIT-licensed, and it is a wrapper around the native libraries.

If there are benefits to using GPGPU processing, my thought is that we can add all the necessary plumbing in our codebase to offload processing to a plugin, whereby the user can plug in the actual GPU kernels from outside the Lucene distribution (in case those kernels violate any licensing restrictions we have). If there are clear benefits to speeding things up using a GPU, it would not be the end of the world for the end user if the code comes from outside the Apache distribution.

> If (a) is a "no", we need to be honest up front with the contributor.

That is a good point, and we can document this clearly.

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, in general, including CUDA in Lucene may be a good idea, but I see no real possibility of doing this inside Lucene Core or any other module. My idea would be to add some abstraction to the relevant parts of Lucene and make it easier to "plug in" different implementations. This code could then also be hosted outside Lucene (if licensing is a problem), anywhere on GitHub.

We should still keep the following in mind: Mike's example looks "simple" as a quick test to see whether we get gains, but making the whole thing ready for commit or bundling in any project in/outside Lucene is a whole different story. Currently BooleanScorer calls a lot of classes, e.g. the BM25 or TF-IDF similarity, to do the calculations that could possibly be parallelized. To move all this to CUDA, you have to add "plugin points" everywhere and change APIs completely. It is also hard to test, because none of our Jenkins servers has a GPU! Also, for run-of-the-mill users of Lucene, it could be a huge problem if we add native stuff into Lucene that they may never use. Because of that it MUST BE SEPARATED from Lucene core. Completely...

IMHO, if I were to focus solely on GPU parallelization, I'd create a whole new search engine in C, like CLucene. The current iterator-based approaches are not easy to transform or plug into CUDA...

For the GSoC project, we should make clear to the GSoC student that this is just a project to "explore" GPU acceleration: whether it brings any performance gains at all (I doubt that, because the call overhead between Java and CUDA is way too high, in contrast to Postgres, where everything is plain C/C++). The results would then be used to plan and investigate ways to include that in Lucene as "plugin points" (e.g., as SPI modules).

asfimport commented 7 years ago

vikash (migrated from JIRA)

Hi all, I have been reading about GPU acceleration, and in particular about GPU-accelerated computing, and I find this project very interesting. Can anyone give me a further lead on what is to be done now? I mean, the ideas that Ishan suggested are pretty good, but I am still not able to understand what Mr. David means by "(a) could whatever comes of this actually be contributed to Lucene itself". Why do you doubt that the outcome of this project could not be contributed to Lucene?

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

vikash: not all working code contributed to an open-source project is necessarily welcome. Usually it is, but sometimes project members or ASF rules insist on certain things for the perceived greater good. In this case, I believe Uwe doesn't want Lucene to include anything that would only work with certain hardware or JVM vendors, even if it were optional opt-in. If hypothetically nobody had such concerns here, be aware that any 3rd-party (non-ASF) libraries need to meet certain qualifications. For example, if whatever Java CUDA library you find happens to be licensed as GPL, then it's incompatible with ASF-run projects like this one. That's a hypothetical; I have no idea what Java CUDA libraries exist or what their licenses are. Regardless... if you come up with something useful, it's probably not necessary for Lucene itself to change, and as seen here we have some willingness to change Lucene (details TBD) if it enables people to use Lucene with CUDA. Lucene has many extension points already. Though I could imagine you might unfortunately need to copy/fork some long source files; Uwe mentioned some.

Good luck.

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Hi Vikash,

Regarding licensing issue: The work done in this project would be exploratory. That code won't necessarily go into Lucene. When we are at a point where we see clear benefits from the work done here, we would then have to explore all aspects of productionizing it (including licensing).

Regarding next steps:

> BooleanScorer calls a lot of classes, e.g. the BM25 similarity or TF-IDF to do the calculation that could possibly be parallelized.

  1. First, understand how BooleanScorer calls these similarity classes and does the scoring. There are unit tests in Lucene that can help you get there. This might help: https://wiki.apache.org/lucene-java/HowToContribute
  2. Write a standalone CUDA/OpenCL project that does the same processing on the GPU.
  3. Benchmark the speed of doing so on GPU vs. speed observed when doing the same through the BooleanScorer. Preferably, on a large resultset. Include time for copying results and scores in and out of the device memory from/to the main memory.
  4. Optimize step 2, if possible.
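A rough sketch of what steps 2-3 might look like, with the host/device round trip simulated by byte-buffer packing (the scoring formula and all names here are illustrative assumptions, not Lucene's actual code or the real CUDA API):

```python
import struct
import time

def score_kernel(tf, norm, weight, k1=1.2):
    # The per-hit arithmetic a CUDA kernel body would perform (simplified)
    return weight * tf * (k1 + 1) / (tf + k1 * norm)

def benchmark(tfs, norms, weight):
    # Time the scoring pass plus the copies step 3 asks to include; the
    # "copies" are stand-ins, done here by packing inputs and outputs
    # into float32 byte buffers as a JNI/CUDA round trip would
    t0 = time.perf_counter()
    in_buf = struct.pack(f"{len(tfs)}f", *tfs)           # host -> device
    tfs_dev = struct.unpack(f"{len(tfs)}f", in_buf)
    raw = [score_kernel(tf, n, weight) for tf, n in zip(tfs_dev, norms)]
    out_buf = struct.pack(f"{len(raw)}f", *raw)          # device -> host
    return struct.unpack(f"{len(raw)}f", out_buf), time.perf_counter() - t0

scores, elapsed = benchmark([1.0, 2.0, 3.0], [0.5, 0.5, 1.0], weight=4.0)
```

The GPU version would replace the list comprehension with a kernel launch; the packing/unpacking is where the feared Java-to-native overhead would show up in the measurements.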

Once this is achieved (which in itself could be a sufficient GSoC project), one can have stretch goals to try out other parts of Lucene to optimize (e.g. spatial search).

Another stretch goal, if the optimization results are positive, could be to integrate the solution into Lucene. The most suitable way to do so would be to create hooks in Lucene so that plugins can be built to delegate parts of the processing to external code, and then to write a plugin (that uses jCuda, for example) and do an integration test.

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

> I have no idea what Java CUDA libraries exist and what their licenses are.

jCuda happens to be MIT-licensed, which is, AFAIK, compatible with the Apache license: http://www.jcuda.org/License.txt

asfimport commented 7 years ago

vikash (migrated from JIRA)

Hello all, I have been reading a lot about how GPUs work and about GPU parallelization, in particular General-Purpose computing on Graphics Processing Units, and I have also looked in detail into the source code of the BooleanScorer.java file. It's a nice thing, and I am having no difficulty understanding its working, since Java is my speciality, so the job was quite fun. There are a few things that seem unclear to me, but I am reading and experimenting, so I will resolve them soon.

It is a nice idea to use the GPU to perform search and indexing operations on a document; that would be faster on the GPU. Regarding the licensing issue: since, as was said above, the code that we generate may not necessarily go into Lucene, will licensing still be an issue if we use the libraries in our code? And as Uwe Schindler said, we may host the code on GitHub; certainly it would not be a good idea to develop code for special hardware, but still we can give it a try and try to make it compatible with most hardware. I don't mind if this code does not go into Lucene, but we can try to change Lucene and make it better, and I am preparing myself for it; things would stay on track with your kind mentorship.

So should I submit my proposal now, or do I need to complete all four steps that Ishan listed in the last comment and then submit my proposal? And which one of the ideas would you prefer to mentor me on, i.e. which one do you think would be the better one to continue with?

> Copy over and index lots of points and corresponding docids to the GPU as an offline, one time operation. Then, given a query point, return top-n nearest indexed points.
>
> Copy over and index lots of points and corresponding docids to the GPU as an offline, one time operation. Then, given a polygon (complex shape), return all points that lie inside the polygon.
>
> Benchmarking an aggregation over a DocValues field and comparing the corresponding performance when executed on the GPU.
>
> Benchmarking the speed of calculations on GPU vs. speed observed when doing the same through the BooleanScorer. Preferably, on a large result set with the time for copying results and scores in and out of the device memory from/to the main memory included?

-Vikash

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Hi Vikash, I suggest you read the student manual for GSoC. Submit a proposal describing how you want to approach this project, including technical details (as much as possible) and detailed timelines.

Regarding the following:

  1. First, understand how BooleanScorer calls these similarity classes and does the scoring. There are unit tests in Lucene that can help you get there. This might help: https://wiki.apache.org/lucene-java/HowToContribute
  2. Write a standalone CUDA/OpenCL project that does the same processing on the GPU.
  3. Benchmark the speed of doing so on GPU vs. speed observed when doing the same through the BooleanScorer. Preferably, on a large resultset. Include time for copying results and scores in and out of the device memory from/to the main memory.
  4. Optimize step 2, if possible.

If you've already understood step 1, feel free to make a proposal on how you will use your GSoC coding time to achieve steps 2-4. Also, you can look at other stretch goals to include in the coding time. I would consider steps 2-4, if done properly and successfully, to be a good GSoC contribution by themselves. And if these steps are done properly, then either Lucene integration can be proposed for the latter part of the coding phase (the last 2-3 weeks, I'd think), or exploratory work on other parts of Lucene (apart from BooleanScorer, e.g. spatial search filtering) could be taken up.

Time is running out, so kindly submit a proposal as soon as possible. You can submit a draft first, have one of us review it and then submit it as final after the review. If the deadline is too close, there might not be enough time for this round of review, and in such a case just submit the draft as final.

Also, remember that a lot of GPGPU coding is done in C, so familiarity/experience with that is a plus.

(Just a suggestion that makes sense to me, and feel free to ignore: bullet points work better than long paragraphs, even though the length of sentences can remain the same)

asfimport commented 7 years ago

vikash (migrated from JIRA)

Yeah, I had already read the student manual. The deadline is 3rd April, which is too close; in the preparation I had almost missed the application deadline. OK, so my proposal draft is here (you may comment on it and I will do the needful):

https://docs.google.com/document/d/1HGxU1ZudNdAboj0s9WKTWJk3anbZm86JY-abaflXoEI/edit?usp=sharing .

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

I have left initial comments on your draft. Let me know if you want another round of review, perhaps after you've addressed the current comments.

asfimport commented 7 years ago

vikash (migrated from JIRA)

Hi Ishan, I have changed the proposal according to your instructions. Can you review it again?

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Hi Vikash, I have reviewed the proposal. It is still extremely disorganized, and it is not clear what your goals are or how you have split them up into tasks. It contains lots of copy-pasted comments/statements from this JIRA, and even comments from the proposal itself. The level of detail still seems inadequate to me.

I had proposed a possible way to structure your proposal (by splitting the three months into three different areas of focus, all of them I specified in the comments), but I don't see that you've done so. I asked you to find out, at least, what the default Similarity in Lucene is called (and to attempt to simulate the scoring for that on the GPU). It seems you have not done so.

At this point, I don't think much can be done (just 2 hours to go for submission). Wish you the best.

asfimport commented 7 years ago

vikash (migrated from JIRA)

It is my first GSoC, so it was a bit difficult for me to draft the proposal properly.

asfimport commented 7 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

If you still haven't submitted your proposal, I have an idea for you to improve your chances. Include a link to a github repository in the application for your initial experiments. After that, you can try to build a prototype in the next few days (until assessment starts) that demonstrates that you are on the right track. This is not strictly necessary, but just throwing out an idea that might benefit you.

All the best and regards!

asfimport commented 7 years ago

vikash (migrated from JIRA)

Oops, I could not do that; I had already submitted my proposal. If you check it now, the latest edited version is the submitted one; I made some changes to it again before submitting. Sadly, I could not change the GitHub link: it only points to my home directory on GitHub. Can I still start working now? I shall give you the link to my work, and if it is possible for you, you could show my work to the Apache Software Foundation. Will that be OK? Since, as I said in my proposal, I will work from April itself, will the Lucene repository I create and the work I store there be checked by the ASF by visiting my profile? Can that help increase my chances? And by whom will my proposal be checked?

asfimport commented 6 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Here [0] are some very initial experiments that I ran, along with Kishore Angani, a colleague at Unbxd.

  1. Generic problem: Given a result set (of document hits) and a scoring function, return a sorted list of documents along with the computed scores (which may leverage one or more indexed fields).
  2. Specific problem: Given (up to 11M) points and associated docids, compute the distance from a given query point. Return the sorted list of documents based on these distances.
  3. GPU implementation based on Thrust library (C++ based Apache 2.0 licensed library), called from JNI wrapper. Timings include copying data (scores and sorted docids) back from GPU to host system and access from Java (via DirectByteBuffer).
  4. CPU implementation was based on SpatialExample [1], which is perhaps not the fastest (points fields are better, I think).
  5. Hardware: CPU is i7 5820k 4.3GHz (OC), 32GB RAM @ 2133MHz. GPU is Nvidia GTX 1080, 11GB GDDR5 memory.

Results seem promising. The GPU is able to score 11M documents in ~50ms! Here, blue is GPU and red is CPU (Lucene).
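The transform-then-sort_by_key pipeline described in points 2-3 could be mimicked on the CPU roughly as follows (haversine is assumed for the distance; the actual kernel in the linked benchmarks may differ, and all names here are illustrative):

```python
import math

def gpu_pipeline_mock(lats, lons, docids, qlat, qlon):
    # Mimics the Thrust pipeline: a transform producing one distance key per
    # doc, then a sort_by_key(distances, docids) returning the sorted pairs
    def haversine(lat1, lon1, lat2, lon2, r=6371.0):
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    keys = [haversine(qlat, qlon, la, lo) for la, lo in zip(lats, lons)]  # transform
    order = sorted(range(len(keys)), key=keys.__getitem__)                # sort_by_key
    return [docids[i] for i in order], [keys[i] for i in order]

docs, dists = gpu_pipeline_mock([0.0, 10.0, 5.0], [0.0, 10.0, 5.0],
                                [1, 2, 3], 0.0, 0.0)
```

In the real benchmark both stages run on the device, and only the sorted docids and scores are copied back through the DirectByteBuffer.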

gpu-benchmarks.png

[0] - https://github.com/chatman/gpu-benchmarks
[1] - https://github.com/apache/lucene-solr/blob/master/lucene/spatial-extras/src/test/org/apache/lucene/spatial/SpatialExample.java

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

David, I'm not sure this was meant to be specific to lucene/spatial; Mark only mentioned it as a way to conduct an initial benchmark? Based on previous comments, the main thing we identified as a potential candidate for integration with CUDA is actually BooleanScorer (BS1, the one that does scoring in bulk)?

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Mark who? You must mean Ishan? I think that if GPUs are used to accelerate different things, they should get separate issues and not be lumped under one issue. Does that sound reasonable? Granted, this started off as a bit of an umbrella ticket, and perhaps the particular proposal Ishan is presenting in his most recent comment ought to go in a new issue specific to spatial. Accelerating Haversine calculations sounds way different to me than BooleanScorer stuff; no?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Not sure why I confused names, I meant Ishan indeed. Sorry for that. I'll let Ishan decide how he wants to manage this issue, I'm personally fine either way, I'm mostly following. :) It just caught me by surprise since I was under the impression that we were still exploring which areas might benefit from GPU acceleration.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

np. Oh, this caught me by surprise too! I thought this was about BooleanScorer or postings or something, and then, lo and behold, it's spatial – and then I thought this is so non-obvious from the issue title. So I thought I'd do a little JIRA gardening.

asfimport commented 6 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Ah, I think I wasn't clear on my intentions behind those numbers.  

> if it brings any performance - I doubt that, because the call overhead between Java and CUDA is way too high - in contrast to Postgres where all in plain C/C++

I wanted to start with those experiments just to prove to myself that there are no significant overheads or bottlenecks (as we've feared in the past) and that there can be clear benefits to be realized.

I wanted to try bulk scoring, and chose the distance calculation and sorting as an example because (1) it leverages two fields, (2) it was fairly isolated & easy to try out.

In practical use cases of spatial search, the spatial filtering doesn't require score calculation & sorting over the entire dataset (just over those documents that are in the vicinity of the user's point, filtered down by the geohash or BKD tree node); so in some sense I was trying out an absolute worst case of Lucene spatial search.

Now that I'm convinced that this overall approach works and the overheads are low, I can move on to looking at Lucene internals, maybe starting with scoring in general (BooleanScorer, for example). Other parts of Lucene/Solr that might see benefits could be streaming expressions (since they seem computation-heavy), LTR re-ranking, etc.

Actually incorporating all these benefits into Lucene would require considerable effort, and we can open subsequent JIRAs once we've had a chance to explore them separately. Till then, I'm inclined to keep this issue as a kitchen sink for all-things-GPU, if that makes sense?

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

Hi everyone,

I wanted to check if this issue was still open.  I have been experimenting with CUDA for a bit and would love to take a stab at this.

A few thoughts:

If you guys like, I can write a brief (one or two paras) description of what is possible for indexing, searching, filtering (with zero knowledge of Lucene of course) to start off...

Your thoughts please...

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

(Unrelated to your comment, Rinka, but seeing activity on this issue reminded me that I wanted to share something.) There are limited use-cases for GPU acceleration in Lucene, because query processing is full of branches, especially since we added support for impacts and WAND. That said, Mike initially mentioned that BooleanScorer might be one scorer that could benefit from GPU acceleration, as it scores large blocks of documents at once. I just attached a specialization of a disjunction over term queries that should make it easy to experiment with CUDA; see the TODO at the end, on top of the computeScores method.
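Without the attachment at hand, the bulk-scoring step being marked for offload might look roughly like this hypothetical sketch (computeScores itself, and Lucene's real postings structures, will differ; all names here are assumptions):

```python
def compute_scores_bulk(postings, weights):
    # Accumulate disjunction scores for a block of docs in one pass.
    # postings maps each term to {docid: term_frequency}; weights gives a
    # per-term weight. A GPU version would do the accumulation with one
    # thread per posting and atomic adds into a per-doc score array.
    scores = {}
    for (term, plist), w in zip(postings.items(), weights):
        for docid, tf in plist.items():
            scores[docid] = scores.get(docid, 0.0) + w * tf
    return scores

# Disjunction of two terms over a tiny block of docs
scores = compute_scores_bulk(
    {"gpu": {1: 2, 3: 1}, "cuda": {3: 4}}, weights=[1.5, 0.5])
```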

asfimport commented 5 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Hi Rinka, Kishore and I made some progress on this and presented the current state of this initiative here: https://www.youtube.com/watch?v=cY_4ApOAVJQ The code is not worth a patch right now, but we will soon have something. I shall post the latest state here as soon as I find some time (winding down from a hectic Black Friday/Cyber Monday support schedule).

asfimport commented 5 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Your thoughts please...

Thanks for your interest in this. It seems your proposed ideas are very much in line with the approach we're trying out as well. We are running some initial experiments as we speak, and I can see that there are benefits in the niche use cases that Adrien mentioned.

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

> The code is not worth a patch right now, but will soon have something. I shall update on the latest state here as soon as I find myself some time (winding down from a hectic Black Friday/Cyber Monday support schedule).

Do you think I could take a look at the code, I could do a quick review and perhaps add a bit of value. I'm fine if the code is in dev state.

Would you have written up something to describe what you are doing?

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

@jpountz

> (Unrelated to your comment Rinka, but seeing activity on this issue reminded me that I wanted to share something) There are limited use-cases for GPU acceleration in Lucene due to the fact that query processing is full of branches, especially since we added support for impacts and WAND.

While, yes, branches do impact performance, well-designed (GPU) code will consist of a combination of CPU code (the decision-making part) and GPU code. For example, I wrote a histogram as a test case that saw SIGNIFICANT acceleration, and I also identified further code areas that can be improved. I'm fairly sure (gut feel) I can squeeze out at least a 40-50x improvement on a mid-sized GPU (given the time, etc.). I think things will be much, much better on a high-end GPU, and with further scale-up on a multi-GPU system...

My point is: thinking (GPU-)parallel is a completely different ball game and requires a mind shift. Once that happens, the value add will be massive, and my gut tells me Lucene is a huge opportunity.

Incidentally, this is why I want to develop a library that I can put out there for integration.

> That said Mike initially mentioned that BooleanScorer might be one scorer that could benefit from GPU acceleration as it scores large blocks of documents at once. I just attached a specialization of a disjunction over term queries that should make it easy to experiment with Cuda, see the TODO in the end on top of the computeScores method.

Lucene is really new to me (and so is working with Apache - sorry, I am a newbie here) :). Please could you put links here...

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

Edited. Sorry...

A few questions.

Assumptions (please validate):

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> How critical is the inverted index to the user experience?

Completely: almost all queries run on the inverted index. Unlike other datastores that run queries via linear scans and let you speed things up by building indices, Lucene only supports querying via an index.

> What happens if the inverted index is sped up?

Then most queries get a speed up too.

> How many AWS instances would usually be used for searching through a ~140GB inverted index?

Hard to tell, it depends on your search load, how expensive queries are, etc.

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

Thank you. As a first step, let me come up with a GPU-based index builder, and we can look at query handling on the GPU as a second step. :-) I AM going to be slooow - my apologies - but I'm interested enough to put effort into this and will do my best.

Here's what I'll do as a first step: develop a standalone executable (we can figure out modifications to use it directly in Lucene as step 1.1) that will:

a. read multiple files (given on the command line) and come up with an inverted index;
b. write the inverted index to stdout (I can generate a Lucene index as step 1.1);
c. handle a stop-words file as a command-line param;
d. work on one GPU plus one CPU thread (I'll keep multi-GPU and multi-threading across all CPUs in mind, but implementing that will be a separate step altogether).

Goal: Look at speed difference between an Index generated on the CPU vs GPU for just this.
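A minimal CPU reference for steps (a)-(c) might look like the sketch below, assuming whitespace tokenization (the GPU version would parallelize the tokenize and sort/merge phases); all names here are illustrative, not an agreed interface:

```python
import sys
from collections import defaultdict

def build_inverted_index(paths, stopwords):
    # term -> sorted list of docids containing it, skipping stop words
    index = defaultdict(list)
    for docid, path in enumerate(paths):
        with open(path, encoding="utf-8") as f:
            terms = set(f.read().lower().split()) - stopwords
        for term in sorted(terms):
            index[term].append(docid)
    return index

def dump(index, out=sys.stdout):
    # Step (b): one "term<TAB>docid,docid,..." line per term on stdout
    for term in sorted(index):
        out.write(f"{term}\t{','.join(map(str, index[term]))}\n")
```

Timing this loop against the GPU build on the same files would give the speed comparison the goal above asks for.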

We can build from there... Thoughts please...

asfimport commented 5 years ago

Joshua Mack (migrated from JIRA)

Hello all,

There's a chance that I may end up being a lucky graduate student tasked with handling something similar to this a few months down the line, with application to a largely textual Kafka+Logstash+Elastic pipeline.

@chatman, I've noticed that the GitHub links above/referenced in the Activate 2018 talk now lead to 404s. I understand that they're not intended to be taken as the de facto method by which functionality like this should be added to Lucene, but having a baseline to look into for some of our investigations would certainly be helpful and could save us (myself) some time. If you'd be willing to rehost the code somewhere or provide updated results, I would certainly appreciate it.

Similarly, Rinka Singh, coming from a naive outsider, I think your plan sounds good, and if you have any preliminary investigations you've performed, I would appreciate being able to learn from your lessons thus far. From my perspective, I have experience with GPUs and Solr separately, but I've never dealt with them together (or with Lucene directly for that matter)

Finally, we're still in a very investigative phase and I'm certainly not the PI, but assuming this ends up being relevant to our project, I'm sure we would be happy to contribute back whatever we end up doing as a part of the project

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

Hi Joshua Mack, all, a quick update. I've been going really slow (sorry 'bout that); my day job has consumed a lot of my time. Also, what I have working is histogramming (on text files) on GPUs. The problem is that it is horrendously slow: I use GPU global memory all the way (so it is just about 4-5 times faster than a CPU) instead of sorting in local memory. I've been trying to accelerate that before converting it into an inverted index. Nearly there :) - you know how that is, the almost-there syndrome... Once I get it done, I'll check it into my GitHub.

Here are the lessons I learned on my journey:

  1. Do all the decision making in the CPU.  See if parallelization can substitute for decision making - you need to think parallelization/optimization as part of design not as an after-thought - this is counter to what everyone says about optimization.  The reason is there could be SIGNIFICANT design changes.
  2. Do as fine grained parallelism as possible.  Don't think - one cpu-thread == one gpu-thread.  think as parallel as possible.
  3. The best metaphor I found (of working with GPUs) - think of it as an embedded board attached to your machine and you move data to and fro from the board, debug on the board.  Dump all parallel processing on the board and sequential on the CPU.
  4. Read the nVidia Manuals (they are your best bet).  I figured it is better to stay with CUDA (as against OpenCL) given the wealth of CUDA info and support out there...
  5. Writing code:
    1. explicitly think about cache memory (that's your shared memory) and registers (local memory) and manage them.  This is COMPLETELY different from writing CPU code - the compiler does this for you.
    2. Try to use const, shared and register memory as much as possible. Avoid __syncthreads() if you can.
    3. Here's where the dragons lie...
  6. Engineer productivity is roughly 1/10th the normal productivity (And I mean writing C/C++ not python).  I've written and thrown away code umpteen times - something that I just wouldn't need to do when writing standard code.
  7. Added on 6/19: As a metaphor think about this as VHDL/Verilog programming (without the timing constructs - timing is not a major issue here since the threads on the device execute in (almost) lockstep).

Having said all this, :) I've a bunch of limitations that a regular software engineer will not have and have been struggling to get over them.  I've been a manager for way too long and find it really difficult to focus on just one thing (the standard ADHD that most managers eventually develop).  Also, I WAS a C programmer, loong ago - no, not even C++ and I just haven't had the bandwidth to pick C++ up and then let's not even talk about my day job pressures - I do this for an hour or two at night (sigh)...

I will put out everything I've done once I've crossed a milestone (a working accelerated histogram).  Then will modify that to do inverted indexing.
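The histogram-to-inverted-index step can be sketched on the CPU side as merging per-document (word, count) pairs into postings lists. All names below are illustrative (bounds checks are omitted); this is not code from the project:

```c
/* CPU-side sketch: merge per-document word counts into a minimal
 * inverted index (term -> list of (doc_id, freq) postings). */
#include <string.h>

#define MAX_TERMS    64
#define MAX_POSTINGS 16

struct posting { int doc_id; int freq; };

struct term_entry {
    char term[32];
    struct posting postings[MAX_POSTINGS];
    int n_postings;
};

struct index { struct term_entry terms[MAX_TERMS]; int n_terms; };

/* Record that `term` occurs `freq` times in document `doc_id`. */
void index_add(struct index *ix, const char *term, int doc_id, int freq)
{
    for (int i = 0; i < ix->n_terms; ++i) {
        if (strcmp(ix->terms[i].term, term) == 0) {
            struct term_entry *e = &ix->terms[i];
            e->postings[e->n_postings++] = (struct posting){doc_id, freq};
            return;
        }
    }
    struct term_entry *e = &ix->terms[ix->n_terms++];
    strncpy(e->term, term, sizeof e->term - 1);
    e->term[sizeof e->term - 1] = '\0';
    e->postings[0] = (struct posting){doc_id, freq};
    e->n_postings = 1;
}

/* Return how many documents contain `term` (its document frequency). */
int index_doc_count(const struct index *ix, const char *term)
{
    for (int i = 0; i < ix->n_terms; ++i)
        if (strcmp(ix->terms[i].term, term) == 0)
            return ix->terms[i].n_postings;
    return 0;
}
```

A real index would of course replace the linear scan and fixed-size arrays with a hash table or sorted term dictionary and growable postings.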

Hope this helps...  In the meantime, if you want me to review documents/design/thoughts/anything, please feel free to mail them to me at: rinka (dot) singh (at) gmail.....  At least ping me - I really don't look at the Apache messages and would probably miss something...

Sorry, did I mention this - I'm a COMPLETE noob at contributing to open source, so please forgive me if I get the processes and systems wrong...  I'm trying to learn.

asfimport commented 5 years ago

Joshua Mack (migrated from JIRA)

Sounds good!

Re: efficient histogram implementation in CUDA

If it helps, this approach has been good for a balance between GPU performance and ease of implementation for work I've done in the past. If academic paywalls block you for all those results, it looks to also be available (presumably by the authors) on researchgate

The basic idea is to compute sub-histograms in each thread block with each thread block accumulating into the local memory. Then, when each thread block finishes its workload, it atomically adds the result to global memory, reducing the overall amount of traffic to global memory.

To increase throughput and reduce shared memory contention, the main contribution here is that they actually use R "replicated" sub-histograms in each thread block, and they offset them so that bin 0 of the 1st histogram falls into a different memory bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially, it improves throughput in the degenerate case where multiple threads are trying to accumulate the same histogram bin at the same time.
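A hedged CUDA sketch of that scheme (the bin count and replication factor are my assumptions, and the paper's bank-offsetting of the replicas is elided for brevity):

```cuda
// Sketch of a privatized histogram: each block keeps R replicated
// sub-histograms in shared memory so threads hammering the same bin
// mostly hit different copies, then flushes once to global memory.
#define BINS 256   // e.g. byte-valued symbols
#define R      8   // replication factor; a power of two dividing blockDim.x

__global__ void histogram(const unsigned char *data, size_t n,
                          unsigned *global_hist)
{
    __shared__ unsigned sub[R][BINS];
    // zero the replicated sub-histograms cooperatively
    for (int i = threadIdx.x; i < R * BINS; i += blockDim.x)
        ((unsigned *)sub)[i] = 0;
    __syncthreads();

    int rep = threadIdx.x % R;              // this thread's replica
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        atomicAdd(&sub[rep][data[i]], 1u);  // cheap shared-memory atomic
    __syncthreads();

    // reduce the R replicas and flush once per bin to global memory
    for (int bin = threadIdx.x; bin < BINS; bin += blockDim.x) {
        unsigned sum = 0;
        for (int r = 0; r < R; ++r) sum += sub[r][bin];
        atomicAdd(&global_hist[bin], sum);
    }
}
```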

asfimport commented 5 years ago

Rinka Singh (migrated from JIRA)

> The basic idea is to compute sub-histograms in each thread block with each thread block accumulating into the local memory. Then, when each thread block finishes its workload, it atomically adds the result to global memory, reducing the overall amount of traffic to global memory.
>
> To increase throughput and reduce shared memory contention, the main contribution here is that they actually use R "replicated" sub-histograms in each thread block, and they offset them so that bin 0 of the 1st histogram falls into a different memory bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially, it improves throughput in the degenerate case where multiple threads are trying to accumulate the same histogram bin at the same time.

So here's what I've done/am doing:

I have basic histogramming (including eliminating stop words) working on a single GPU (I have an old Quadro 2000 with 1 GB memory) - I've tested it on a 5 MB text file and it seems to be working OK.

The following is how I'm implementing it - briefly.

Read a file in from the command line (a Linux executable) into the GPU

The advantages of this approach (to my mind) are:


The issues are:


Currently in process:

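The "read a file in from the command line into the GPU" step above might look something like this on the host side (a sketch with abbreviated error handling, not the actual MiLib driver code):

```c
// Host-side sketch: read a text file named on the command line into host
// memory, then copy it to device memory for the kernels to chew on.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <textfile>\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }
    fseek(fp, 0, SEEK_END);
    long len = ftell(fp);
    rewind(fp);

    char *host_buf = (char *)malloc(len);
    if (fread(host_buf, 1, len, fp) != (size_t)len) { perror("fread"); return 1; }
    fclose(fp);

    char *dev_buf;
    cudaMalloc(&dev_buf, len);
    cudaMemcpy(dev_buf, host_buf, len, cudaMemcpyHostToDevice);

    // ... launch tokenizing / histogramming kernels on dev_buf here ...

    cudaFree(dev_buf);
    free(host_buf);
    return 0;
}
```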

> Re: efficient histogram implementation in CUDA
>
> If it helps, this approach has been good for a balance between GPU performance and ease of implementation for work I've done in the past. If academic paywalls block you for all those results, it looks to also be available (presumably by the authors) on researchgate

Took a quick look - they are all priced products.  I will take a look at ResearchGate sometime.

I apologize, but I may not be very responsive in the next month or so, as we are in the middle of a release at work, on top of my night-time job (this).

asfimport commented 4 years ago

Rinka Singh (migrated from JIRA)

A very quick update on this.  I have been progressing on the histogram and it should take another month or so to put something out.  Basically:

  1. It is a sorted histogram of words.
  2. Like I said before, I can scale to the size of the device memory (I think 1/6th the total free mem) for one card.
  3. Again, (:) hopefully) it should be straightforward to save the host memory location + the word to the stored data.
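The sizing rule in point 2 can be read straight off the CUDA runtime; here is a sketch (the 1/6 divisor is the author's empirical figure for this workload, not a CUDA constant):

```c
// Sketch: size the largest input the card can take as a fraction of
// currently free device memory, leaving room for tokens and scratch.
#include <stdio.h>
#include <cuda_runtime.h>

size_t max_input_bytes(void)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // free/total device memory in bytes
    return free_b / 6;                  // author's rule of thumb, not a CUDA limit
}

int main(void)
{
    printf("largest text file this card can take: %zu bytes\n",
           max_input_bytes());
    return 0;
}
```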

Thanks,

asfimport commented 4 years ago

Rinka Singh (migrated from JIRA)

Another update.

Sorry about the delay.  It took me SIGNIFICANTLY longer (multiple race conditions, debugging on the GPU, etc.) than I anticipated, but I think I have the histogram running.

I'll release the code on my github in a day or two with data on how to compile & run this (I'll include both the data & the stop-word files). I'll put the link here.

I'd love to hear how these numbers relate to running an equivalent histogram on a CPU cluster - could someone please run this and let me know?  Also, if someone can provide me with a V100-based instance (even an AWS instance is fine), I can run it there (as is) and generate some numbers.

Underlying assumption: the code is working fine (this can be a bad assumption since I have done just enough testing to process one file - the code works with a small set of boundary conditions, and this is not something I would deploy at this point).  I was more focused on getting it out than on extensive testing.

Next steps:

I'll start updating the code to:

But I'd like feedback on this before going down this rabbit hole.

Also, once I have an initial inverted index in place, I'd love to have contributors.  There is a lot that even a CPU developer can contribute - one look at the code will tell you.  If you can help, please do reach out to me.  One major advantage for a contributor will be learning GPU programming - and trust me on this, that's a completely different ball game altogether.

asfimport commented 4 years ago

Rinka Singh (migrated from JIRA)

NDEBUG/optimized version for Quadro 2000 GPU (192 cores) + Intel Dual Core + Kubuntu 14.04, CUDA 7.5: 350.774 sec.

I think its time to release the code...


asfimport commented 4 years ago

Rinka Singh (migrated from JIRA)

OK. Here's the basic histogramming code: https://github.com/rinka-meltiron/MiLib

Looking forward to feedback.

One correction to my inputs in the previous posts: the test file was 2.5 MB of text, not 5 MB.  I'm sorry, I can't include that file here as it is copyrighted...  I have put up the other files that I used for testing.

asfimport commented 4 years ago

Rinka Singh (migrated from JIRA)

Just wanted to give a quick update.  I'm adding features that will make it an index, and fixing some defects.  This will take me a bit of time as my day job intrudes on this.