PRUNERS / archer

Archer, a data race detection tool for large OpenMP applications
https://pruners.github.io/archer
Apache License 2.0

Multi-level data-race reduction architecture to extract distinct races #13

Open dongahn opened 7 years ago

dongahn commented 7 years ago

I am moving this issue from #3:

This has been common feedback from our early adopters. Instead of seeing individual data race instances, they want to be able to see "unique data races." This runs in multiple dimensions. E.g., they want to be able to aggregate and reduce races from many reports gathered from running a regression test suite; they want to be able to aggregate and reduce races across multiple processes from a single MPI application run.

So far, the suggestion was to look at the current reporting routine within ThreadSanitizer and see if we can open up its protocol to add a custom reduction engine.

dongahn commented 7 years ago

From @SumedhArani:

I've been going through the source code of the reporting routine of TSan, and what @dongahn suggests can certainly be done; it adds the benefit of doing an in situ reduction of data races instead of parsing the reports after they are output.

As per my understanding of the TSan reporting routine and the data races I've been exposed to, my plan would be (based on my observations):

  1. Store the immediate return stack, i.e., where the race is taking place, in a data structure with two attributes (read location and write location).
  2. Compare any further data race being reported with the ones stored earlier.
  3. If it is a repeat, don't print it.

I was thinking of having a dictionary with the read and write locations as the key and a vector of all the repeated instances as the value, and then eventually printing the key, or something along those lines. I do have a road map (plan) for how to do the above task. Is it possible to have a more detailed discussion, say via email or Slack, @dongahn @simoatze?
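
A minimal sketch of that dictionary idea (hypothetical names and types, not the actual TSan data structures):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Key: the pair of code locations (read side, write side) of a race.
using RaceKey = std::pair<std::string, std::string>;

// Value: every repeated instance of the race (here just the reporting thread id).
static std::map<RaceKey, std::vector<int>> seenRaces;

// Returns true the first time a (read, write) location pair is seen, so the
// caller prints the report only once; later instances are only recorded.
bool recordRace(const std::string& readLoc, const std::string& writeLoc, int tid) {
  RaceKey key(readLoc, writeLoc);
  auto it = seenRaces.find(key);
  if (it == seenRaces.end()) {
    seenRaces[key].push_back(tid);
    return true;   // new race: print it
  }
  it->second.push_back(tid);
  return false;    // repeated race: suppress the output
}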

dongahn commented 7 years ago

From @dongahn: Sounds like progress!

This sounds right. I think a key is to make this a callback architecture. One can have a default function, as you described, which does a reduction over the code location. But some people may want a bit finer granularity, differentiating code locations that came from different call paths... A callback design should allow multilevel reduction, I think.

dongahn commented 7 years ago

From @SumedhArani:

@dongahn Could you please elaborate, as I didn't understand this clearly:

But some people may want a bit finer granularity, differentiating code locations that came from different call paths

Also, what sort of multilevel reduction are you thinking of?

A callback design should allow multilevel reduction

The callback design you suggest is definitely advantageous in the situation you describe, but I could not understand what issues you were actually thinking of. I ask these questions so that I can understand the various scenarios and handle them cleanly before designing a solution.

jprotze commented 7 years ago

I already have a prototype that fetches the output and redirects it into MUST's aggregation engine. It's not perfect. I think we can introduce additional filters to specify the similarities that should be aggregated.

I also submitted a patch for review to be able to redirect the output at runtime; it has not been accepted yet: https://reviews.llvm.org/D15154

dongahn commented 7 years ago

Thinking about this a bit more, I am not sure the callback is the best mechanism. What I was thinking of is a way for people like us to register a function that can perform aggregation and reduction at various levels. Using the same hook, we would have a default aggregator, but this shouldn't prevent others from registering their own instead.

What does the function prototype of the TSan report routine look like? If the data on the races are passed into this function as an argument, perhaps we can simply make the report function a "weak symbol" so that one can override it with a function defined as a strong symbol?
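
Roughly what I have in mind with the weak-symbol approach, as a minimal sketch (hypothetical names, not TSan's actual reporting function):

#include <cstdio>

// Hypothetical, simplified race description handed to the reporter.
struct RaceReport {
  const char *file;
  int line;
};

// Default reporter emitted as a weak symbol; it is used unless a tool
// links in its own strong definition of the same symbol.
extern "C" __attribute__((weak)) void ReportRace(const RaceReport *rep) {
  std::printf("race at %s:%d\n", rep->file, rep->line);
}

// A tool would provide its own strong definition instead, e.g.:
//
//   extern "C" void ReportRace(const RaceReport *rep) {
//     /* custom aggregation / reduction */
//   }
//
// and the linker picks the strong symbol over the weak default.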

I can imagine that, by default, we would have three different aggregators:

  1. aggregate the races from the target process and output them
  2. preprocess the races from the target process and output them into an intermediate form, so that a post-processing engine can later combine it with other intermediate outputs to generate a report
  3. an MPI-based aggregator that aggregates the races from all of the processes of the target MPI application (a rough sketch of this one follows below)
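
For the MPI-based aggregator (item 3), one possible shape, assuming a hypothetical aggregateAcrossRanks helper and fixed-width race signatures (this is not existing Archer/MUST code):

#include <mpi.h>
#include <cstdio>
#include <cstring>
#include <set>
#include <string>
#include <vector>

// Each race is reduced locally to a fixed-width signature, e.g. "file:line".
static const int SIG_LEN = 128;

// Gather every rank's local set of race signatures on rank 0 and deduplicate
// them there. Assumes MPI_Init has already been called.
void aggregateAcrossRanks(const std::set<std::string>& localRaces) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Flatten the local signatures into one fixed-width send buffer.
  std::vector<char> sendbuf(localRaces.size() * SIG_LEN, '\0');
  int i = 0;
  for (const std::string& sig : localRaces)
    std::strncpy(&sendbuf[i++ * SIG_LEN], sig.c_str(), SIG_LEN - 1);

  // Rank 0 needs to know how many bytes each rank will send.
  int sendcount = static_cast<int>(sendbuf.size());
  std::vector<int> counts(size), displs(size);
  MPI_Gather(&sendcount, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

  std::vector<char> recvbuf;
  if (rank == 0) {
    int total = 0;
    for (int r = 0; r < size; ++r) { displs[r] = total; total += counts[r]; }
    recvbuf.resize(total);
  }
  MPI_Gatherv(sendbuf.data(), sendcount, MPI_CHAR,
              recvbuf.data(), counts.data(), displs.data(), MPI_CHAR,
              0, MPI_COMM_WORLD);

  // Deduplicate across ranks and report each distinct race once.
  if (rank == 0) {
    std::set<std::string> global;
    for (std::size_t off = 0; off + SIG_LEN <= recvbuf.size(); off += SIG_LEN)
      global.insert(std::string(&recvbuf[off]));
    std::printf("%zu distinct races across %d ranks\n", global.size(), size);
  }
}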

dongahn commented 7 years ago

@jprotze: oh nice! It sounds like we already have a baseline to extend :-)

jprotze commented 7 years ago

The function that you register with SetPrintfAndReportCallback(void (*callback)(const char *)) is called multiple times per race report, just like you would call printf in your code.

The structure of the reports is pretty easy to parse. Each report has a ==== bar at the beginning and the end. Thus the callback can collect the strings and trigger postprocessing once the end of a report is detected:

#include <stdio.h>
#include <string>

void myPrintCallback(const char * str)
{
  // Accumulate the chunks TSan prints for the current report; a report is
  // complete once the closing "====" bar appears (the search starts at
  // offset 20 to skip the opening bar).
  static std::string buffer = "";
  buffer += str;
  std::size_t found = buffer.find("==================", 20);
  if (found != std::string::npos)
  {
    // Hand the complete report to postprocessing (here it is just echoed).
    printf("<myCallback>%s</myCallback>", buffer.c_str());
    buffer = "";
  }
}

namespace __sanitizer {
  // Weak fallback so this code still links when the sanitizer runtime,
  // which provides the real definition, is not loaded.
  void __attribute__((weak)) SetPrintfAndReportCallback(void (*callback)(const char *)) {}
}

// Register the callback before main() runs.
void __attribute__((constructor)) my_init(void)
{
  __sanitizer::SetPrintfAndReportCallback(myPrintCallback);
}

SumedhArani commented 7 years ago

After following the thread of posts, I have a few questions that remain unclear to me even after a quick Google search.

  1. @jprotze , what's a MUST aggregation engine?
  2. @dongahn

What I was thinking of is a way for people like us to register a function that can perform aggregation and reduction at various levels.

Can you help me out here as to which levels you are referring to? By levels, do you mean aggregation and reduction at different stages?

  3. The TSan reporting routine has the following function prototype: void PrintReport(const ReportDesc *rep)

where ReportDesc is a class that looks like this:

class ReportDesc {
 public:
  ReportType typ;
  Vector<ReportStack*> stacks; /*Memory location I intend to use to extract unique races*/
  Vector<ReportMop*> mops; /*Memory operations - Read/Write*/
  Vector<ReportLocation*> locs; /*where the race occurred and where did the memory operation happened*/
  Vector<ReportMutex*> mutexes;
  Vector<ReportThread*> threads; /* which thread performed which operation */
  Vector<int> unique_tids;
  ReportStack *sleep;
  int count;

  ReportDesc();
  ~ReportDesc();

 private:
  ReportDesc(const ReportDesc&);
};

What I could figure out as a way to extract unique races: the race reported by TSan has the same read and write locations, but it shows up as different races because it was executed by different threads; my context here is the sample code in README.md. We could store this memory location, and when PrintReport is called for the next race, cross-check against the data races we have already found; if the read/write happens at the same location, we don't output it. As an alternative, we can store these repeated data races in, say, a key-value store.

So @dongahn, when you suggested using callbacks, I thought of having PrintReport call back into my own function (PrintLocation, as that is the part of the code I manipulate) instead of calling the original PrintLocation, which doesn't extract unique races. (I know I have used too many specific references to the TSan reporting routine, but I've tried to illustrate my thinking for clarity.)

I hope I'm understanding the problem correctly.

@jprotze Is it that I need to call SetPrintfAndReportCallback instead of Printf for postprocessing? But then @dongahn, doesn't that become similar to the first question I initially raised, where you suggested that finding a hook in the TSan reporting routine is better than post-parsing the generated race reports?

jprotze commented 7 years ago

With MUST I was referring to our MPI correctness-checking tool: https://www.itc.rwth-aachen.de/MUST. It already aggregates detected errors across MPI ranks. Currently it only collects MPI-specific issues. At some point I planned to integrate TSan into the MUST workflow and collect TSan output with the code above.

Regarding the code snippet above: when you compile and link it into an application, it fetches the output produced by TSan (like a pipe) and redirects it into the myPrintCallback function. Instead of the printf in myPrintCallback, you would call the postprocessing function for a single TSan report.

dongahn commented 7 years ago

@SumedhArani: by multilevel, I mean an ability to compare two race instances and treat them as "equivalent" or "distinct" with different criteria.

One special case would be to reduce instances that have exactly the same backtrace but different accessing threads into one equivalence class (your case). But that doesn't have to be the only reduction criterion. In fact, in many cases different users will want to set the "equivalence" criteria differently.

One may want to treat two race instances whose call paths differ everywhere except at the leaf point of execution (e.g., line number) as equivalent and merge them together.

One may want to require all but the first few root functions to be the same for two instances to be considered equivalent. This will be common when the outputs come from two different test programs.

The idea would be to have a way for users to register whatever merge logic they want, while providing our own merge functions for those users who can live with our predesigned criteria.
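
A minimal sketch of what such registerable merge logic could look like (hypothetical types and names, not a concrete proposal for the TSan interface):

#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified view of one race instance: one frame per stack
// level, leaf frame first.
struct Race {
  std::vector<std::string> backtrace;
};

// A registered criterion decides whether two races are "equivalent".
using EquivalenceFn = std::function<bool(const Race&, const Race&)>;

// Default criterion: same leaf point of execution.
bool sameLeaf(const Race& a, const Race& b) {
  return !a.backtrace.empty() && !b.backtrace.empty() &&
         a.backtrace.front() == b.backtrace.front();
}

// Stricter criterion: identical backtrace.
bool sameBacktrace(const Race& a, const Race& b) {
  return a.backtrace == b.backtrace;
}

// The reduction keeps one representative per equivalence class under the
// user-chosen criterion.
std::vector<Race> reduce(const std::vector<Race>& races, EquivalenceFn equiv) {
  std::vector<Race> unique;
  for (const Race& r : races) {
    bool seen = false;
    for (const Race& u : unique)
      if (equiv(u, r)) { seen = true; break; }
    if (!seen) unique.push_back(r);
  }
  return unique;
}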

I think what might be helpful is for you to do a quick, simple prototype with one merge function that does the reduction as you proposed. Then we can take a look at it and see how one could extend this to what we want.

jprotze commented 7 years ago

Especially if you look at the execution of OpenMP explicit tasks, you will find that tasks that execute the same lexical code, and that might even be created in the same context, have completely different backtraces. The backtraces can also differ in depth. One approach might be to make TSan aware of OpenMP backtraces. But for now, merging reports by the leaf point of execution as suggested by @dongahn should be fine.

SumedhArani commented 7 years ago

Especially if you look at the execution of OpenMP explicit tasks, you will find that tasks that execute the same lexical code, and that might even be created in the same context, have completely different backtraces. The backtraces can also differ in depth.

@jprotze, I did notice that even for the sample program (readme.md) the master thread has a different (although similar) backtrace. I actually get the leaf point of execution from the backtrace itself, i.e., the top of the stack. The reads/writes, though, happen at different memory locations (different addresses) because the example in question accesses an array. So I think this could be an alternative for the merge logic?

@dongahn, I'll surely try to get a quick prototype up and running. I have my exams in the coming week, so I'll try to post the prototype ASAP.

Thanks, Sumedh.

dongahn commented 7 years ago

@SumedhArani: Yes, let's go with the same leaf point as the criterion to begin with and then try to find ways to program different criteria. It feels like this project will be pretty interesting!

SumedhArani commented 7 years ago

Yeah me too @dongahn !! Already getting to learn so much!! Will deliver my best and it's been a pleasure to interact with everyone!:)

SumedhArani commented 7 years ago

Quick update: I'm using function overloading to handle four scenarios as of now:

  1. Same leaf point of execution
  2. Same back trace
  3. Same memory operation at a location (data races on arrays will not be reduced by this method, as it checks the virtual memory address)
  4. No reduction at all

The developer can then call whichever report function suits their needs (a rough sketch of this is below).
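
A rough sketch of how those overloads could look (hypothetical tag types and a stand-in ReportDesc, not the actual prototype code):

#include <cstdio>

// Hypothetical stand-in for TSan's ReportDesc, just for illustration.
struct ReportDesc { const char *leafFrame; unsigned long addr; };

// Tag types pick the reduction policy at the call site via overload resolution.
struct SameLeaf {};       // 1. same leaf point of execution
struct SameBacktrace {};  // 2. same backtrace
struct SameAddress {};    // 3. same memory operation at a location
struct NoReduction {};    // 4. no reduction at all

// One overload per policy; each would consult its own "already seen" store
// before printing (bodies elided here).
void PrintReport(const ReportDesc *rep, SameLeaf)      { /* dedup by rep->leafFrame */ }
void PrintReport(const ReportDesc *rep, SameBacktrace) { /* dedup by the whole stack */ }
void PrintReport(const ReportDesc *rep, SameAddress)   { /* dedup by rep->addr */ }
void PrintReport(const ReportDesc *rep, NoReduction)   { std::printf("print every report\n"); }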

SumedhArani commented 7 years ago

A working prototype is ready for reduction on the same memory location. Now, with a little tweaking, I can get it working for the other scenarios.

https://github.com/SumedhArani/Unique-Races

Files to be looked at:

lib/tsan/rtl/tsan_rtl_report.cc
lib/tsan/rtl/tsan_report.cc
lib/tsan/rtl/tsan_report.h

These are the three files that I've changed.

As of now, I've uploaded the code to my repository. Please have a look and let me know how else you would suggest to proceed.

Right now it handles the case where the data race occurs at a specific location. Here is the sample code it was tested on:

#define N 1000

int main (int argc, char **argv)
{
  int a;

#pragma omp parallel for
  for (int i = 0; i < N - 1; i++) {
    a=i;
  }
  return 0;
}

Here, TSan would normally report three warnings (my system uses four threads) about different threads trying to update the location of variable a.

I've filtered them to one.

I've used function overloading as of now to allow one to call the report function that suits their needs.

If you find the work satisfactory, I'll code for the rest of them in a similar manner.

Thanks, Sumedh.

dongahn commented 7 years ago

Hey @SumedhArani: Thank you for the quick response! It is kind of hard to review what has changed without diffs. How about we do this:

  1. Fork the vanilla TSan runtime (i.e., your baseline) into the new PRUNERS repo (if you don't have permission, @simoatze can help).
  2. Then fork it again into your local space and create a branch like "race-aggregation".
  3. Put your changes into this branch, after which you can submit a pull request on the PRUNERS repo, tagged "experimental PR for initial review".

This will help us review your code by looking at diffs, and you will get more meaningful reviews.

At first glance, some important software maintenance considerations:

simoatze commented 7 years ago

Hey @SumedhArani, I just forked "compiler-rt" from the llvm-mirror. I sent you an invitation as a collaborator on that repository; let me know if you are able to read and write to the repo. Also, I would keep the master branch as it is (so we can keep it updated with the main LLVM/compiler-rt repo), and you can create your own branch for your edits.

SumedhArani commented 7 years ago

Hey @simoatze, thanks!! I did get the read and write permissions. I did put in a PR, only to realise later that I had not seen your comment before.

@dongahn, I'll surely try to stick to the software maintenance considerations you've pointed out.

I'm not familiar with Travis CI, but I'll look it up and let's see what test cases I can come up with.

The prototype is up and running.

Have a check and let me know. :) https://github.com/PRUNERS/compiler-rt/tree/race-aggregation

One more question: at what times are you usually available, so that I can post a reply or ask something within that time frame? We live in different corners of the world with different time zones.

Thanks, Sumedh.

simoatze commented 7 years ago

@SumedhArani @dongahn Should we close this on Archer since we have everything in the new repo now?

dongahn commented 7 years ago

I prefer keeping this open until we get a complete solution merged into Archer. I think the current solution needs to grow a bit to meet the use cases we described here. If this issue becomes too long, we may split it into multiple manageable topics, but let's keep it open for a while.

SumedhArani commented 7 years ago

@dongahn, if you find the solution acceptable for the scenario where it reduces on the basis of memory location, then along similar lines I'll code the remaining scenarios, which are

  1. Same backtrace
  2. Same leaf point of execution

The two scenarios already handled as of now are

  1. Same memory location of variable
  2. No reduction at all

My point being that we can extend the solution to handle the first two cases as well, with a design similar to that of the PR submitted now.

and @simoatze, either way is good for me! 👍

Thanks, Sumedh.