[Doubts] Regarding data simulation

Ashish0804 commented 3 years ago

Hello, I have a few doubts regarding data simulation:

Why are the customer as well as the terminal coordinates uniform? Shouldn't they be distributed on the basis of actual population density and be actual coordinates instead of being between 0 and 100? I think Europe can be fit in a rectangle and population densities can be used to get clustered data.
Is there a sweet spot for n_customers to n_terminals ratio?
The radius r is set to 5, which corresponds to around 100 available terminals for each customer.

Is there a specific reason to use 100 terminals/customer?

Scenario 3: Every day, a list of 3 customers is drawn at random. In the next 14 days, 1/3 of their transactions have their amounts multiplied by 5 and marked as fraudulent.

Is there a specific reason to use 14 days? Or could this number be based on real-world data? I doubt it would take 14 days for a customer to notice unwanted transactions on their card.

Yannael commented 3 years ago

Hello,

We agree that the data simulator could be made more realistic. However, as stated in Chapter 3, Section 2, the simple design that we propose is a deliberate choice: we try to find a reasonable balance between simplicity, interpretability, challenge, and realism:

"A simulation is necessarily an approximation of reality. Compared to the complexity of the dynamics underlying real-world payment card transaction data, the data simulator that we present below follows a simple design.

This simple design is a choice. First, having simple rules to generate transactions and fraudulent behaviors will help in interpreting the kind of patterns that different fraud detection techniques can identify. Second, while simple in its design, the data simulator will generate datasets that are challenging to deal with.

The simulated datasets will highlight most of the issues that practitioners of fraud detection face using real-world data. In particular, they will include class imbalance (less than 1% of fraudulent transactions), a mix of numerical and categorical features (with categorical features involving a very large number of values), non-trivial relationships between features, and time-dependent fraud scenarios."

I have a few doubts regarding data simulation:

Why are the customer as well as the terminal coordinates uniform? Shouldn't they be distributed on the basis of actual population density and be actual coordinates instead of being between 0 and 100? I think Europe can be fit in a rectangle and population densities can be used to get clustered data.

This point concerns the tradeoff between simplicity and realism. The simulation would indeed be more realistic if the customer and terminal coordinates followed real-world distributions. It is, however, unclear which distribution should be chosen (Europe, the US, or somewhere else), and whether the added complexity would make the results more relevant in the context of this book.
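
For anyone who wants to experiment with this, here is a minimal sketch of how clustered coordinates could be drawn on the same 0 to 100 grid. It is not code from the handbook: the function names, the "city" centers, and the `spread` parameter are all hypothetical, and a Gaussian mixture is only one simple stand-in for a real population density.

```python
import random


def sample_uniform_points(n, side=100.0, seed=0):
    """Draw n points uniformly on a side x side grid (the simple design discussed above)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]


def sample_clustered_points(n, centers, spread=3.0, side=100.0, seed=0):
    """Draw n points around hypothetical 'city' centers.

    Each point picks a center at random and adds Gaussian noise with
    standard deviation `spread`; coordinates are clipped to the grid.
    This is an illustrative assumption, not a calibrated population model.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        x = min(max(rng.gauss(cx, spread), 0.0), side)
        y = min(max(rng.gauss(cy, spread), 0.0), side)
        points.append((x, y))
    return points


# Hypothetical centers; any list of (x, y) pairs on the grid would do.
city_centers = [(20.0, 30.0), (70.0, 80.0), (50.0, 10.0)]
customers = sample_clustered_points(1000, city_centers)
```

Swapping `sample_uniform_points` for `sample_clustered_points` changes only where customers and terminals sit, so the effect of clustering on the rest of the pipeline could be compared directly.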

Is there a sweet spot for n_customers to n_terminals ratio?

The radius r is set to 5, which corresponds to around 100 available terminals for each customer.

Is there a specific reason to use 100 terminals/customer?

This is a heuristic that we believe is reasonable, based on our experience with real-world datasets. Very few customers interact with fewer than ten terminals, and very few interact with more than 1000. An average of around one hundred terminals per customer seems reasonable.
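
As a rough check of how the radius translates into a terminal count, here is a small sketch. With terminals placed uniformly on a 100 x 100 grid, the expected number within distance r of a customer is about n_terminals * pi * r^2 / 100^2, ignoring edge effects. The value n_terminals = 10000 used below is an assumption for illustration (the thread does not state it), and the function names are hypothetical.

```python
import math
import random


def expected_terminals_in_radius(n_terminals, r, side=100.0):
    """Expected number of uniformly placed terminals within distance r of a
    customer, ignoring customers near the border of the grid."""
    return n_terminals * math.pi * r ** 2 / side ** 2


def simulated_mean_terminals(n_customers, n_terminals, r, side=100.0, seed=0):
    """Monte Carlo estimate of the mean number of terminals within radius r."""
    rng = random.Random(seed)
    terminals = [(rng.uniform(0, side), rng.uniform(0, side))
                 for _ in range(n_terminals)]
    total = 0
    for _ in range(n_customers):
        cx, cy = rng.uniform(0, side), rng.uniform(0, side)
        total += sum(1 for tx, ty in terminals
                     if (tx - cx) ** 2 + (ty - cy) ** 2 < r ** 2)
    return total / n_customers


# Assumed setting: 10000 terminals, r = 5 (n_terminals is not given in the thread).
print(expected_terminals_in_radius(10000, 5))   # pi * 25, about 78.5
print(simulated_mean_terminals(200, 10000, 5))  # somewhat lower because of border effects
```

Under that assumption the count is on the order of one hundred, and it scales linearly with n_terminals and quadratically with r, which gives a simple way to pick r for a different terminal density.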

Scenario 3: Every day, a list of 3 customers is drawn at random. In the next 14 days, 1/3 of their transactions have their amounts multiplied by 5 and marked as fraudulent.

Is there a specific reason to use 14 days? Or maybe base this number on real-world data, because I doubt it will take 14 days for a customer to notice unwanted transactions on his card.

This is also a heuristic. It can take much longer than 14 days for a customer to notice they were victims of fraud. Some customers never realize it at all, while others check their bank accounts every day and notice fraudulent transactions quickly. Two weeks seems to us a reasonable average for customers to report potentially fraudulent transactions.
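
To make the role of the 14-day window concrete, here is a minimal sketch of the scenario 3 rule as it is described in this thread. It is not the handbook's implementation: the function name, the dictionary-based transaction format, and all parameter names are hypothetical. Two details are assumptions as well: the window is taken to start on the draw day, and a transaction that is already marked fraudulent is not multiplied a second time.

```python
import random


def apply_scenario_3(transactions, customer_ids, n_days, customers_per_day=3,
                     window_days=14, fraud_fraction=1 / 3, amount_factor=5,
                     seed=0):
    """Mark a fraction of compromised customers' transactions as fraudulent.

    `transactions` is a list of dicts with keys "day", "customer_id" and
    "amount" (a hypothetical format). The list is modified in place.
    """
    rng = random.Random(seed)
    for tx in transactions:
        tx.setdefault("is_fraud", 0)
    for day in range(n_days):
        # Every day, draw the customers whose cards are compromised.
        compromised = set(rng.sample(customer_ids, customers_per_day))
        # Their transactions over the window (assumed to include the draw day).
        window = [tx for tx in transactions
                  if tx["customer_id"] in compromised
                  and day <= tx["day"] < day + window_days]
        n_fraud = int(len(window) * fraud_fraction)
        for tx in rng.sample(window, n_fraud):
            if not tx["is_fraud"]:  # assumption: never multiply twice
                tx["amount"] *= amount_factor
                tx["is_fraud"] = 1
    return transactions


# Toy data: 20 customers, 2 transactions of 10.0 per customer per day, 30 days.
toy = [{"day": d, "customer_id": c, "amount": 10.0}
       for d in range(30) for c in range(20) for _ in range(2)]
apply_scenario_3(toy, customer_ids=list(range(20)), n_days=30)
print(sum(tx["is_fraud"] for tx in toy), "fraudulent out of", len(toy))
```

Exposing `window_days` as a parameter makes it easy to rerun the simulation with a shorter or longer detection delay and see how much the results depend on the two-week heuristic.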

It is worth noting that fraud scenarios also aim at balancing interpretability/simplicity/challenge/realism (Chapter 3, Section 2.5). For example, we emphasize that scenario 1 is not realistic, but aims at interpretability.

All in all, you are right that the fraud scenarios and the transaction simulator could be improved. We aim to improve them in the future, and are open to contributions on this topic.

Ashish0804 commented 3 years ago

Thanks for replying. The replies clear up my doubts, so I am closing the issue. I'll also try to make the simulator more realistic and compare how the results change, and I'll try to open a PR.

Fraud-Detection-Handbook / fraud-detection-handbook
