Exodus-Privacy / exodus-core

Core functionality of εxodus
GNU Affero General Public License v3.0
18 stars 17 forks source link

Increase Specificity of dexdump Regex #9

Open peterstory opened 5 years ago

peterstory commented 5 years ago

I'm not sure if it will make a practical difference on which trackers are detected, but I noticed that the regex here seems unnecessary broad, and will match on things other than class names: https://github.com/Exodus-Privacy/exodus-core/blob/685bab04fd44d3c0634e8a2448aad2e464a9b5aa/exodus_core/analysis/static_analysis.py#L147-L148

Here is a short example of what I mean. Here are the first 20 lines from running dexdump on WhatsApp:

dexdump work_dir/tmp_dir/classes.dex | head -n 20
Processing 'work_dir/tmp_dir/classes.dex'...
Opened 'work_dir/tmp_dir/classes.dex', DEX version '035'
Class #0            -
  Class descriptor  : 'La/a/a/a/a/a$a;'
  Access flags      : 0x0011 (PUBLIC FINAL)
  Superclass        : 'Ljava/lang/Object;'
  Interfaces        -
  Static fields     -
  Instance fields   -
    #0              : (in La/a/a/a/a/a$a;)
      name          : 'a'
      type          : 'Ljava/lang/String;'
      access        : 0x0001 (PUBLIC)
    #1              : (in La/a/a/a/a/a$a;)
      name          : 'b'
      type          : 'Ljava/lang/String;'
      access        : 0x0001 (PUBLIC)
    #2              : (in La/a/a/a/a/a$a;)
      name          : 'c'
      type          : 'Ljava/lang/String;'

And here is the result of running the regex which the code is currently using. Note the matching on the superclass and on instance field types.

dexdump work_dir/tmp_dir/classes.dex | head -n 20 | perl -n -e'/[A-Z]+((?:\w+\/)+\w+)/ && print "$1\n"' | sort | uniq
a/a/a/a/a/a
java/lang/Object
java/lang/String

I've written a slightly different regex which is more specific, so it will only match on the Class descriptor:

dexdump work_dir/tmp_dir/classes.dex | head -n 20 | perl -n -e'/\s*Class descriptor\s*:\s*\047L((?:\w+\/)+\w+)\W/ && print "$1\n"'  | sort | uniq
a/a/a/a/a/a

I would be curious to see whether this change has any downstream effects (i.e., on what trackers are detected). If this change looks good, I can include it in a PR, potentially addressing #7 at the same time.

EDIT: Fixed my regex. It didn't account for some class descriptors not including a $ or starting with a variable number of whitespace characters.

peterstory commented 5 years ago

Although I haven't rigorously benchmarked the effect of this change, there appears to be a significant performance improvement from having a more specific regex (probably because there are fewer lines to sort). My unit test which includes the subprocesses described above went from about 3.4 seconds to 2.5 seconds.

counter-reverse commented 3 years ago

I am already replacing the regex by python code. See my PR: https://github.com/Exodus-Privacy/exodus-core/pull/35