asterinas / trustflow-teeapps

TeeApps provides a general framework for developing TEE applications, together with various application implementations used in federated AI/BI.
Apache License 2.0

PSI component: intersection on MD5-hashed phone numbers fails #10

Open bronzeMe opened 4 months ago

bronzeMe commented 4 months ago

Tested version: 0.2.0b2 (sim mode). Test procedure:

Start the Capsule Manager (CM)

```
docker run -itd --name capsule-manager-sim-test-0509 --network=host secretflow/capsule-manager-sim-ubuntu22.04:latest bash
docker exec -it capsule-manager-sim-test-0509 bash
nohup ./capsule_manager --enable-tls false --port 9119 > cm_run.log 2>&1 &
```

Prepare the test data and encrypt it

Generate random mainland-China-style mobile numbers, hash them with MD5, and write them to a CSV file:

```python
import hashlib
import random

import pandas as pd


# Generate a random number in the style of a mainland-China mobile number
def generate_random_cn_mobile():
    prefixes = ['13', '14', '15', '17', '18']
    prefix = random.choice(prefixes)
    suffix = ''.join([str(random.randint(0, 9)) for _ in range(9)])
    print(prefix + suffix)
    return prefix + suffix


# Compute the MD5 digest of a string
def calculate_md5(s):
    m = hashlib.md5()
    m.update(s.encode('utf-8'))
    return m.hexdigest()


# Generate a fake CSV file
def create_fake_csv(fake_csv, num_rows=100):
    # Create an empty DataFrame
    fake_data = pd.DataFrame(columns=['user_id'])
    # Generate random data and compute its MD5 values
    fake_data['user_id'] = [calculate_md5(generate_random_cn_mobile()) for _ in range(num_rows)]
    # Save the DataFrame as a CSV file
    fake_data.to_csv(fake_csv, index=False)


# Usage example
fake_csv_path = 'fake_mobile_md5_100million.csv'  # Replace with the path where the fake CSV should be saved
create_fake_csv(fake_csv_path, 100000000)  # Generate a CSV file with 100 million rows
```
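(Editorial note, not part of the original report.) Building a single in-memory Python list of 100 million MD5 strings needs several gigabytes of RAM and is slow; a minimal chunked sketch, reusing the generate_random_cn_mobile and calculate_md5 helpers above, streams the rows to disk instead:

```python
import csv


# Editorial sketch: write the fake rows in chunks so the full 100M-row list
# never has to exist in memory at once. Assumes generate_random_cn_mobile and
# calculate_md5 from the snippet above are already defined.
def create_fake_csv_chunked(fake_csv, num_rows, chunk_size=1_000_000):
    with open(fake_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['user_id'])
        written = 0
        while written < num_rows:
            n = min(chunk_size, num_rows - written)
            # Each row is a one-element list holding the MD5 hex digest.
            writer.writerows(
                [calculate_md5(generate_random_cn_mobile())] for _ in range(n)
            )
            written += n
```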

* Data sample
```csv
user_id
dfbb470150aac06cde764fe692bd8670
65c27372ee1c01fbe41d2e0c5bdd2454
429ac59f5c9aa0f83f7e27cdca06acc3
947f35d1ed1bd7d9f43146cc571c4d3b
b2e0e9e35851f1533121864cedf4931a
4448587dc933f0ad983030598bb42df9
bb373d783c6f6d080891047b78b6780a
63f7a44b8b5d5841db79db78333e813b
d56dd2d6cafbb13b797dde06431da402
1a039381d893d0e2b614cf8782bdd80e
```

Encrypt the data

```
docker cp csvdata/fake_mobile_md5_100million.enc.csv teeapps-sim-020b0-test-0509:/host/testdata/breast_cancer
docker cp psi_md5_sim.json teeapps-sim-020b0-test-0509:/host/integration_test/
docker cp carol.key teeapps-sim-020b0-test-0509:/host/integration_test/
docker cp carol.crt teeapps-sim-020b0-test-0509:/host/integration_test/
```

```
python convert.py --cert_path carol.crt --prikey_path carol.key --task_config_path psi_md5_sim.json --scope vfehnykt --capsule_manager_endpoint 11.163.85.163:9596 --tee_task_config_path psi_md5_task.json
cd /home/teeapp/sim/teeapps
./main --plat=sim --enable_console_logger=true --enable_capsule_tls=false --entry_task_config_path=/host/integration_test/psi_md5_task.json
```

* psi_md5_sim.json
```json
{
  "sf_node_eval_param": {
    "domain": "preprocessing",
    "name": "psi",
    "version": "0.0.1",
    "attr_paths": [
      "input/input1/key",
      "input/input2/key"
    ],
    "attrs": [
      {
        "ss": [
          "user_id"
        ]
      },
      {
        "ss": [
          "user_id"
        ]
      }
    ],
    "inputs": [
      {
        "name": "input1",
        "type": "sf.table.individual",
        "meta": {
          "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
          "schema": {
            "ids": [
              "user_id"
            ],
            "features": [
            ],
            "id_types": [
              "str"
            ],
            "feature_types": [
            ]
          }
        },
        "data_refs": [
          {
            "uri": "file://input/?id=breast_cancer_alice&&uri=/host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv"
          }
        ]
      },
      {
        "name": "input2",
        "type": "sf.table.individual",
        "meta": {
          "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
          "schema": {
            "ids": [
              "user_id"
            ],
            "features": [
            ],
            "id_types": [
              "str"
            ],
            "feature_types": [
            ]
          }
        },
        "data_refs": [
          {
            "uri": "file://input/?id=breast_cancer_alice&&uri=/host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv"
          }
        ]
      }
    ],
    "output_uris": [
      "file://output/?id=psi_md5&&uri=/host/testdata/breast_cancer/md5_self_psi"
    ]
  }
}
```

teeapps (sim) run log
```
./main --plat=sim --enable_console_logger=true --enable_capsule_tls=false --entry_task_config_path=/host/integration_test/psi_md5_task.json
2024-05-09 02:13:23.012 [info] [log.cc:SetupLogger:81] Initialize logger app_log succeed.
2024-05-09 02:13:23.012 [info] [log.cc:SetupLogger:81] Initialize logger monitor_log succeed.
2024-05-09 02:13:23.012 [info] [app.cc:App:94] Start parsing Local Task Config...
2024-05-09 02:13:23.015 [info] [app.cc:App:100] Parsing Local Task Config succeed
2024-05-09 02:13:23.349 [info] [app.cc:App:125] Gen teeapps private key and certificate success
2024-05-09 02:13:23.349 [info] [app.cc:App:130] Create Capsule Manager Client success
2024-05-09 02:13:23.349 [info] [app.cc:PreProcess:273] Starting pre-processing, component preprocessing-psi-0.0.1...
2024-05-09 02:13:23.349 [info] [app.cc:GetInputDataKeys:212] Try to get Ra Cert from Capsule Manager
2024-05-09 02:13:23.354 [info] [app.cc:GetInputDataKeys:214] Got Ra Cert
2024-05-09 02:13:23.354 [info] [app.cc:GetInputDataKeys:215] Try to get data keys from Capsule Manager
2024-05-09 02:13:23.374 [info] [app.cc:GetInputDataKeys:222] Got data keys
2024-05-09 02:13:23.374 [info] [app.cc:ProcessInput:236] Downloading Individual Table Or Model/Rule and Decryption...
2024-05-09 02:13:25.619 [info] [app.cc:ProcessInput:257] Decrypting /host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv ...
2024-05-09 02:13:29.732 [info] [app.cc:ProcessInput:262] Decrypting /host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv success
2024-05-09 02:13:29.732 [info] [app.cc:ProcessInput:236] Downloading Individual Table Or Model/Rule and Decryption...
2024-05-09 02:13:32.037 [info] [app.cc:ProcessInput:257] Decrypting /host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv ...
2024-05-09 02:13:36.114 [info] [app.cc:ProcessInput:262] Decrypting /host/testdata/breast_cancer/fake_mobile_md5_100million.enc.csv success
2024-05-09 02:13:36.114 [info] [task_config_util.cc:GenAndDumpTaskConfig:657] Generate Individual table's schema
2024-05-09 02:13:36.114 [info] [task_config_util.cc:GenAndDumpTaskConfig:657] Generate Individual table's schema
2024-05-09 02:13:36.114 [info] [task_config_util.cc:FillTaskConfigParams:610] Try to fill psi config params
2024-05-09 02:13:36.114 [info] [task_config_util.cc:FillTaskConfigParams:613] Fill psi config params success
2024-05-09 02:13:36.114 [info] [task_config_util.cc:GenAndDumpTaskConfig:692] Dumping task config json succeed...
2024-05-09 02:13:36.114 [info] [app.cc:PreProcess:293] Pre-processing, component preprocessing-psi-0.0.1 succeed...
2024-05-09 02:13:36.114 [info] [app.cc:ExecCmd:299] Start executing, component preprocessing-psi-0.0.1...
2024-05-09 02:13:36.114 [info] [app.cc:ExecCmd:302] Launch command: /home/teeapp/python/bin/python3,/home/teeapp/sim/teeapps/biz/psi.py /home/teeapp/task/task_config.json
```

### Execution hangs at this point and never proceeds
zhongtianq commented 4 months ago

Thanks for raising this. The data volume is indeed large; I will follow up and look at how this component can be optimized.

bronzeMe commented 3 months ago

One more recent finding from testing: take two identical files, e.g. a.csv containing 20 million (2kw) phone-number MD5s, and run the PSI component with a.csv as both input 1 and input 2. The expected result should likewise be 20 million phone numbers, but the test shows the intersection result contains duplicates, roughly 10k-20k (1-2w) duplicated phone numbers, so the final intersection comes out at about 22 million (2.2kw) phone numbers.
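(Editorial aside, not from the thread.) A quick way to confirm whether a.csv itself already contains repeated MD5 values before the PSI step; the file path is illustrative and the user_id column name follows the earlier example:

```python
import pandas as pd

# Hypothetical check (not from the thread): count user_id values in a.csv
# that occur more than once before feeding it to the PSI component.
df = pd.read_csv('a.csv')                         # path is illustrative
dup_rows = int(df['user_id'].duplicated().sum())  # rows beyond the first occurrence
print(f'{dup_rows} duplicated rows out of {len(df)}')
```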

zhongtianq commented 3 months ago

> One more recent finding from testing: take two identical files, e.g. a.csv containing 20 million (2kw) phone-number MD5s, and run the PSI component with a.csv as both input 1 and input 2. The expected result should likewise be 20 million phone numbers, but the test shows the intersection result contains duplicates, roughly 10k-20k (1-2w) duplicated phone numbers, so the final intersection comes out at about 22 million (2.2kw) phone numbers.

The phone numbers produced by your script may themselves contain duplicates, and the pandas-based intersection is effectively a Cartesian product on matching keys: if some id_1 appears n times, the result will contain n*n rows of id_1.
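(Editorial illustration, not from the thread.) Two points make this concrete. First, duplicates in the generated file are statistically expected: drawing 20 million numbers uniformly from roughly N = 5×10^9 possible values (5 prefixes × 9 random digits) yields on the order of n²/(2N) ≈ 4×10^4 colliding pairs. Second, a minimal pandas sketch of how duplicate keys inflate an inner join:

```python
import pandas as pd

# Minimal illustration: a key that appears n times on each side of an
# inner join contributes n*n rows to the result.
left = pd.DataFrame({'user_id': ['a', 'a', 'b']})
right = pd.DataFrame({'user_id': ['a', 'a', 'b']})

joined = left.merge(right, on='user_id', how='inner')
print(len(joined))                       # 5 rows: 2*2 for 'a' plus 1*1 for 'b'
print(joined['user_id'].value_counts())  # 'a' -> 4, 'b' -> 1
```

Deduplicating each input on user_id (e.g. with drop_duplicates) before the join would avoid this blow-up.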

bronzeMe commented 3 months ago

@zhongtianq A related question: for output_path = task_config.outputs[0].data_path in a teeapps component, is task_config.outputs[0].data_path an internal path from Occlum's point of view, or a path as seen by the host? The scenario is this: a temporary file tmp.csv is generated first. Instead of copying tmp.csv to task_config.outputs[0].data_path, can I simply create a symbolic link at task_config.outputs[0].data_path pointing to tmp.csv, given that tmp.csv may be quite large?

zhongtianq commented 3 months ago

> @zhongtianq A related question: for output_path = task_config.outputs[0].data_path in a teeapps component, is task_config.outputs[0].data_path an internal path from Occlum's point of view, or a path as seen by the host? The scenario is this: a temporary file tmp.csv is generated first. Instead of copying tmp.csv to task_config.outputs[0].data_path, can I simply create a symbolic link at task_config.outputs[0].data_path pointing to tmp.csv, given that tmp.csv may be quite large?

I don't quite follow the question. The result data produced by a component is written directly to the specified path task_config.outputs[0].data_path. If a temporary file is produced by custom logic while the component runs, then the symlink should likewise be created by the component itself; the component output is only ever looked up at the specified path task_config.outputs[0].data_path.

bronzeMe commented 3 months ago

I am indeed writing a custom component. In the Spark scenario, Spark is given an output path that it treats as a directory by default, and it saves the results under that directory with dynamically generated names like 'part-xxxx.csv'. So task_config.outputs[0].data_path cannot be used directly as Spark's output destination, which leaves two approaches (a sketch of both follows after this list):

  1. Create a symlink so that task_config.outputs[0].data_path points at part-xxxx.csv, avoiding the copy overhead; this is the question above.
  2. Copy part-xxxx.csv to task_config.outputs[0].data_path.
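(Editorial sketch of the two options; the part-file glob pattern and whether the downstream upload/encryption step follows symlinks are assumptions, not something confirmed in this thread.)

```python
import glob
import os
import shutil


# Hypothetical post-processing step inside a custom component: Spark wrote a
# single dynamically named part file under spark_out_dir, and the framework
# expects the result at data_path (task_config.outputs[0].data_path).
def publish_spark_output(spark_out_dir, data_path, use_symlink=False):
    part_file = glob.glob(os.path.join(spark_out_dir, 'part-*.csv'))[0]
    if use_symlink:
        # Option 1: link instead of copy to avoid duplicating a large file.
        # Whether later upload/encryption steps follow symlinks is an
        # assumption that should be verified against the teeapps code.
        if os.path.lexists(data_path):
            os.remove(data_path)
        os.symlink(os.path.abspath(part_file), data_path)
    else:
        # Option 2: plain copy; more I/O, but no symlink semantics to worry about.
        shutil.copyfile(part_file, data_path)
```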
zhongtianq commented 3 months ago

My understanding is that this copy or link operation can simply be implemented inside the component. As for the path question: when running in an Occlum environment, the path is usually configured to start with /host, so that the output is visible under the Occlum instance directory; otherwise it ends up inside Occlum's internal file system.