HazyResearch / flyingsquid

More interactive weak supervision with FlyingSquid

Speeding up training time on large datasets with label dependencies #7

Closed · dmitra79 closed this issue 4 years ago

dmitra79 commented 4 years ago

Hello,

I tried training on 100K records with 9 weak labels: training takes 0.02 seconds without lambda_edges, but 7s with 1 edge, 18s with 2 edges, and 21s with 3 lambda edges. Is this expected behavior? Are there ways to speed it up or parallelize it? (I have multiple datasets with 47M rows each, so assuming linear scaling in the number of records, training would take almost 3 hours per dataset...)
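For context, my setup looks roughly like the following sketch (the file path and the specific edges are placeholders):

```python
import time
import numpy as np
from flyingsquid.label_model import LabelModel

# L_train: (100_000, 9) matrix of weak labels in {-1, 0, 1}, where 0 = abstain
L_train = np.load('L_train.npy')  # placeholder path
m = L_train.shape[1]

# Fast path: no dependencies between labeling functions
start = time.time()
LabelModel(m).fit(L_train)
print(f'0 edges: {time.time() - start:.2f}s')  # ~0.02s

# Slow path: declare a pairwise dependency between labeling functions
start = time.time()
LabelModel(m, lambda_edges=[(0, 1)]).fit(L_train)  # illustrative edge
print(f'1 edge:  {time.time() - start:.2f}s')  # ~7s
```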

Thank you!

DanFu09 commented 4 years ago

Hi, great question! Our code path for training without label dependencies is much more optimized than the one with dependencies (the moments are easier to compute), which is why you're seeing the runtime gap. There isn't a simple way to parallelize training in the current implementation.

Have you evaluated the performance (in terms of accuracy or F1) of the label model with and without label dependencies? In practice, we've often found that performance is still acceptable without dependencies, especially when you don't have too many labeling functions (like 9).
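If you can spare even a small labeled dev set, the comparison could look something like this (a sketch; `L_dev` and `Y_dev` are hypothetical arrays, with gold labels in {-1, 1}):

```python
import numpy as np
from sklearn.metrics import f1_score
from flyingsquid.label_model import LabelModel

L_train = np.load('L_train.npy')  # unlabeled weak-label matrix (placeholder path)
L_dev = np.load('L_dev.npy')      # weak labels on a small labeled dev set
Y_dev = np.load('Y_dev.npy')      # gold labels in {-1, 1}
m = L_train.shape[1]

# Compare the label model with no dependencies vs. one illustrative edge
for edges in ([], [(0, 1)]):
    label_model = LabelModel(m, lambda_edges=edges)
    label_model.fit(L_train)
    preds = label_model.predict(L_dev).flatten()
    print(f'{len(edges)} edge(s): F1 = {f1_score(Y_dev, preds):.3f}')
```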

dmitra79 commented 4 years ago

Thank you for the reply! Unfortunately, we have few ground-truth labels to evaluate performance with, but we'll try it.