Open erscott opened 10 years ago
Figure 2: Training data
A. Pipeline schematic for masterVar formatting and RF/GBC training (python/ipynb script)
B. The three feature sets we will benchmark are:
   A) All features
   B) Read depth, allele depth, derivatives of read depth, and genotype info (variant type, zygosity, allelic imbalance LR)
   C) GL, GQ, HQ scores from Complete Genomics
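
One way to keep the three feature sets explicit and reusable across benchmark runs is a small lookup table, sketched below. Apart from GL, GQ and HQ, the column names are hypothetical placeholders (the actual masterVar-derived column names are not listed in this issue), and `allele_depth_ratio` is only a stand-in for the "derivatives of read depth".

```python
# Hypothetical column names: only GL, GQ and HQ are named in the issue.
DEPTH_GENOTYPE = [
    "read_depth",
    "allele_depth",
    "allele_depth_ratio",    # assumed example of a read-depth derivative
    "variant_type",
    "zygosity",
    "allelic_imbalance_lr",
]
CG_SCORES = ["GL", "GQ", "HQ"]  # Complete Genomics scores

FEATURE_SETS = {
    "A": DEPTH_GENOTYPE + CG_SCORES,  # all features
    "B": DEPTH_GENOTYPE,              # depth and genotype info only
    "C": CG_SCORES,                   # Complete Genomics scores only
}


def select_features(rows, set_name):
    """Return one feature vector per variant for the chosen feature set.

    `rows` is a list of dicts keyed by column name, one dict per variant.
    """
    columns = FEATURE_SETS[set_name]
    return [[row[col] for col in columns] for row in rows]
```

Calling `select_features(rows, "C")` would then yield only the GL, GQ and HQ values for each variant, so the same training code can be run unchanged on all three sets.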
Figure 2: Machine learning filters compared
A. Pipeline schematic for masterVar formatting and RF training (python/ipynb script)
   - Join coverage data to each variant (average for multi-base variants)
B. Training results
   - Out-of-bag scores (feature importances, appendix table)
   - Test results with NA12878 using different feature sets
   - Time to train model
   - Size of model: uncompressed vs. compressed
   - Complexity of models (20k variants, 100k variants, 1 million variants, 3.8 million variants)
C. Comparison by GCAT
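
The panel B measurements (out-of-bag score, feature importances, training time, and model size before and after compression) could be collected per training-set size roughly as follows. This is a minimal sketch, assuming scikit-learn's `RandomForestClassifier`, `pickle` for serialization and `zlib` for compression; the issue does not say which tools the pipeline actually uses. The helper name `benchmark_rf` is hypothetical, and the data are synthetic with tiny sizes standing in for the 20k to 3.8 million variant sets.

```python
import pickle
import time
import zlib

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def benchmark_rf(X, y, n_estimators=50, seed=0):
    """Train one random forest and collect the panel B metrics."""
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        oob_score=True,       # score each sample using trees that did not see it
        random_state=seed,
    )
    start = time.perf_counter()
    model.fit(X, y)
    train_seconds = time.perf_counter() - start

    raw = pickle.dumps(model)  # uncompressed serialized model
    return {
        "n_variants": len(y),
        "oob_score": model.oob_score_,
        "train_seconds": train_seconds,
        "model_bytes": len(raw),
        "model_bytes_compressed": len(zlib.compress(raw)),
        "feature_importances": model.feature_importances_,
    }


# Synthetic stand-in data; real runs would use the labelled variant table.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Assumed demo sizes, far smaller than the sizes listed in the outline.
results = [benchmark_rf(X[:n], y[:n]) for n in (500, 2000)]
for r in results:
    print(
        r["n_variants"],
        round(r["oob_score"], 3),
        round(r["train_seconds"], 3),
        r["model_bytes"],
        r["model_bytes_compressed"],
    )
```

Running the same helper once per feature set and per training-set size would give one row per cell of the comparison, with the feature importances kept aside for the appendix table.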