**Closed** — jelber2 closed this issue 4 days ago
Yes, I think you're mostly correct. Each BED file region to be used for training should have a label, which is arbitrary (they can even all be the same). The region labels are useful so that you can specify upsampling / downsampling during training data generation ("pregen"). For instance, in most data sets SNPs will be far more common than insertions or deletions, but for training you might like to sample them in roughly equal proportions. The labels / 'vals_per_class' scheme makes this easy: just set the values for the 'snv', 'insertion', and 'deletion' classes to be the same.
If there are more matching rows in the BED file than the "vals_per_class" entry, then jenever will randomly downsample those rows. If there are fewer matching rows in the BED than "vals_per_class", then jenever will cycle over the rows to ensure equal sampling.
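A minimal sketch of that sampling behavior, written here only for illustration (this is not jenever's actual code, and the function name is made up):

```python
import random

def sample_rows_per_class(rows_by_label, vals_per_class, seed=0):
    """Illustrative sketch of the 'vals_per_class' behavior described above:
    labels with more matching BED rows than requested are randomly
    downsampled; labels with fewer rows are cycled to reach the target."""
    rng = random.Random(seed)
    sampled = {}
    for label, rows in rows_by_label.items():
        n = vals_per_class[label]
        if len(rows) >= n:
            # more matching rows than requested: random downsample
            sampled[label] = rng.sample(rows, n)
        else:
            # fewer matching rows than requested: cycle over the rows
            sampled[label] = [rows[i % len(rows)] for i in range(n)]
    return sampled
```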
Is this helpful? Happy to explain more if needed
Yes, this is very helpful. I will give it a try and perhaps report back in a separate issue on how a model trained on HG002-HG004 with these Nanopore SUP_Herro reads performs when leaving out, for example, chr22 from the training data.
Reopening this, as I was having issues with buildclf.py when training a classifier.
It seems that one needs
That's about right, but for step 2 you should run without any classifier - it's not required, and the quality scores will default to the exponentiated logits associated with the tokens containing the variants. Since those 'raw' quality scores are one of the features the classifier model uses, it's important that they are in fact the 'raw' qualities (from the logits) and not produced by a different classifier.

For step 3, hap.py is fine, but a little slow. vcfeval is faster, but remember to run with the --ref-overlap and --all-records options, since vcfeval will ignore variant records with non-PASS filter fields by default.

Sadly, a bug crept into the training logic of buildclf.py, and it will produce a non-functional model. I'm working on a big update that offers much faster performance (almost 2X!) for calling and a bunch of other minor improvements. I'll include the buildclf fix in that. In the meantime, if you're in a hurry, you can export features and labels from buildclf and train an sklearn random forest model from those. That's typically what I do, so I can evaluate performance metrics of the classifier more easily before doing actual calling with it.
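As a sketch of that fallback workflow - training an sklearn random forest on exported features and labels - here is a minimal example. It uses synthetic data so it runs standalone; in practice you would load the arrays exported by buildclf instead (the variable names here are my own, not jenever's):

```python
# Sketch: train a random forest classifier as one might on the
# features/labels exported from buildclf. Synthetic data stands in
# for the real export so the example is self-contained.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# In practice: X, y = <features and labels exported from buildclf>
X, y = make_classification(n_samples=2000, n_features=12,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
preds = clf.predict(X_te)

# Evaluate precision/recall on the held-out split before using the
# model for actual calling
print(f"precision={precision_score(y_te, preds):.3f} "
      f"recall={recall_score(y_te, preds):.3f}")
```

This lets you inspect the classifier's precision/recall tradeoff cheaply before committing to a full calling run.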
Ok, thank you very much. I will first try improving my calling model then think about the classifier.
Would you recommend a random forest over the classifier in terms of recall and precision? Is the classifier merely for convenience to the jenever user?
Random forests seem to work very well as the classifier model, and that's what I've typically used. I've also tried gradient-boosted trees (XGBoost), but the performance seemed pretty close to random forests. In general, random forests are hard to beat for low-dimensional tabular data, which is what this is. Keep in mind that you'll probably want thousands of false positive (FP) calls for training, which can be hard to get since jenever runs pretty slowly. Feel free to use the new branch 'better_parallel_call', which is much faster. The code has been significantly refactored, so many files will move around... but it's the future anyway.
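That kind of head-to-head comparison is easy to sketch with cross-validation (illustrative only, on synthetic data; sklearn's GradientBoostingClassifier stands in for XGBoost here):

```python
# Compare random forest vs. gradient-boosted trees via 5-fold
# cross-validated F1 on synthetic low-dimensional tabular data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1500, n_features=10, random_state=1)

rf_f1 = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=1),
                        X, y, cv=5, scoring="f1").mean()
gb_f1 = cross_val_score(GradientBoostingClassifier(random_state=1),
                        X, y, cv=5, scoring="f1").mean()
print(f"RF F1={rf_f1:.3f}  GB F1={gb_f1:.3f}")
```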
Ok, so I made a mistake in the pregen steps: I used your scripts/splitbed.py script for the validation data but not for the training data, and was getting quite a few errors in the logs for the training data. Now that I have fixed that issue and remembered to use splitbed.py, I think over the weekend I should have some better results for the pregen data for training.
That might be why ppv (positive predictive value?) was so low (below). Can you clarify the abbreviations ppa and ppv?
```
train INFO Epoch 0 Secs: 4399.76 lr: 0.00001 loss: 577.8231 val acc: 0.992 / 0.982 ppa: 0.000 / 0.041 / 0.033 ppv: 0.000 / 0.148 / 0.161 swaps: 7
```
You're right that PPV stands for positive predictive value - the fraction of all calls that match a true variant (from the genome-in-a-bottle data). PPA is positive percent agreement - just a synonym for sensitivity or recall - the fraction of true variants that are detected by the algorithm.

Your data from above is for epoch 0 - are you fine-tuning the 100M_s28... model, or starting from scratch? It can take a pretty long time to get high PPA / PPV (especially snp PPV). Can I ask how much training data you have?
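For reference, both metrics can be written as a couple of lines (an illustrative helper of my own, not jenever code):

```python
def ppa_ppv(tp, fn, fp):
    """PPA (sensitivity/recall) and PPV (precision) from call counts:
    PPA = TP / (TP + FN)  -- fraction of true variants detected
    PPV = TP / (TP + FP)  -- fraction of calls matching a true variant
    """
    ppa = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return ppa, ppv
```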
The training data is only hg001-hg004 (the CRAM files that I shared with you previously) for chr1-21 (chr22 left out for validation). I am indeed fine-tuning the 100M_s28... model.

Edit: As I had said, I made an error in not using the splitbed.py script, which I had used for calculating the number of snvs for the pregen config.yaml, but I messed up and used the files from before running splitbed.py as the input for pregen.
After 10,000,000 samples, starting with about 42,977 training files (increased to about 60,000, since pregen is still running):
```
[04-23 11:18:45] 905940 train INFO Epoch 0 Secs: 94354.05 lr: 0.00005 loss: 1557.0583 val acc: 0.994 / 0.987 ppa: 0.054 / 0.241 / 0.324 ppv: 0.612 / 0.422 / 0.476 swaps: 16
```
Sadly, I only have access to a single A100 40GB card on a workstation for training, as our GPU cluster uses an ancient Nvidia driver (470!) that is somehow not compatible with the training (I had to use pip install torch==2.0.0), but it is ok for the inference steps.
I trained and evaluated a random forest model in JuliaLang, though that was not for jenever variants - at least I know how to do that in principle now. I'll try it after, I guess, a few more epochs are trained. I am continuing to train from the checkpoint whose values I show above.
Results from hap.py on bases 16-17 million of chr22 (the entire chr22 was held out from training). Thanks for the tip about vcfeval - it runs, as you said, a lot faster; I just like the output from hap.py for illustrating things.
```
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 109 33 76 573 186 353 15 129 0.302752 0.154545 0.616056 0.204633 NaN NaN 1.523810 15.171429
INDEL PASS 109 24 85 125 33 67 4 26 0.220183 0.431034 0.536000 0.291474 NaN NaN 1.523810 4.636364
SNP ALL 1278 106 1172 924 289 529 95 96 0.082942 0.268354 0.572511 0.126718 1.846325 1.713873 2.661891 2.145833
SNP PASS 1278 52 1226 240 70 118 46 8 0.040689 0.426230 0.491667 0.074286 1.846325 3.189655 2.661891 0.394118
```
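As a quick sanity check on numbers like these, hap.py's F1 column is just the harmonic mean of the recall and precision columns (an illustrative helper, not part of hap.py):

```python
def f1(recall, precision):
    """F1 score: the harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)
```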
Just an update: after quite a while, using a much larger 10 Mbp region of chr22 (held out from training), I am getting
```
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 1555 1008 547 6159 982 4152 109 680 0.648232 0.510713 0.674135 0.571313 NaN NaN 2.210526 4.865000
SNP ALL 10757 9226 1531 24290 1644 13419 389 464 0.857674 0.848772 0.552450 0.853200 2.395708 1.968031 2.326114 1.998008
```
It took me quite a while to modify the code for pytorch-2.0.0 and get it running on our cluster's ancient Nvidia 470 drivers. I'll keep training and hope to get closer to 0.9 F1 scores for both indels and SNPs.
Just an update... I gave up on the error-corrected Nanopore read model, as I think I had too low coverage on average (closer to 20x than, say, 40-50x). I then tried using just raw Nanopore reads at about 50x coverage, but never really got F1 scores above 0.2. I suspect it might have something to do with k-mer accuracies, as I generally find that raw Nanopore reads (even with the latest base-calling models) do not perform well with k-mer based analyses unless you error-correct the reads. Feel free to close this issue after you read this.
Hi Jean, I've been working on some updates for using ONT data over the last few weeks. There are a few different issues that are affecting both runtime performance and accuracy. I sorted out some of them and am generating training data from the four GIAB samples that were available publicly, but I'm not sure that'll be enough. (For a BWA fine-tune of the original model, we used about 20 samples.) I'll keep plugging away, but there are definitely some big code changes that need to occur to make things work for ONT.
Oh, that sounds exciting about seeing how nanopore might work with jenever. Looking forward to updates.
Dear Brendan,
I know that you were considering something else in terms of making a long-read model, but I wanted to try to make a model and classifier based on these CRAM files, after shredding them into 150-bp alignment pieces. However, I am a little confused, as https://github.com/brendanofallon/jovian/blob/0fe88c574aa50a4d5f5debb3cd5345219643126f/README.md?plain=1#L99-L101
seems to suggest that the fourth column of the BED file for the config.yaml (from GIAB, for example) would be subdivided into the value classes one sees below
https://github.com/brendanofallon/jovian/blob/0fe88c574aa50a4d5f5debb3cd5345219643126f/README.md?plain=1#L83-L87
Am I correct that such a BED file would be subdivided based on the vals_per_class? I would really be interested in seeing how well a long-read model (even a "shredded" one) for jenever performs on the Nanopore SUP_Herro data that I sent you links for.
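For what it's worth, here is my current understanding of the scheme, sketched with made-up region labels and counts (everything below is hypothetical, just to check my reading of the README):

```yaml
# regions.bed (tab-separated; fourth column is an arbitrary label):
#   chr1   10000   10150   snv
#   chr1   20000   20150   insertion
#   chr1   30000   30150   deletion
#
# config.yaml fragment: how many training samples to draw per label
vals_per_class:
  snv: 5000        # downsampled if more matching BED rows exist
  insertion: 5000  # cycled if fewer, so classes end up sampled
  deletion: 5000   # in roughly equal proportions
```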