brendanofallon / jovian

Detecting variants in NGS short read data by sequence-to-sequence modeling

Training a long-read model as if the alignments were short reads #42

Closed: jelber2 closed this issue 4 days ago

jelber2 commented 5 months ago

Dear Brendan,

I know that you were considering something else in terms of making a long-read model, but I wanted to try to make a model and classifier based on these CRAM files, after shredding the alignments into 150-bp pieces.

However, I am a little confused, as https://github.com/brendanofallon/jovian/blob/0fe88c574aa50a4d5f5debb3cd5345219643126f/README.md?plain=1#L99-L101

seems to suggest that the fourth column of the BED file for the config.yaml (from GIAB, for example) would be subdivided into the value classes one sees below:

https://github.com/brendanofallon/jovian/blob/0fe88c574aa50a4d5f5debb3cd5345219643126f/README.md?plain=1#L83-L87

Am I correct that such a BED file would be subdivided based on vals_per_class? I would really be interested in seeing how well a long-read model (even a "shredded" one) for jenever performs on the Nanopore SUP_Herro data that I sent you links for.
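
(For reference, the kind of shredding I mean is roughly the sketch below - just an illustration with made-up file names, splitting the read sequences into 150-bp pieces and writing them out for realignment; it is not jovian's own tooling.)

```python
# Illustrative only: split long reads from a CRAM into 150-bp pieces and write them as
# FASTQ, to be realigned with a short-read mapper afterwards.
# File names ("long_reads.cram", "ref.fa", "shredded.fastq") are placeholders.
import pysam

CHUNK = 150

with pysam.AlignmentFile("long_reads.cram", "rc", reference_filename="ref.fa") as cram, \
        open("shredded.fastq", "w") as out:
    for read in cram.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue
        seq = read.query_sequence
        quals = read.query_qualities
        if seq is None or quals is None:
            continue
        for i in range(0, len(seq) - CHUNK + 1, CHUNK):
            piece = seq[i:i + CHUNK]
            qual = "".join(chr(q + 33) for q in quals[i:i + CHUNK])
            out.write(f"@{read.query_name}_{i}\n{piece}\n+\n{qual}\n")
```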

brendanofallon commented 5 months ago

Yes, I think you're mostly correct. Each BED file region to be used for training should have a label, which is arbitrary (they can even all be the same). The region labels are useful so that you can specify upsampling / downsampling during training data generation ("pregen"). For instance, in most data sets SNPs will be far more common than insertions or deletions, but maybe for training you'd like to sample them in roughly equal proportions. The labels / 'vals_per_class' scheme makes this easy: just set the values for the 'snv', 'insertion', and 'deletion' classes to be the same.

If there are more matching rows in the BED file than the "vals_per_class" entry, then jenever will randomly downsample those rows. If there are fewer matching rows in the BED than "vals_per_class", then jenever will cycle over the rows to ensure equal sampling.
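
To make that concrete, here is a rough sketch of the idea (my paraphrase, not the actual pregen code; the BED rows and class counts are made up):

```python
# Illustration of the vals_per_class sampling logic described above (not jenever's code).
# Each BED row carries an arbitrary label in the 4th column; vals_per_class says how many
# regions to draw per label: downsample when there are too many rows, cycle when too few.
import random
from itertools import cycle, islice

bed_rows = [                       # (chrom, start, end, label) - made-up example regions
    ("chr1", 100, 200, "snv"),
    ("chr1", 500, 600, "snv"),
    ("chr1", 900, 1000, "snv"),
    ("chr2", 100, 200, "insertion"),
    ("chr2", 500, 600, "deletion"),
]
vals_per_class = {"snv": 2, "insertion": 3, "deletion": 3}

sampled = []
for label, n in vals_per_class.items():
    rows = [r for r in bed_rows if r[3] == label]
    if len(rows) >= n:
        sampled.extend(random.sample(rows, n))     # more rows than requested: downsample
    else:
        sampled.extend(islice(cycle(rows), n))     # fewer rows than requested: cycle
print(sampled)
```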

Is this helpful? Happy to explain more if needed

jelber2 commented 5 months ago

Yes, this is very helpful. I will give it a try and perhaps report back in a separate issue on how a model trained on HG002-HG004 with these Nanopore SUP_Herro reads performs when, for example, chr22 is left out of the training data.

jelber2 commented 5 months ago

Reopening this, as I was having issues using buildclf.py to train a classifier.

It seems that one needs to:

  1. train a model (in progress)
  2. run jenever with that model and the classifier from https://github.com/brendanofallon/jovian/blob/0fe88c574aa50a4d5f5debb3cd5345219643126f/models/s28ce40_bamfix.model
  3. run hap.py to get the TPs and FPs (possibly repeating steps 2 and 3 with more than one sample)
  4. make a new classifier from that TP and FP information

brendanofallon commented 5 months ago

That's about right, but for step 2 you should run without any classifier - it's not required, and the quality scores will default to the exponentiated logits associated with the tokens containing the variants. Since those 'raw' quality scores are one of the features the classifier model uses, it's important that they are in fact the 'raw' qualities (from the logits) and not produced by a different classifier.

For step 3, hap.py is fine but a little slow. vcfeval is faster, but remember to run with the --ref-overlap and --all-records options, since vcfeval will ignore variant records with non-PASS filter fields by default.

Sadly, a bug crept into the training logic of buildclf.py, and it will produce a non-functional model. I'm working on a big update that offers much faster performance (almost 2X!) for calling and a bunch of other minor improvements; I'll include the buildclf fix in that. In the meantime, if you're in a hurry, you can export features and labels from buildclf and train an sklearn random forest model from those. That's typically what I do, so I can evaluate performance metrics of the classifier more easily before doing actual calling with it.
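
If you go that route, it can be as simple as something like this (just a sketch - the feature/label file names and format here are placeholders, not buildclf's actual export format):

```python
# Sketch: train an sklearn random forest on features/labels exported from buildclf.py.
# "clf_features.npy" / "clf_labels.npy" are placeholder names for whatever you export.
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X = np.load("clf_features.npy")    # one row of features per candidate variant
y = np.load("clf_labels.npy")      # 1 = true positive, 0 = false positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# Check precision / recall on the held-out split before using the model for calling
pred = rf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))

joblib.dump(rf, "classifier.rf")   # save the model for later use
```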

jelber2 commented 5 months ago

Ok, thank you very much. I will first try improving my calling model then think about the classifier.

jelber2 commented 5 months ago

Would you recommend a random forest over the classifier in terms of recall and precision? Or is the classifier merely a convenience for the jenever user?

brendanofallon commented 5 months ago

Random forests seem to work very well as the classifier model, and that's what I've typically used. I've also tried gradient-boosted trees (XGBoost), but the performance seemed pretty close to random forests. In general, random forests are hard to beat for low-dimensional tabular data, which is what this is. Keep in mind that you'll probably want thousands of false positive (FP) calls for training, which can be hard to get since jenever runs pretty slowly. Feel free to use the new branch 'better_parallel_call', which is much faster. The code has been significantly refactored, so many files will move around... but it's the future anyway.
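
If you want to compare model types on the same exported features, a quick cross-validation sketch (again with placeholder file names):

```python
# Sketch: compare a random forest to gradient-boosted trees on the exported features.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.load("clf_features.npy")
y = np.load("clf_labels.npy")

models = {
    "random forest": RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0),
    "gradient-boosted trees": HistGradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="average_precision")
    print(f"{name}: mean average precision = {scores.mean():.3f}")
```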

jelber2 commented 4 months ago

Ok, so I made a mistake in the pregen steps: I used your scripts/splitbed.py script for the validation data but not for the training data, and was getting quite a few errors in the logs for the training data. Now that I have fixed that issue and remembered to use splitbed.py, I think that over the weekend I should have some better pregen data for training.

That might be why ppv (positive predictive value?) was so low (below). Can you clarify the abbreviations ppa and ppv?

train  INFO  Epoch 0 Secs: 4399.76 lr: 0.00001 loss: 577.8231 val acc: 0.992 / 0.982  ppa: 0.000 / 0.041 / 0.033  ppv: 0.000 / 0.148 / 0.161 swaps: 7
brendanofallon commented 4 months ago

You're right that PPV stands for positive predictive value - the fraction of all calls that match a true variant (from the Genome in a Bottle data). PPA is positive percent agreement - just a synonym for sensitivity or recall - the fraction of true variants that are detected by the algorithm.

Your data from above is for epoch 0 - are you fine-tuning the 100M_s28... model, or starting from scratch? It can take a pretty long time to get high PPA / PPV (especially SNP PPV). Can I ask how much training data you have?
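
In code form, the definitions are just (the counts here are illustrative):

```python
def ppa(tp, fn):
    """Positive percent agreement == sensitivity / recall."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Positive predictive value == precision."""
    return tp / (tp + fp)

print(ppa(tp=90, fn=10))   # 0.9  - 90% of true variants were called
print(ppv(tp=90, fp=30))   # 0.75 - 75% of calls matched a true variant
```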

jelber2 commented 4 months ago

The training data is only HG001-HG004 (the CRAM files that I shared with you previously) for chr1-21 (chr22 left out for validation). I am indeed fine-tuning the 100M_s28... model.

Edit: As I had said, I made an error in not using the splitbed.py script. I had used it for calculating the number of SNVs for the pregen config.yaml, but I messed up and used the files from before running splitbed.py as the input for pregen.

jelber2 commented 4 months ago

After 10,000,000 samples, starting with about 42,977 training files (increased to about 60,000, since pregen is still running):

[04-23 11:18:45] 905940  train  INFO  Epoch 0 Secs: 94354.05 lr: 0.00005 loss: 1557.0583 val acc: 0.994 / 0.987  ppa: 0.054 / 0.241 / 0.324  ppv: 0.612 / 0.422 / 0.476 swaps: 16

Sadly, I only have access to a single A100 40GB card on a workstation for training, as our GPU cluster uses an ancient NVIDIA driver (470!) that is somehow not compatible with the training (I had to use pip install torch==2.0.0), but it is OK for the inference steps.

I have trained and evaluated a random forest model in JuliaLang before, but that was not for jenever variants; at least I know how to do that in principle now. I will try it after, I guess, a few more epochs are trained. I am continuing to train from the checkpoint whose values are shown above.

Results from hap.py on bases 16-17 million of chr22 (the entire chr22 was held out from training). Thanks for the tip about vcfeval - as you said, it runs a lot faster; I just like the output from hap.py for illustrating things.

Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          109        33        76          573       186        353     15    129       0.302752          0.154545        0.616056         0.204633                     NaN                     NaN                   1.523810                  15.171429
INDEL   PASS          109        24        85          125        33         67      4     26       0.220183          0.431034        0.536000         0.291474                     NaN                     NaN                   1.523810                   4.636364
  SNP    ALL         1278       106      1172          924       289        529     95     96       0.082942          0.268354        0.572511         0.126718                1.846325                1.713873                   2.661891                   2.145833
  SNP   PASS         1278        52      1226          240        70        118     46      8       0.040689          0.426230        0.491667         0.074286                1.846325                3.189655                   2.661891                   0.394118
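
(As a sanity check on my reading of the hap.py columns, the Recall / Precision / F1 values can be reproduced from the counts, assuming QUERY.TP = QUERY.TOTAL - QUERY.FP - QUERY.UNK, which seems to hold here:)

```python
# Recompute hap.py's metrics for the INDEL ALL row above from the raw counts.
truth_total, truth_tp = 109, 33
query_total, query_fp, query_unk = 573, 186, 353

query_tp = query_total - query_fp - query_unk         # 34
recall = truth_tp / truth_total                       # 33/109  ~ 0.3028
precision = query_tp / (query_tp + query_fp)          # 34/220  ~ 0.1545
f1 = 2 * precision * recall / (precision + recall)    # ~ 0.2046
frac_na = query_unk / query_total                     # 353/573 ~ 0.6161
print(recall, precision, f1, frac_na)
```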
jelber2 commented 2 months ago

Just updating this: after quite a while, using a much larger 10-Mbp region of chr22 (held out from training), I am getting

 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         1555      1008       547         6159       982       4152    109    680       0.648232          0.510713        0.674135         0.571313                     NaN                     NaN                   2.210526                   4.865000
  SNP    ALL        10757      9226      1531        24290      1644      13419    389    464       0.857674          0.848772        0.552450         0.853200                2.395708                1.968031                   2.326114                   1.998008

It took me quite a while to modify the code for pytorch 2.0.0 and get it running with our cluster's ancient 470-series NVIDIA drivers. I'll keep training in the hope of getting closer to 0.9 F1 scores for both indels and SNPs.

jelber2 commented 4 days ago

Just an update... I gave up on the error-corrected Nanopore read model, as I think my coverage was too low on average (closer to 20x than, say, 40-50x), and tried using just raw Nanopore reads at about 50x coverage instead. I never really got F1 scores above 0.2, and I suspect it has something to do with k-mer accuracy: I generally find that raw Nanopore reads (even with the latest base-calling models) do not perform well in k-mer-based analyses unless you error-correct the reads first. Feel free to close this issue after you read this.

brendanofallon commented 4 days ago

Hi Jean, I've been working on some updates for using ONT data over the last few weeks. There are a few different issues that are affecting both runtime performance and accuracy. I sorted out some of them and am generating training data from the four GIAB samples that were available publicly, but I'm not sure that'll be enough. (For a BWA fine-tune of the original model, we used about 20 samples.) I'll keep plugging away, but there are definitely some big code changes that need to occur to make things work for ONT.

jelber2 commented 4 days ago

Oh, that sounds exciting - seeing how Nanopore might work with jenever. Looking forward to updates.