Training and using a new model fails

cwoehle commented 4 years ago

Hi,

I´m trying to train my own model via mCaller. The methylation sites obtained after training look promising. However, when I try to apply the new model to a dataset I get an error.

Here is the full error message based on the test data provided following the commands described in the README:

testdata/ 1 contigs 1 threads 26dd376e-9d82-41fc-921e-71e559c8e8d1_Basecall_2D_template 13673 CATTAMCAATC 1.1,-0.23666666666666666,0.28,1.17,-0.02,0.6699999999999999,7.055265349382997 - - Index or Key Error ['general'] ['AA', 'AC', 'AG', 'AM', 'AT', 'MG', 'MA', 'MC', 'MM', 'MH', 'MT'] MC 'MH' Traceback (most recent call last): File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 188, in main() File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 185, in main args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training) File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 59, in distribute_threads extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list) File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/extract_contexts.py", line 218, in extract_features print model[twobase_model].predict_proba([diffs]) KeyError: 'MH'

Many thanks for any clarification! Christian

al-mcintyre commented 4 years ago

hm, is your model for adenine methylation still?

cwoehle commented 4 years ago

Hi al-mcintyre,

thank you for the quick reply. If I understood everything correctly it should be trained for adenine. testdata/test_positions.txt contains only m6A & A assignments.

For the example posted, I strictly followed the README with the test data: mCaller.py -p testdata/test_positions.txt -r testdata/pb_ecoli_polished_assembly.fasta -e testdata/masonread1.eventalign.tsv -t 4 --train -f testdata/masonread1.fastq The error was produced by the following command: mCaller.py -p testdata/test_positions.txt -r testdata/pb_ecoli_polished_assembly.fasta -e testdata/masonread1.eventalign.tsv -d model_NN_6_m6A.pkl -f testdata/masonread1.fastq

A similar error occurred with a larger data set, but I thought it would be easier to stick to the test set here. I had some permission issues before as mCaller is installed globally on our server and the script wanted to write in the original installation directory, but changing this did not fix the problem.

Best wishes, Christian

al-mcintyre commented 4 years ago

Hi Christian,

Can you send me your model file? I'll take a look. My email is abm237 at cornell.edu if you don't want to post.

al-mcintyre commented 4 years ago

If you can edit the mCaller files, it's a simple fix in extract_contexts.py: change line 130 twobase = True to twobase = False. The problem was that I was training separate models for 'AG' contexts vs. 'AA', 'AC', and 'AT', but left the default training to combine everything, which led to this incompatibility. If you have 'AG' contexts you're interested in and do want to retrain a separate model (I found performance slightly improved), then alternatively you can change the boolean in line 133 to True and leave line 130.

cwoehle commented 4 years ago

Hi al-mcintyre,

thank you for the helpful response. The training is finishing now and the results with the obtained model look much better than with the 'default' model. However, I get >100,000 methylated positions, while the expected motifs are only present in a couple of thousands of them. I`m concerned that the training set was not appropriate and lead to a lot of unspecific false positives. Or how would you interpret such a results?

Many thanks & Best wishes Christian

al-mcintyre commented 4 years ago

Hi Christian, There is in general an issue with false positives when predicting methylation at many sites. Depending on motif, the false positive rate can be quite low (e.g. 5%) but still translate to 0.05*2E6 predictions at unmethylated sites = 100,000 false positives for a 2M bp genome. Two things you could try are 1) adjusting the thresholds to reduce false positives (although you may lose true positives as well), and 2) taking the intersect of predicted sites with alternative predictors like Tombo.

al-mcintyre / mCaller

Training and using a new model fails #15