Closed cwoehle closed 3 years ago
hm, is your model for adenine methylation still?
Hi al-mcintyre,
thank you for the quick reply. If I understood everything correctly it should be trained for adenine. testdata/test_positions.txt contains only m6A & A assignments.
For the example posted, I strictly followed the README with the test data:
mCaller.py -p testdata/test_positions.txt -r testdata/pb_ecoli_polished_assembly.fasta -e testdata/masonread1.eventalign.tsv -t 4 --train -f testdata/masonread1.fastq
The error was produced by the following command:
mCaller.py -p testdata/test_positions.txt -r testdata/pb_ecoli_polished_assembly.fasta -e testdata/masonread1.eventalign.tsv -d model_NN_6_m6A.pkl -f testdata/masonread1.fastq
A similar error occurred with a larger data set, but I thought it would be easier to stick to the test set here. I had some permission issues before as mCaller is installed globally on our server and the script wanted to write in the original installation directory, but changing this did not fix the problem.
Best wishes, Christian
Hi Christian,
Can you send me your model file? I'll take a look. My email is abm237 at cornell.edu if you don't want to post.
If you can edit the mCaller files, it's a simple fix in extract_contexts.py
: change line 130 twobase = True
to twobase = False
.
The problem was that I was training separate models for 'AG' contexts vs. 'AA', 'AC', and 'AT', but left the default training to combine everything, which led to this incompatibility. If you have 'AG' contexts you're interested in and do want to retrain a separate model (I found performance slightly improved), then alternatively you can change the boolean in line 133 to True
and leave line 130.
Hi al-mcintyre,
thank you for the helpful response. The training is finishing now and the results with the obtained model look much better than with the 'default' model. However, I get >100,000 methylated positions, while the expected motifs are only present in a couple of thousands of them. I`m concerned that the training set was not appropriate and lead to a lot of unspecific false positives. Or how would you interpret such a results?
Many thanks & Best wishes Christian
Hi Christian, There is in general an issue with false positives when predicting methylation at many sites. Depending on motif, the false positive rate can be quite low (e.g. 5%) but still translate to 0.05*2E6 predictions at unmethylated sites = 100,000 false positives for a 2M bp genome. Two things you could try are 1) adjusting the thresholds to reduce false positives (although you may lose true positives as well), and 2) taking the intersect of predicted sites with alternative predictors like Tombo.
Hi,
I´m trying to train my own model via mCaller. The methylation sites obtained after training look promising. However, when I try to apply the new model to a dataset I get an error.
Here is the full error message based on the test data provided following the commands described in the README:
Many thanks for any clarification! Christian