Closed callumparr closed 3 years ago
Sorry for you inconvenience. The error handling is still in progress.
xpore-diffmod will output all the positions that are differentially modified even slightly.
You are right that some positions of a gene don't pass the t-test, so they will not be in the output. The t-test option is just used to speed up xpore-diffmod, and of course not all differently modified positions are tested, resulting in lower sensitivity. In my paper, I don't use this option to produce the results, but of course the processing time will be longer because there are more positions to be modelled by xPore.
Regarding to your error, can I ask if you run on DNA? Because xPore uses prior information, called model_kmer, saying about the mean and std for each kmer, this information should be consistent with your kmer e.g. 5-mers or 6-mers. The default model_kmer file is stored internally in the package, but this is 5-mer information. In the next version, we will allow users to specify the path to their own model_kmer file in the config file.
Hey, thanks for the speedy reply.
So I should expect that the no. of hdf5 model files to equal the number of processed genes in the diffmod.log?
It is direct RNA data. It got to about half way through the data set then got this kmer with N.
I tried to resume but same issue. Would there be a way to track down which position and read is causing the issue so I can remove it from data set.
Yes, if you use --save_models, you should get the number of .hdf5 files equal to the number of processed genes.
xpore-dataprep doesn't check if there is "N" in the kmer. So, you can remove any invalid kmers from nanopolish output and rerun xpore-dataprep again. If you remove it from data.json, it would require data.index to be modified according, which is difficult. So, I suggest to modify the nanopolish output by removing the entire row containing invalid kmer.
Sounds good. Although with eventalign file the size of 300Gb that might take a while.
Here I searched on the reference k-mer in the eventalign.txt file, I think this is correct, and output the line it appears on if it does.
awk -F "\t" '{print $3}' /analysisdata/rawseq/bcl/callum/Mouse_aging/eventalign/Day3_06_DRS/Day3_06_pass_eventalign.txt | fgrep -n "CCGNC"
I cannot find any in the 4 input libraries prepared with data-prep for analysis with diffmod.
Also, nanopolish outputs "NNNNN" in the model k-mer if it thinks the signal is unexpected/weird. I guess this doesn't matter forxpore because there are many entries for this type.
xpore-dataprep already removes the "NNNNN" Kmers. But it's strange that you can't find "CCGNC".
So I know everything worked well when I just ran on one replicate for each condition and not using --genome flag so I will try dataprep step again this time without --genome but more replicates.
I noticed something weird when I ran salmon on some other library where the log output said about removing non ACTG sequences. I could try it on these libraries and see if get similar warning.
Running through qsub. Error happens and then job hangs without exiting. Similar to other issues which cause errors. No further genes processed for one day so killed it. Restarted with -resume but unsure if this also affected the model output. Should the number of models output be the same as the number of genes processed? Some genes don't pass the t-test so in this case they don't output a model?