Closed lamz138138 closed 9 years ago
Hi,
ANGEL does things a little differently from ANGLE. In ANGLE, the training dataset comes from reference annotations (e.g., RefSeq, Gencode), which are not always available.
So instead, ANGEL offers two ways to obtain training data: by supplying an external reference (and you just run angel_train.py on it); or by using the data itself for training (hence, dumb_predict.py).
The idea is that even though PacBio Iso-Seq output may have some errors, a lot of the sequences will be close to or 100% accurate in the CDS region, so there should be sufficient features you can learn from them. Hence, dumb_predict.py implements an ORF prediction + scoring scheme that assumes no sequencing errors (the same scheme used by Transdecoder).
In many cases, the output from dumb_predict.py might be the actual ORF because there are no errors in the CDS region. But to rescue the ones that have no ORF or a broken ORF because of errors, we run ANGEL.
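To make the "no errors assumed" idea concrete, here is a minimal sketch of a longest-ORF search on the forward strand. It is not the code in dumb_predict.py: the function name `longest_orf`, the `min_aa_length` parameter, and the choice to scan only the three forward frames are assumptions made for illustration, and the Transdecoder-style scoring step is left out entirely.

```python
# Illustrative sketch only; this is NOT the implementation in dumb_predict.py.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq, min_aa_length=100):
    """Return (start, end, frame) of the longest ATG..stop ORF on the forward
    strand, or None if no ORF reaches min_aa_length amino acids.

    start is 0-based, end is exclusive and includes the stop codon.
    The sequence is taken at face value: no frameshift or indel correction.
    """
    seq = seq.upper()
    best = None  # (start, end, frame, aa_len)
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                aa_len = (i - start) // 3  # stop codon is not counted
                if aa_len >= min_aa_length and (best is None or aa_len > best[3]):
                    best = (start, i + 3, frame, aa_len)
                start = None
    return None if best is None else best[:3]
```

A single insertion or deletion inside the CDS shifts the frame, so a scan like this either truncates the ORF or misses it; those are the cases the trained ANGEL model is meant to recover.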
angel_train.py takes both the UTR & CDS from dumb_predict.py to train features for coding regions. Then angel_predict.py will use those features to look for ORFs.
Hope this makes sense.
Hi, Magdoll, thank you for your reply!
If we use dumb_predict.py, we assume all sequences are protein-coding, but there are many non-coding sequences, so the features generated may not be true coding features. I think this approach should be used with care.
Thanks again! Best wishes!
Hi, Magdoll!
I have another question about pbtranscript tofu, but I can't find how to open an issue there, so I am posting my question here.
I installed pbtranscript tofu, but when I run collapse_isoforms_by_sam.py, it outputs an error. I couldn't find an answer. What should I do?
I installed "/opt/smrtanalysis_2.3.0.140936", then ran "cd /opt" and "ln -s smrtanalysis_2.3.0.140936 smrtanalysis". Then I ran "export VENV_TOFU=/opt/zhongxm/VENV_TOFU". When I run "ls /opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu.20150120-py2.7-linux-x86_64.egg/pbtools/pbtranscript/branch/", there is no directory "C". But when I run "ls /opt/zhongxm/cDNA_primer/pbtranscript-tofu/pbtools/pbtranscript/branch/", there is a "C". So is the problem caused by the missing "C" in the first path? I tried "ln -s C " into VENV_TOFU, but it didn't work. The cDNA_primer installation ended with "Finished processing dependencies for pbtools.pbtranscript==0.3.tofu.20150120".
Below are the error and the steps I used to set up pbtranscript tofu.
The error:
Traceback (most recent call last):
File "/opt/zhongxm/VENV_TOFU/bin/collapse_isoforms_by_sam.py", line 5, in
Install pbtranscript tofu:
/opt/smrtanalysis/smrtcmds/bin/smrtshell
wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.6.tar.gz
tar zxf virtualenv-1.11.6.tar.gz -C /tmp/
export VENV_TOFU=/opt/zhongxm/VENV_TOFU
python /tmp/virtualenv-1.11.6/virtualenv.py --system-site-packages -p /opt/smrtanalysis/current/redist/python2.7/bin/python $VENV_TOFU
source $VENV_TOFU/bin/activate
download cDNA_primer-master.zip
unzip cDNA_primer-master.zip
ln -s cDNA_primer-master cDNA_primer
cd cDNA_primer/pbtranscript-tofu
make
Thanks for your help! Best wishes!
Hi lamz138138,
Good point about possible contamination with ncRNA. From what I understand, one of the major definitions of ncRNA is that you cannot get a long (> 100 aa, say) ORF from it. So it is very unlikely that you will get any ORF predictions from ncRNA, and unlikely that they will skew the feature training.
--Liz (Magdoll)
Hi lamz138138,
For the other error, can you please open a bug under cDNA_primer? (github.com/PacificBiosciences/cDNA_primer) I think I know where the problem is and may be able to solve it.
Hi, Liz!
Thank you for the reply! I couldn't find where to file a bug under cDNA_primer, so I put my question in a commit comment.
Best wishes!
Hi Liz,
I got an error when I ran collapse_isoforms_by_sam.py, see below:
Traceback (most recent call last):
File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/bin/collapse_isoforms_by_sam.py", line 5, in
My command to run it is: collapse_isoforms_by_sam.py --input all_quivered_hq.100_30_0.99.sorted.fastq --fq -s all_quivered_hq.100_30_0.99.sorted.sam -o final_output
Can you tell me what is going on? How can I fix this?
Thanks,
Jack
Hi,
Sorry for the late response. I was away on vacation and just got back.
It looks like this might be an older version of code...can you confirm:
(1) that this is the latest TOFU (last updated 2 weeks ago)
(2) can you please confirm that the first 4 lines in pbtranscript-tofu/pbtranscript/pbtools/pbtranscript/branch/C/c_branch.pyx are:

import numpy as np
cimport numpy as np
from cpython cimport bool
from pbtools.pbtranscript.modified_bx_intervals.intersection_unique import IntervalTreeUnique, Interval
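For anyone who prefers to script this check rather than eyeball it, a small sketch follows. It assumes the four statements quoted above each sit on their own line at the top of c_branch.pyx; the helper name `header_matches` is made up for this example and is not part of pbtranscript-tofu.

```python
# Hypothetical helper for checking the top of c_branch.pyx; not part of TOFU.
EXPECTED_HEADER = [
    "import numpy as np",
    "cimport numpy as np",
    "from cpython cimport bool",
    "from pbtools.pbtranscript.modified_bx_intervals.intersection_unique "
    "import IntervalTreeUnique, Interval",
]


def header_matches(path, expected=EXPECTED_HEADER):
    """Return True if the first len(expected) lines of the file at `path`
    equal `expected`, ignoring trailing whitespace on each line."""
    with open(path) as handle:
        head = [handle.readline().rstrip() for _ in expected]
    return head == list(expected)
```

Call it with the path to c_branch.pyx in your own checkout; a False result suggests the installed copy is an older version.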
Hi Liz,
I'm currently using ANGEL to predict ORFs. I use CDS and UTR sequences annotated by RefSeq. However, I have a question. If I am right, the angel_make_training_set script removes redundancy in the CDS sequences and takes 500 sequences as training data. So, is this step necessary if I have RefSeq .utr and .cds files? I think it might be needed to speed up the classifier training step, but I'm not sure. By the way, is it true that the more sequences you have in your training data, the better your prediction will be?
Thanks in advance
Lorena
Hi Lorena,
If the refseq UTR and CDS are already non-redundant, then you do NOT need to run make training set! You can directly use angel_train.py.
More sequences (or more precisely, diversity) should result in a more robust model. But you also do not want to take forever to train. The current training code is pretty slow, so I use either 500 sequences, or if it looks really too slow, I go down to 250 and the results are still decent for prediction.
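As a rough illustration of capping the training set size, the sketch below drops exact duplicate CDS sequences and then draws a fixed-size random sample. This is a simplification and not what angel_make_training_set does internally: the function name `subsample_training_ids`, the dict input, and the exact-match notion of redundancy are all assumptions made for the example.

```python
import random


def subsample_training_ids(cds_by_id, n=500, seed=0):
    """Pick up to n sequence IDs whose CDS sequences are all distinct.

    cds_by_id maps sequence ID -> CDS string. Exact duplicates (case-insensitive)
    are collapsed to the first ID in sorted order; this is only a stand-in for a
    real redundancy filter.
    """
    seen = set()
    unique_ids = []
    for seq_id in sorted(cds_by_id):
        seq = cds_by_id[seq_id].upper()
        if seq not in seen:
            seen.add(seq)
            unique_ids.append(seq_id)
    if len(unique_ids) <= n:
        return unique_ids
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return sorted(rng.sample(unique_ids, n))
```

Dropping `n` from 500 to 250 trades some diversity for a shorter training run, which is the trade-off described above.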
Hi Liz,
Thanks for your help!!
Best wishes,
Lorena
Hi Liz,
I have a question about Iso-Seq rRNA contamination. I have the reads-of-insert FASTA file.
If I want to detect how many reads can be mapped to rRNA, what aligner should I use to do the alignment? Do you think I can use BLASR? Do you have any suggestion on this?
Thanks,
Jianpeng
Hi Jianpeng,
You can use BLASR to align to rRNA.
Using the official cDNA protocol, you should have little to no rRNA contamination.
Also, I believe I have opened up "Issues" for the cDNA_primer repository (https://github.com/PacificBiosciences/cDNA_primer/issues). For future issues related to Iso-Seq, please try to use that instead!
Thanks, --Liz
Thanks, Liz. I will use cDNA_primer in the future.
My Iso-Seq data is from a fungus. Which rRNA data should I align my reads to? Should I align my Iso-Seq reads-of-insert FASTA file to an rRNA database?
I collected an rRNA database that includes 115k rRNA FASTA sequences from different species. Can I align my reads to this rRNA database?
Thanks again,
Jianpeng
Hi Liz,
I downloaded pbtranscript-tofu not very long ago. But in the directory /cDNA_primer/pbtranscript-tofu, there are 3 folders: _pbtranscript, _pbtranscript_20150106_forYli, _pbtranscriptold
Which one should I use?
Thanks,
Jianpeng
Hi,
Please use pbtranscript/. The other two should really not be there... they are OLD archives. My bad :)
For fungal rRNA: I don't know how different rRNA from different species is. If you have rRNA from exactly the same species, use just that. Otherwise you certainly can use all of them; just remember there will be some false positive hits.
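Once the reads have been aligned to an rRNA reference, the contamination estimate comes down to counting how many distinct reads have at least one hit. The sketch below assumes the aligner output is in SAM format and that you already know the total number of reads in the FASTA file; the function name `fraction_reads_mapped` is made up for this example, and no aligner options are implied.

```python
def fraction_reads_mapped(sam_lines, total_reads):
    """Return the fraction of reads with at least one mapped SAM record.

    sam_lines: iterable of SAM text lines (header lines starting with '@' are skipped).
    total_reads: number of reads given to the aligner (e.g. counted from the FASTA).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    mapped = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # SAM column 2 is the bitwise FLAG
        if not flag & 0x4:     # bit 0x4 set means the read is unmapped
            mapped.add(fields[0])  # column 1 is the read name; count each read once
    return len(mapped) / float(total_reads)
```

With a mixed-species rRNA database, some of these hits will be false positives, so the result is better read as an upper estimate; filtering on alignment length or identity before counting would tighten it.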
Hi!
I have studied the ANGEL pipeline, but I am confused about some steps. My questions are:
1) In the ANGLE paper, error-free data are needed for the classifier and error-containing data for estimating the Markov chain parameters, but ANGEL has only one training step. Is this because ANGEL uses only one training data set?
2) ANGEL produces dumb ORFs first, then creates a non-redundant training data set. Does this mean all sequences are treated as protein-coding? There are many non-coding sequences in the genome, so are these two steps reasonable?
3) If the answer to 2) is that they are not reasonable, can we use UTR and CDS sequences annotated by RefSeq as input to angel_train.py? And does that mean the dumb ORF prediction is useless?
Thanks for any suggestion!
Best wishes!