PacificBiosciences / ANGEL

Robust Open Reading Frame prediction (ANGLE re-implementation)

Questions about ANGEL, thanks! #2

Closed lamz138138 closed 9 years ago

lamz138138 commented 9 years ago

Hi!

I have studied the ANGEL pipeline, but some steps confuse me. My questions are:

1) In the ANGLE paper, error-free data is used to train the classifier and error-containing data is used to estimate the parameters of the Markov chains, but ANGEL has only one training step. Is this because ANGEL uses only one training data set?
2) ANGEL first produces dumb ORFs and then creates a non-redundant training data set. Does this assume that all sequences are protein-coding? There are many non-coding sequences in a genome, so are these two steps reasonable?
3) If the answer to 2) is that it is not reasonable, can we use UTR and CDS sequences annotated by RefSeq as input to angel_train.py? And would that mean the dumb ORF prediction is useless?

Thanks for any suggestion!

Best wishes!

Magdoll commented 9 years ago

Hi,

ANGEL does things a little differently from ANGLE. In ANGLE, the training dataset comes from reference annotations (e.g. RefSeq, Gencode), which are not always available.

So instead, ANGEL offers two ways to obtain training data: by providing an external reference (and you just run angel_train.py on it); or by using the data itself as training (hence, dumb_predict.py).

The idea is that even though PacBio Iso-Seq output may have some errors, many of the sequences will be close to or 100% accurate in the CDS region, so there should be sufficient features you can learn from them. Hence, dumb_predict.py implements an error-less ORF prediction + scoring scheme (the same scheme used by TransDecoder).

In many cases, the output from dumb_predict.py might be the actual ORF because there are no errors in the CDS region. But to rescue the sequences that have no ORF or a broken ORF because of errors, we run ANGEL.

angel_train.py takes both the UTR & CDS from dumb_predict.py to train features for coding regions. Then angel_predict.py will use those features to look for ORFs.
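As a rough illustration of the "error-less" step described above, the sketch below finds the longest ATG-to-stop ORF on the forward strand and splits a transcript into UTR and CDS parts. It is a simplified stand-in, not the code in dumb_predict.py: the function names `longest_orf` and `split_utr_cds` are hypothetical, and the scoring scheme mentioned above is not reproduced.

```python
# Simplified illustration only; NOT the dumb_predict.py implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # first ATG seen since the previous in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


def split_utr_cds(seq):
    """Split a transcript into (5' UTR, CDS, 3' UTR) around its longest ORF."""
    orf = longest_orf(seq)
    if orf is None:
        return None
    start, end = orf
    return seq[:start], seq[start:end], seq[end:]
```

Because this approach assumes the CDS is error-free, a single insertion or deletion shifts the frame and truncates or removes the ORF, which is the case ANGEL is meant to handle.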

Hope this makes sense.

lamz138138 commented 9 years ago

Hi, Magdoll, thank you for your reply!

If we use dumb_predict.py, we assume that all sequences are protein-coding. But there are many non-coding sequences, so the features learned may not be true coding features. I think this approach should be used with care.

Thanks again! Best wishes!

lamz138138 commented 9 years ago

Hi, Magdoll!

I have another question, about pbtranscript tofu, but I can't find how to open an issue there, so I am posting my question here.

I installed pbtranscript tofu, but when I run collapse_isoforms_by_sam.py it fails with an error that I have not been able to resolve. What should I do?

I installed "/opt/smrtanalysis_2.3.0.140936", then ran "cd /opt" and "ln -s smrtanalysis_2.3.0.140936 smrtanalysis". Then I ran "export VENV_TOFU=/opt/zhongxm/VENV_TOFU". When I run "ls /opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu.20150120-py2.7-linux-x86_64.egg/pbtools/pbtranscript/branch/", there is no directory "C". But when I run "ls /opt/zhongxm/cDNA_primer/pbtranscript-tofu/pbtools/pbtranscript/branch/", there is a "C". Could the problem be caused by the missing "C" in the first path? I tried "ln -s C" into VENV_TOFU, but it didn't work. The cDNA_primer installation ended with "Finished processing dependencies for pbtools.pbtranscript==0.3.tofu.20150120".

Below are the error and the steps I used to set up pbtranscript tofu.

The error:

    Traceback (most recent call last):
      File "/opt/zhongxm/VENV_TOFU/bin/collapse_isoforms_by_sam.py", line 5, in <module>
        pkg_resources.run_script('pbtools.pbtranscript==0.3.tofu.20150120', 'collapse_isoforms_by_sam.py')
      File "/opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pkg_resources.py", line 534, in run_script
        self.require(requires)[0].run_script(script_name, ns)
      File "/opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pkg_resources.py", line 1434, in run_script
        execfile(script_filename, namespace, namespace)
      File "/opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu.20150120-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 42, in <module>
        from pbtools.pbtranscript.branch import branch_simple2
      File "/opt/zhongxm/VENV_TOFU/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu.20150120-py2.7-linux-x86_64.egg/pbtools/pbtranscript/branch/branch_simple2.py", line 4, in <module>
        import pbtools.pbtranscript.c_branch as c_branch
      File "c_branch.pyx", line 4, in init c_branch (pbtools/pbtranscript/branch/C/c_branch.c:5027)
    ImportError: No module named modified_bx_intervals.intersection_unique

Install pbtranscript tofu:

    /opt/smrtanalysis/smrtcmds/bin/smrtshell
    wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.6.tar.gz
    tar zxf virtualenv-1.11.6.tar.gz -C /tmp/
    export VENV_TOFU=/opt/zhongxm/VENV_TOFU
    python /tmp/virtualenv-1.11.6/virtualenv.py --system-site-packages -p /opt/smrtanalysis/current/redist/python2.7/bin/python $VENV_TOFU
    source $VENV_TOFU/bin/activate
    (download cDNA_primer-master.zip)
    unzip cDNA_primer-master.zip
    ln -s cDNA_primer-master cDNA_primer
    cd cDNA_primer/pbtranscript-tofu
    make

Thanks for your help! Best wishes!

Magdoll commented 9 years ago

Hi lamz138138,

Good point about possible contamination with ncRNA. From what I understand, one of the major defining properties of ncRNA is that it does not contain a long ORF (> 100 aa, say). So it is very unlikely that you will get any ORF predictions from ncRNA, and unlikely that they will skew the feature training.
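To make the length criterion above concrete, here is a minimal sketch that flags a sequence as likely coding only if its longest forward-strand ORF exceeds 100 amino acids. The 100 aa threshold comes from the comment above; the function names `longest_orf_aa` and `likely_coding` are hypothetical, and this is not how ANGEL itself filters sequences.

```python
import re

# For every ATG, capture the shortest in-frame stretch ending in a stop codon.
# The lookahead lets overlapping candidates (all three frames) be reported.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def longest_orf_aa(seq):
    """Length in amino acids (stop codon excluded) of the longest forward ORF."""
    lengths = [len(m.group(1)) // 3 - 1
               for m in ORF_PATTERN.finditer(seq.upper())]
    return max(lengths, default=0)


def likely_coding(seq, min_aa=100):
    """True if the longest ORF is longer than min_aa amino acids."""
    return longest_orf_aa(seq) > min_aa
```

Only the forward strand is scanned here, and ORFs containing ambiguous bases such as N are not matched; both are simplifications for the sake of a short example.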

--Liz (Magdoll)

Magdoll commented 9 years ago

Hi lamz138138,

For the other error, can you please open a bug under cDNA_primer? (github.com/PacificBiosciences/cDNA_primer) I think I know where the problem is and may be able to solve it.

lamz138138 commented 9 years ago

Hi, Liz!

Thank you for the reply! I couldn't find where to open a bug under cDNA_primer, so I posted my question as a comment on a commit.

Best wishes!

jxu006 commented 9 years ago

Hi Liz,

I got an error when I ran collapse_isoforms_by_sam.py, see below:

    Traceback (most recent call last):
      File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/bin/collapse_isoforms_by_sam.py", line 5, in <module>
        pkg_resources.run_script('pbtools.pbtranscript==0.3.tofu', 'collapse_isoforms_by_sam.py')
      File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/lib/python2.7/site-packages/pkg_resources.py", line 534, in run_script
        self.require(requires)[0].run_script(script_name, ns)
      File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/lib/python2.7/site-packages/pkg_resources.py", line 1434, in run_script
        execfile(script_filename, namespace, namespace)
      File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 42, in <module>
        from pbtools.pbtranscript.branch import branch_simple2
      File "/global/scratch2/sd/jianpeng/PacBio/cDNA_primer/bin/lib/python2.7/site-packages/pbtools.pbtranscript-0.3.tofu-py2.7-linux-x86_64.egg/pbtools/pbtranscript/branch/branch_simple2.py", line 4, in <module>
        import pbtools.pbtranscript.c_branch as c_branch
      File "c_branch.pyx", line 4, in init c_branch (pbtools/pbtranscript/branch/C/c_branch.c:5027)
    ImportError: No module named modified_bx_intervals.intersection_unique

My command to run it is: collapse_isoforms_by_sam.py --input all_quivered_hq.100_30_0.99.sorted.fastq --fq -s all_quivered_hq.100_30_0.99.sorted.sam -o final_output

Can you tell me what is going on and how to fix it?

Thanks,

Jack
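One way to narrow down an ImportError like the one in the traceback above is to test, from inside the same virtualenv, whether the missing module can be imported at all. The helper below is a generic sketch (the name `can_import` is hypothetical); it only uses `importlib.import_module` from the standard library, which is also available in Python 2.7.

```python
import importlib


def can_import(module_name):
    """Return True if the dotted module name imports cleanly, else False."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Module named in the ImportError message of the traceback.
MISSING = "pbtools.pbtranscript.modified_bx_intervals.intersection_unique"
```

Calling `can_import(MISSING)` with the virtualenv's Python would show whether the installed egg contains the module; a False result points to an incomplete or outdated install rather than a problem with the input files.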

Magdoll commented 9 years ago

Hi,

Sorry for the late response. I was away on vacation and just got back.

It looks like this might be an older version of code...can you confirm:

(1) that this is the latest TOFU (last updated 2 weeks ago)
(2) that the first 4 lines in pbtranscript-tofu/pbtranscript/pbtools/pbtranscript/branch/C/c_branch.pyx are:

    import numpy as np
    cimport numpy as np
    from cpython cimport bool
    from pbtools.pbtranscript.modified_bx_intervals.intersection_unique import IntervalTreeUnique, Interval
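The check requested above can also be scripted. The sketch below compares the first four lines of a file against the four lines quoted in the comment; `header_matches` is a hypothetical helper name, and it accepts any iterable of lines (an open file handle or a list).

```python
from itertools import islice

# The four lines quoted in the comment above.
EXPECTED_HEADER = [
    "import numpy as np",
    "cimport numpy as np",
    "from cpython cimport bool",
    "from pbtools.pbtranscript.modified_bx_intervals.intersection_unique "
    "import IntervalTreeUnique, Interval",
]


def header_matches(lines, expected=EXPECTED_HEADER):
    """True if the first len(expected) lines equal expected, ignoring edge whitespace."""
    head = [line.strip() for line in islice(lines, len(expected))]
    return head == list(expected)
```

For example, `with open(path_to_pyx) as handle: print(header_matches(handle))`, where `path_to_pyx` points at the c_branch.pyx file in the local checkout.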

Lodela89 commented 9 years ago

Hi Liz,

I'm currently using ANGEL to predict ORFs, with CDS and UTR sequences annotated by RefSeq as training input. However, I have a question. If I am right, the angel_make_training_set script removes redundancy among the CDS sequences and takes 500 sequences as training data. Is this step necessary if I already have RefSeq .utr and .cds files? I think it might still be needed to speed up the classifier training step, but I'm not sure. By the way, does having more sequences in the training data give better predictions?

Thanks in advance

Lorena

Magdoll commented 9 years ago

Hi Lorena,

If the refseq UTR and CDS are already non-redundant, then you do NOT need to run make training set! You can directly use angel_train.py.

More sequences (or more precisely, diversity) should result in a more robust model. But you also do not want to take forever to train. The current training code is pretty slow, so I use either 500 sequences, or if it looks really too slow, I go down to 250 and the results are still decent for prediction.
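If an external CDS/UTR set is already non-redundant but too large to train on in reasonable time, one option is to subsample it to the 500 (or 250) sequences mentioned above. The sketch below is only an illustration of that idea: it is not angel_make_training_set, it does no redundancy removal, and it assumes (hypothetically) that matching CDS and UTR records share the same FASTA id.

```python
import random


def read_fasta(lines):
    """Parse FASTA lines into a dict of {record id: sequence}."""
    records, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif line and current is not None:
            records[current] += line
    return records


def subsample_pairs(cds, utr, n=500, seed=0):
    """Pick up to n ids present in both dicts and return the matching subsets."""
    shared = sorted(set(cds) & set(utr))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    chosen = shared if len(shared) <= n else rng.sample(shared, n)
    return {k: cds[k] for k in chosen}, {k: utr[k] for k in chosen}
```

Random sampling keeps the run time bounded but does nothing for diversity, so it is only a reasonable shortcut when the input really is non-redundant to begin with.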

Lodela89 commented 9 years ago

Hi Liz,

Thanks for your help!!

Best wishes,

Lorena

jxu006 commented 9 years ago

Hi Liz,

I have a question about rRNA contamination in Iso-Seq data. I have the reads-of-insert FASTA file.

If I want to find out how many reads map to rRNA, which aligner should I use? Do you think I can use BLASR? Do you have any suggestions?

Thanks,

Jianpeng

Magdoll commented 9 years ago

Hi Jianpeng,

You can use BLASR to align to rRNA.

Using the official cDNA protocol, you should have little to no rRNA contamination.

See here: https://github.com/PacificBiosciences/cDNA_primer/wiki/Iso%E2%80%90Seq-protocol%3A-Bioinformatics-study-of-common-concerns

Also, I believe I have enabled the "Issues" page for the cDNA_primer repository (https://github.com/PacificBiosciences/cDNA_primer/issues). For future issues related to Iso-Seq, please try to use that instead!

Thanks, --Liz

jxu006 commented 9 years ago

Thanks, Liz. I will use cDNA_primer in the future.

My Iso-Seq data is from a fungus. Which rRNA sequences should I align my reads to? Should I align my Iso-Seq reads-of-insert FASTA file to an rRNA database?

I collected an rRNA database that includes 115k rRNA FASTA sequences from different species. Can I align my reads to this database?

Thanks again,

Jianpeng

jxu006 commented 9 years ago

Hi Liz,

I downloaded pbtranscript-tofu not very long ago. But in the directory /cDNA_primer/pbtranscript-tofu there are 3 folders: _pbtranscript, _pbtranscript_20150106_forYli, _pbtranscriptold

Which one should I use?

Thanks,

Jianpeng

Magdoll commented 9 years ago

Hi,

Please use pbtranscript/. The other two really should not be there; they are OLD archives. My bad :)

Magdoll commented 9 years ago

For fungal rRNA: I don't know how much rRNA differs between species. If you have rRNA from exactly the same species, use just that. Otherwise you certainly can use all of them; just remember there will be some false-positive hits.
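Once the reads have been aligned to an rRNA reference, the number of reads that hit rRNA can be counted from the alignments. The sketch below assumes the alignments are available in SAM format and counts distinct read names with at least one mapped record (SAM flag bit 0x4 unset). The function names are hypothetical, and no aligner options are implied.

```python
def mapped_read_names(sam_lines):
    """Return the set of read names with at least one mapped SAM record."""
    mapped = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed records
        name, flag = fields[0], int(fields[1])
        if not flag & 0x4:  # 0x4 is the SAM "segment unmapped" bit
            mapped.add(name)
    return mapped


def rrna_fraction(sam_lines, total_reads):
    """Fraction of all input reads that have at least one rRNA alignment."""
    if total_reads == 0:
        return 0.0
    return len(mapped_read_names(sam_lines)) / float(total_reads)
```

Counting names rather than records avoids inflating the estimate when one read aligns to several similar rRNA entries, which is likely with a multi-species database. As noted above, some of these hits may still be false positives.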