PacificBiosciences / ANGEL

Robust Open Reading Frame prediction (ANGLE re-implementation)

Some problems about ANGEL.cds #14

Closed Trandamere closed 4 years ago

Trandamere commented 6 years ago

Hi: I have two problems with ANGEL.cds. These are my command lines:

dumb_predict.py --use_rev_strand remainpolyT.fasta remainpolyT.dumb --cpus 10

angel_make_training_set.py remainpolyT.dumb.final remainpolyT.dumb.final.traning --cpus 10

angel_train.py remainpolyT.dumb.final.traning.cds remainpolyT.dumb.final.traning.utr remainpolyT.dumb.final.classifier.pickle

angel_predict.py remainpolyT.fasta remainpolyT.dumb.final.classifier.pickle remainpolyT --use_rev_strand --output_mode=best --cpus 10

Then I got results like this:

>LF210511/f1p0/4173|m.790 type:suspicious-NA len:257 strand:- pos:827-1597
ATGAGTAACCGCCATCTGCCGGCTGGGCAGAATATACAGGGAGGATCTGGCGTCCTAGGTGCCGACATGGTCGGTCCTGGAGGGCCTCGTCGGAGGCAGCCTCCTCCCTTTGTTCCCCAGTCCCAGTACCAGCAGCAACATCATCACCATCAAGCCGTAAATCACATGTATAACAACAACTACATGAACTATGGACAGCAGCAGTATTATGGATACCCGCCGCAGTATCAGACAGGTCACTACCAGAACGCTCAGTACCACAACGCGCAGTATCAAGGTGGACAATACCAAGCTGCACAGTTCCAGAATGGCCAGTACCAAAACGCACAGTACCACAACGCCGGAATGCCTTCACCCGGTGCTTATATGGGCTACCAGCAGCACTACGGACGATCGCCGCCCGTTCACCAGTTTGTCCCCATGTCTGGTGTGAGCGTACCCCCGAGCTTCCCAACCCGCCCAGCTCAGCAACAATCTCCTGCTCTGCCGACTCAGCCTCCTGCTCCAGCCTCACTTCCACCCCAGACTCCTACTTCAACCCACTCGTCGCAGATAATTCCTACTTCAACCCCCCCGGTCACGCAGGAGACTGAGCCAGCACCCCCCGCTCCTCCTGTTGCCCCCGCTGAGCCCCCACGACAACCTTCACCCGTTCCTGTCCCTGCTGCTGCTGCTGTTCCTGCCCCTGTTCATGTTCATGTTCATATTCCTGTTCCACAGGAACCATTCCGTGCACCTGTAAGTCTTGGCAGTTTCAATGTATTAAACTAA
>LF210511/f1p0/4173|m.791 type:suspicious-NA len:275 strand:- pos:2-826
ACTAACTTCGCTCAGCTGCCATGGTACTCTCACCCGGATGAAAAGTTCCCTGTTCGAACTAAAAGGCCGGGGCGATGGAGGAAGCGTCTCAATGCGGACAGTGCAAATGTTTCCCTGCCGGCTATTGACCAACATAACGCTGCTGCAGAGCAGGCCAGCGTTCCCGAAGCCAGCTCTACCGAACCTTCTGTCTCGGCCCTTACACCCGCGACATCAACGGCTCCATCGGAGGCTGCGGCAACTCCTCGCCAGTCTTCCGAGACTCCTGCTTCTGTTCAGCAACGATCACGCGCCAACACCGCCACCAGCGCTACCTCAACTTCGACGAACCGTCCTGCCACACGCTCCTCCGCTACTCCCGCCCCTGCTCTTCCATCGCTTCCTAAGGCAAACACTAAGGATGCTAAGCCTGCACGTGCTGAAAAGCCGGTAAACGGCGACGCAGCTACCGAAAGTGCCCCTGAGCAGGAGGTCACCGCTGAAGATTCCGAGAAGCCCGCGGAGTCTGAATCAACCGCTGCTGGGCCAGCTCCTGCTGTCAAAGCTCCACCTTCTAGTTGGGCGAAGCTTTTCTCGAAGCCCGCTTCTGCAGCTGCTGGAAAGACTGAGGAGTCTAATGGCGCCGCTCCCGTTGACACTGTTGCTAATGGCCGTGCCACCGAAAGCCCTGCTGGAACCCCTAATGGAGCTGCTCCCAGCTTCTCGAAAGTTAACGCCAACTCCGTTGCGGAGGCTATTCACACGTTCCATGTTGGTCTCGCGGATCAAGTTTCATTCCTCGAGCCCCGCGGTCTGATCAACACCGGGAACATGTGTTACATGAAC
>LF210511/f1p0/4173|m.792 type:suspicious-NA len:136 strand:- pos:2332-2739
ATGCCCAAGTACAAGTTGATTAGCGTGGTGTACCATCATGGTAAGAACGCTAGTGGTGGACATTACACTGTCGATGTGCGACGACAGGAAGGGCGCGAGTGGATTCGTATTGATGATACTTCCATCCGCCGAGTTCGAAGTGAAGATGTCGCTGAGGGCGGCGAAGAGGAGGAAGTAAAGAATACTCGTAAGGATGGCTCTTCATTGGGCAACCGGTTCGGTGCTGTTCTGGACGAAGACGCTGGAGATGATGACGGATGGAGCAAGGTCACTAGCCCTGCTGGAGGAGCAAAGAAATGGAGCAGCGTTGCCAACGGTACCAACGGCACTCCCAAGGCCGCCAAGCCGATCAAGGATAACATCAAGGACAACAAGGTTGCCTACCTGCTCTTCTACCAACGAGTATAA

First, I set the option --output_mode=best. Why does LF210511/f1p0/4173 have three CDS? Second, shouldn't every predicted CDS begin with ATG?

Magdoll commented 6 years ago

Hi @Trandamere ,

Can you send me the input sequence of LF210511/f1p0/4173?

This is almost the correct behavior. The output mode controls how the ORFs predicted by dumb predict and by ANGEL (smart) predict are combined. output_mode=all means output both the dumb and the ANGEL ORFs, even if they are identical. output_mode=best means output only the dumb ORFs or only the ANGEL ORFs, depending on which is longer.

Since you also said --use_rev_strand, the program is comparing dumb on + strand, dumb on - strand, ANGEL on + strand, and ANGEL on - strand and picking the one that seems best. In this case it picked "ANGEL - strand".
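To make the comparison above concrete, here is a minimal sketch of picking one winner among the four candidate prediction sets. It is not ANGEL's actual code: the `Candidate` type, the `pick_best` function, and the "largest total ORF length" criterion are assumptions for illustration, and the lengths for the three losing candidates are made up. Only the ANGEL - strand lengths (257, 275, 136) come from the output in this thread.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    method: str                   # "dumb" or "ANGEL"
    strand: str                   # "+" or "-"
    orf_lengths: Tuple[int, ...]  # predicted ORF lengths in amino acids


def pick_best(candidates):
    """Return the candidate with the largest total ORF length.

    Assumed criterion for illustration; ANGEL's real scoring may differ.
    """
    return max(candidates, key=lambda c: sum(c.orf_lengths))


candidates = [
    Candidate("dumb", "+", (120,)),            # made-up length
    Candidate("dumb", "-", (275,)),            # made-up length
    Candidate("ANGEL", "+", (150,)),           # made-up length
    Candidate("ANGEL", "-", (257, 275, 136)),  # lengths reported in this thread
]

best = pick_best(candidates)
print(best.method, best.strand)
```

The point of the sketch is that the winner is a whole prediction set, not a single ORF, which is why one transcript can still end up with several records under output_mode=best.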

My concern here is that ANGEL should have output only one ORF, spanning pos 2-1597: the first two ORFs (pos 2-826 and 827-1597) are directly adjacent, while the last one (pos 2332-2739) is more than 700 bp away and should have been excluded.
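The adjacency can be checked directly from the header lines. The sketch below parses the `pos:start-end` field of each record and reports the gap, in bases, between consecutive ORFs on the same transcript. The regular expression and the `orf_gaps` helper are assumptions based only on the three headers shown in this thread, with coordinates treated as 1-based and inclusive (consistent with each `len` being exactly one third of the span).

```python
import re
from collections import defaultdict

# Matches headers such as:
# >LF210511/f1p0/4173|m.790 type:suspicious-NA len:257 strand:- pos:827-1597
HEADER_RE = re.compile(
    r"^>(?P<seq>\S+)\|m\.(?P<idx>\d+)\s.*"
    r"strand:(?P<strand>[+-])\s+pos:(?P<start>\d+)-(?P<end>\d+)"
)


def orf_gaps(headers):
    """Map each transcript ID to the gaps (in bases) between its sorted ORFs."""
    spans_by_seq = defaultdict(list)
    for header in headers:
        match = HEADER_RE.match(header)
        if match is None:
            continue  # skip lines that do not look like ANGEL.cds headers
        spans_by_seq[match["seq"]].append(
            (int(match["start"]), int(match["end"]))
        )

    gaps = {}
    for seq, spans in spans_by_seq.items():
        spans.sort()
        # Inclusive coordinates: a gap of 0 means the ORFs touch.
        gaps[seq] = [nxt[0] - cur[1] - 1 for cur, nxt in zip(spans, spans[1:])]
    return gaps


headers = [
    ">LF210511/f1p0/4173|m.790 type:suspicious-NA len:257 strand:- pos:827-1597",
    ">LF210511/f1p0/4173|m.791 type:suspicious-NA len:275 strand:- pos:2-826",
    ">LF210511/f1p0/4173|m.792 type:suspicious-NA len:136 strand:- pos:2332-2739",
]
print(orf_gaps(headers))  # {'LF210511/f1p0/4173': [0, 734]}
```

A gap of 0 flags two ORFs that could be one continuous reading frame, while the 734-base gap marks the third ORF as clearly separate.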

ANGEL allows non-ATG beginnings because 5' ends can be degraded in Iso-Seq data so the start codon could be missing.
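To see which predicted CDS lack a start codon, a small script like the following can help. It is a minimal sketch with hypothetical helper names (`read_fasta`, `missing_start_codon`) and uses only the standard library. The example input holds just the first nine bases of each record from this thread, not the full sequences.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Record ID is the header text up to the first whitespace.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def missing_start_codon(records):
    """Return IDs of records whose sequence does not begin with ATG."""
    return [name for name, seq in records if not seq.upper().startswith("ATG")]


# First nine bases of each CDS shown earlier in this thread.
example = """\
>LF210511/f1p0/4173|m.790 type:suspicious-NA len:257 strand:- pos:827-1597
ATGAGTAAC
>LF210511/f1p0/4173|m.791 type:suspicious-NA len:275 strand:- pos:2-826
ACTAACTTC
>LF210511/f1p0/4173|m.792 type:suspicious-NA len:136 strand:- pos:2332-2739
ATGCCCAAG
""".splitlines()

print(missing_start_codon(read_fasta(example)))  # ['LF210511/f1p0/4173|m.791']
```

Here only m.791 lacks an ATG start. Its span begins at pos 2, the far end of the predicted region, which is where a truncated transcript would lose its start codon.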

--Liz