arendsee / fagin

Classify genes using a syntenic filter
GNU General Public License v3.0

try to use unannotated ORF to run fagin #20

Open lijing28101 opened 7 years ago

lijing28101 commented 7 years ago

Hi @arendsee, I'm trying to use ORFs of at least 30 nt as unannotated ORFs for the focal species. I use the GFF from 2d_get_all_orfs.sh instead of search.gff. I got an orphan list by searching the unannotated proteins against the annotated and unannotated proteins of the other species. When I run fagin, it fails in make run. I think that in make run fagin just needs the orphan list, the protein sequences of the orphans, and the focal vs other species.map.tab files related to the focal species. I changed all input files for the focal species except the genomic sequence, and I also used the new search.gff to get the search intervals. I think it should work in theory. Do you know why it failed? I also posted another issue before: when I add new species and change the orphan list, fagin can fail. I'm not sure whether the reason for that failure is the same as this one.

evewurtele commented 7 years ago

30 nt encodes a really tiny protein (about 10 aa), which can't be searched well. Try 50 aa, unless you chose 30 nt for some reason.
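To make the nt versus aa thresholds concrete, here is a minimal sketch that converts a GFF interval to a protein length and filters ORF rows by a minimum number of residues. The function names are made up for illustration, and whether the intervals written by 2d_get_all_orfs.sh include the stop codon is an assumption exposed as a parameter.

```python
def orf_length_aa(start, end, includes_stop=True):
    """Protein length encoded by a GFF interval (1-based, inclusive coordinates)."""
    nt = end - start + 1
    codons = nt // 3
    # Assumption: if the interval covers the stop codon, it encodes one fewer residue.
    return codons - 1 if includes_stop else codons


def filter_gff_by_aa(lines, min_aa=50, includes_stop=True):
    """Yield GFF rows whose interval encodes at least min_aa residues."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        start, end = int(cols[3]), int(cols[4])
        if orf_length_aa(start, end, includes_stop) >= min_aa:
            yield line


rows = [
    "chr1\t.\tCDS\t1\t30\t.\t+\t0\tID=tiny",    # 30 nt
    "chr1\t.\tCDS\t1\t153\t.\t+\t0\tID=ok",     # 153 nt
]
kept = list(filter_gff_by_aa(rows, min_aa=50))
```

With these settings a 30 nt interval falls far below a 50 aa cutoff, while a 153 nt interval (50 codons plus a stop) just passes.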

lijing28101 commented 7 years ago

I have tried 50 aa; it also failed.

arendsee commented 7 years ago

Fagin requires that the genes you search are defined in the focal GFF. They must have full gene models, where one gene contains one or more mRNAs, each of which contains CDS and exons. Fagin requires all of this information.

I could possibly generalize Fagin to take arbitrary intervals. But it would be a very different analysis, so I would need to implement it as a separate mode, or at least have the user set a flag in the config. Several checks and a lot of the interpretation in the report assume the query intervals are known genes.

This may be worth doing, but it will take a lot of work.
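As an illustration of what a "full gene model" means in GFF3 terms, the sketch below checks that every mRNA is linked to a gene and has at least one exon and one CDS, and that every exon or CDS names a parent mRNA. It is a standalone check written for this thread, not code from Fagin; the function names and the exact set of rules are assumptions based on the description above.

```python
from collections import defaultdict


def parse_attrs(field):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gene_models(gff_text):
    """Return a list of problems in the gene -> mRNA -> exon/CDS hierarchy."""
    type_of = {}                  # feature ID -> feature type
    mrnas = []                    # (line number, ID or None, parent IDs)
    children = defaultdict(set)   # parent ID -> types of its exon/CDS children
    problems = []
    for n, line in enumerate(gff_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(f"line {n}: expected 9 columns, found {len(cols)}")
            continue
        ftype, attrs = cols[2], parse_attrs(cols[8])
        fid = attrs.get("ID")
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        if fid:
            type_of[fid] = ftype
        if ftype == "mRNA":
            mrnas.append((n, fid, parents))
        elif ftype in ("CDS", "exon"):
            if not parents:
                problems.append(f"line {n}: {ftype} has no Parent attribute")
            for parent in parents:
                children[parent].add(ftype)
    for n, fid, parents in mrnas:
        if fid is None:
            problems.append(f"line {n}: mRNA has no ID attribute")
            continue
        if not any(type_of.get(p) == "gene" for p in parents):
            problems.append(f"line {n}: mRNA {fid} is not linked to a gene")
        for needed in ("exon", "CDS"):
            if needed not in children.get(fid, set()):
                problems.append(f"line {n}: mRNA {fid} has no {needed}")
    for parent in children:
        if type_of.get(parent) != "mRNA":
            problems.append(f"{parent}: parent of an exon/CDS is not an mRNA")
    return problems


FULL_MODEL = "\n".join([
    "chr1\t.\tgene\t100\t400\t.\t+\t.\tID=g1",
    "chr1\t.\tmRNA\t100\t400\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\t.\texon\t100\t400\t.\t+\t.\tID=g1.t1.e1;Parent=g1.t1",
    "chr1\t.\tCDS\t130\t282\t.\t+\t0\tID=g1.t1.c1;Parent=g1.t1",
])
CDS_ONLY = "chr1\t.\tCDS\t130\t282\t.\t+\t0\tID=orf1"

full_problems = check_gene_models(FULL_MODEL)
orf_problems = check_gene_models(CDS_ONLY)
```

A GFF produced by an ORF finder typically looks like `CDS_ONLY`: one CDS row per ORF with no parent, which is the kind of input this check flags.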

evewurtele commented 7 years ago

Alternatively, we need to figure out how to add potential genes to the GFF. That's what I was suggesting to Jing and Urminder.

lijing28101 commented 7 years ago

I used the original annotated gene models when I ran make load, and then I replaced files in the input folders. search.gff was replaced by the unannotated GFF, which is the same as the GFF produced in the orf-gff folder; the DNA sequences in gene were replaced by the unannotated DNA sequences; and the protein sequences in faa were replaced by the same faa as in orf-faa. Only the files in trans-orf didn't change. After that I ran 4_get-search-intervals.sh again to get new search intervals (SI) from search.gff. I think fagin doesn't need the full gene model in make run, does it?

arendsee commented 7 years ago

Oh, adding new genes is easy. As far as Fagin is concerned, the genes in the focal species are exactly what is described in the GFF file. You just have to include the mRNA model as well as the CDS. The problem with the ngORFs is that they have a CDS but no mRNA model, so only part of the Fagin analysis can be run.

evewurtele commented 7 years ago

But the fagin analysis places each query gene in its context in the target genome. So why can't the CDS and the mRNA model be considered identical?

arendsee commented 7 years ago

@evewurtele @lijing28101, sorry for the late reply. I'm going back through old issues.

The practical problem here concerns GFF handling. If fagin sees a CDS in a GFF without a link to a parent mRNA, it dies. It doesn't know what the name of the gene is. Also, when fagin builds transcripts, it requires exon input. So each ngORF needs a CDS, an exon, and an mRNA. This seems pretty heavy since all three of these features store the same interval.

A deeper problem is that it is not really correct to just make up exons and mRNAs to go along with the CDS. For ngORFs, we don't know whether they are transcribed, let alone the structure of the gene model, so we shouldn't pretend like we do. There is actually a SOFA term that represents ngORFs: reading_frame (see here for a list of SOFA terms).

There are three solutions:

  1. I could program fagin to infer an ID for a CDS with no link to a parent.

  2. I could add handling of reading_frame GFF types.

  3. The user could format the GFF, adding an mRNA row and an exon row for each ORF.
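Option 3 can be sketched as a small GFF3 rewrite: for each CDS row that has no Parent, emit an mRNA row and an exon row over the same interval and link the CDS to the new mRNA. This is only an illustration, not part of fagin. The function name, the `.mRNA`/`.exon`/`.gene` ID suffixes, the fallback ID built from the coordinates, and the optional `add_gene` flag are all assumptions made for the example. It also does nothing about the deeper concern above: the generated rows assert a transcript structure that is not actually known.

```python
def expand_orf_rows(lines, add_gene=False):
    """Yield GFF3 rows, wrapping each parentless CDS in mRNA and exon rows."""
    for line in lines:
        row = line.rstrip("\n")
        cols = row.split("\t")
        # Pass through comments, malformed rows, and non-CDS features unchanged.
        if row.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield row
            continue
        attrs = dict(a.split("=", 1) for a in cols[8].split(";") if "=" in a)
        if "Parent" in attrs:
            yield row  # already part of a gene model
            continue
        seqid, start, end, strand = cols[0], cols[3], cols[4], cols[6]
        # Hypothetical naming scheme: reuse the CDS ID, else build one from coordinates.
        base = attrs.get("ID", f"{seqid}_{start}_{end}_{strand}")
        mrna_id = f"{base}.mRNA"

        def feature(ftype, phase, attr):
            # Same seqid, source, interval, score, and strand as the original CDS.
            return "\t".join(cols[:2] + [ftype, start, end, cols[5], strand, phase, attr])

        mrna_attr = f"ID={mrna_id}"
        if add_gene:
            gene_id = f"{base}.gene"
            yield feature("gene", ".", f"ID={gene_id}")
            mrna_attr += f";Parent={gene_id}"
        yield feature("mRNA", ".", mrna_attr)
        yield feature("exon", ".", f"ID={base}.exon;Parent={mrna_id}")
        yield feature("CDS", cols[7], f"ID={base};Parent={mrna_id}")


orf_gff = ["chr1\t.\tCDS\t130\t282\t.\t+\t0\tID=orf1"]
expanded = list(expand_orf_rows(orf_gff))
with_gene = list(expand_orf_rows(orf_gff, add_gene=True))
```

The default follows option 3 literally (mRNA, exon, CDS). Since an earlier comment describes a full model as a gene containing mRNAs, `add_gene=True` also emits a gene row; which form fagin accepts would need to be checked against the actual code.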