Open lijing28101 opened 7 years ago
30 nt is a really tiny protein and can't be searched well. Try 50 aa, unless you chose 30 nt for some reason.
I have tried 50 aa; it also failed.
Fagin requires that the genes you search are defined in the focal GFF. They must have full gene models, where one gene contains one or more mRNAs, each of which contains CDS and exons. Fagin requires all this information.
I could possibly generalize Fagin to take arbitrary intervals. But it would be a very different analysis, so I would need to implement it as a separate mode, or at least have the user set a flag in the config. Several checks and a lot of the interpretation in the report assume the query intervals are known genes.
This may be worth doing. But will take a lot of work.
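The gene-model requirement described above (a gene containing mRNAs, each with CDS and exon children) can be checked before running Fagin. The following is a minimal sketch, not Fagin's own validation: it assumes a GFF3 file whose ninth column uses `ID=` and `Parent=` attributes with a single parent per feature, and the function names `parse_attributes` and `find_incomplete_models` are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_incomplete_models(gff_text):
    """Return a list of problems found in the gene models of a GFF3 string.

    Assumption: every CDS and exon names exactly one mRNA in Parent,
    and every mRNA names exactly one gene.
    """
    genes = set()
    mrna_parent = {}             # mRNA ID -> parent gene ID (None if absent)
    children = defaultdict(set)  # parent ID -> feature types of its children
    problems = []

    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(f"malformed row: {line!r}")
            continue
        ftype, attrs = cols[2], parse_attributes(cols[8])
        if ftype == "gene":
            genes.add(attrs.get("ID"))
        elif ftype == "mRNA":
            mrna_parent[attrs.get("ID")] = attrs.get("Parent")
        elif ftype in ("CDS", "exon"):
            parent = attrs.get("Parent")
            if parent is None:
                where = f"{cols[0]}:{cols[3]}-{cols[4]}"
                problems.append(f"{ftype} at {where} has no Parent")
            else:
                children[parent].add(ftype)

    for mrna, gene in mrna_parent.items():
        if gene not in genes:
            problems.append(f"mRNA {mrna} has no gene parent in the file")
        for needed in ("CDS", "exon"):
            if needed not in children.get(mrna, set()):
                problems.append(f"mRNA {mrna} has no {needed}")

    for parent in children:
        if parent not in mrna_parent:
            problems.append(f"Parent {parent} is not an mRNA in the file")
    return problems
```

Called on the contents of a focal GFF, an empty list means every mRNA has a gene parent plus at least one CDS and one exon. A CDS-only ORF file, as discussed in this issue, would be reported as CDS rows with no `Parent`.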
Alternatively, we need to figure out how to add potential genes to the GFF. That's what I was suggesting to Jing and Urminder.
Eve Syrkin Wurtele
I used the original annotated gene model when I ran `make load`. Then I replaced files in the input folders. The `search.gff` was replaced by the unannotated GFF, which is the same as the GFF produced in the `orf-gff` folder; the DNA sequences in `gene` were replaced by the unannotated DNA sequences; and the protein sequences in `faa` were replaced by the same faa as in `orf-faa`. Only the file in `trans-orf` didn't change. After that I ran `4_get-search-intervals.sh` again to get new search intervals (SI) from the `search.gff`. I think fagin doesn't need the full gene model in `make run`, does it?
Oh, adding new genes is easy. As far as Fagin is concerned, the genes in the focal species are exactly what is described in the GFF file. You just have to include the mRNA model as well as the CDS. The problem with the ngORFs is that they have a CDS but no mRNA model, so only part of the Fagin analysis can be run.
But the Fagin analysis places each query gene in its context in the target genome. So why can't the CDS and the mRNA model be considered identical?
@evewurtele @lijing28101 Sorry for the late reply. I'm going back through old issues.
The practical problem here concerns GFF handling. If `fagin` sees a CDS in a GFF without a link to a parent mRNA, it dies. It doesn't know what the name of the gene is. Also, when `fagin` builds transcripts, it requires exon input. So each ngORF needs a CDS, an exon, and an mRNA. This seems pretty heavy since all three of these features store the same interval.
A deeper problem is that it is not really correct to just make up exons and mRNAs to go along with the CDS. For ngORFs, we don't know whether they are transcribed, let alone the structure of the gene model, so we shouldn't pretend that we do. There is actually a SOFA term that represents ngORFs: `reading_frame` (see the Sequence Ontology for a list of SOFA terms).
There are three solutions:
1. I could program `fagin` to infer an ID for a CDS with no link to a parent.
2. I could add handling of `reading_frame` GFF types.
3. The user could format the GFF, adding an mRNA row and an exon row for each ORF.
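The third option, done on the user side, can be sketched as follows. This is only an illustration under stated assumptions: the input is tab-separated GFF3 in which each ORF is a single CDS row with no `Parent` attribute, the function name `add_placeholder_models` and the `.mRNA` / `.exon` ID suffixes are hypothetical, and whether a `gene` row is also required is not settled in this thread. As noted above, the added mRNA and exon rows are placeholders that copy the CDS interval, not evidence of transcription.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def add_placeholder_models(gff_text):
    """Give every parentless CDS row a placeholder mRNA row and exon row.

    Rows that are not parentless CDS features are passed through unchanged.
    """
    out = []
    counter = 0
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(line)
            continue
        attrs = parse_attributes(cols[8])
        if "Parent" in attrs:
            out.append(line)
            continue

        counter += 1
        orf_id = attrs.get("ID", f"orf_{counter}")  # hypothetical fallback ID
        mrna_id = f"{orf_id}.mRNA"                  # hypothetical naming scheme

        # The mRNA and exon reuse the CDS seqid, source, start, end, score,
        # and strand; phase only applies to CDS rows, so it is "." here.
        shared = cols[3:7]
        mrna = cols[:2] + ["mRNA"] + shared + [".", f"ID={mrna_id}"]
        exon = cols[:2] + ["exon"] + shared + [
            ".", f"ID={mrna_id}.exon;Parent={mrna_id}"
        ]

        # Keep the original CDS attributes and link the CDS to the new mRNA.
        cds_attrs = {**attrs, "ID": orf_id, "Parent": mrna_id}
        cds = cols[:8] + [";".join(f"{k}={v}" for k, v in cds_attrs.items())]

        out.extend("\t".join(row) for row in (mrna, exon, cds))
    return "\n".join(out)
```

Each ORF becomes three rows sharing one interval, which is the redundancy described above; options 1 and 2 would avoid it but need changes inside `fagin`.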
Hi @arendsee, I'm trying to use ORFs of at least 30 nt as unannotated ORFs for the focal species. I used the GFF from `2d_get_all_orfs.sh` in place of `search.gff`. I got an orphan list by searching the unannotated proteins against the annotated and unannotated proteins of the other species. When I ran fagin, it failed in `make run`. I think that in `make run`, fagin just needs the orphan list, the protein sequences of the orphans, and the focal vs other species.map.tab files related to the focal species. I changed all input files of the focal species except the genomic sequence, and I also used the new `search.gff` to get the search intervals. I think it should work in theory. Do you know why it failed? I also posted another issue before: when I add a new species and change the orphan list, fagin can fail. I'm not sure whether the reason for that failure is the same as for this one.