arendsee / fagin

Classify genes using a syntenic filter
GNU General Public License v3.0

Warning messages for protein name #8

Closed lijing28101 closed 7 years ago

lijing28101 commented 7 years ago

There are several warning messages in the report about the protein name:

## Warning in force_aa_agreement(gff = gff, aa = aa): 25 proteins are not represented in the GFF file. If this number is
## small, maybe there is just somehting odd about your GFF file. If it is
## big, or equal to the total number of proteins, then something is very
## wrong. These entries will be deleted from the protein file: AI1, AI2, AI3, AI4, AI5_ALPHA,
## AI5_BETA, ATP6, ATP8, BI2, BI3, BI4, COB, COX1, COX2, COX3, CRS5, OLI1, Q0255, SCEI, SDL1,
## VAR1, FLO8, YDR134C, YIR044C, YOL153C

I have checked these proteins. They are all in the GFF file, and each one is the parent of a CDS.

## Warning in check_gene_aa_agreement(genes = genes, aa = aa): Protein names do not match gene model names. This probably means you are
## not bypassing the input methods I wrote (e.g. 2_extract_fasta.sh). Not
## cool.

There is no error when I run make load, and all the protein names in the faa files match the parents of the CDS entries in the GFF file in the input folder.

## Warning in x(...): All genes listed in the query gene file (probably ’orphan-list.txt’)
## must be among the proteins present in ’/home/jingli/Desktop/fagin new/fagin/input/faa/Saccharomyce
## but they aren’t.

The genes in orphan-list.txt have names that match the parents of the CDS entries. I have checked several of them and found no problems.

arendsee commented 7 years ago

I've looked into this specific problem. The problem is that the GFF is inconsistent. For most of the genes, the CDS is the child of an mRNA, and the mRNA is the child of a gene.

For example:

NC_001133.9  RefSeq  gene  1807  2169  .  -  .  ID=PAU8;Parent=-
NC_001133.9  RefSeq  mRNA  1807  2169  .  -  .  ID=NM_001180043.1;Parent=PAU8
NC_001133.9  RefSeq  CDS   1807  2169  .  -  0  ID=NP_009332.1;Parent=NM_001180043.1
NC_001133.9  RefSeq  gene  2480  2707  .  +  .  ID=YAL067W-A;Parent=-
NC_001133.9  RefSeq  mRNA  2480  2707  .  +  .  ID=NM_001184582.1;Parent=YAL067W-A
NC_001133.9  RefSeq  CDS   2480  2707  .  +  0  ID=NP_878038.1;Parent=NM_001184582.1

However, for a few entries, the GFF lists the parent of the CDS as a gene, for example:

NC_001224.1  RefSeq  gene  79213  80022  .  +  .  ID=COX3;Parent=-
NC_001224.1  RefSeq  CDS   79213  80022  .  +  0  ID=NP_009328.1;Parent=COX3

I can't really fix this problem in a consistent way. Fagin requires mRNA information and these entries don't provide that. There are no exons listed. I think the best option is to simply ignore these genes, which is what Fagin does. I'll work on adding a better error message.

arendsee commented 7 years ago

Also, this problem should be caught in the preprocessing stage. I may need to add a new check to gff-parser.py.
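
As a rough illustration only (this is not the actual gff-parser.py code), a check of this kind could be sketched in Python as follows. The function names `parse_attributes` and `find_cds_without_mrna` are hypothetical, and the sketch assumes tab-separated, nine-column GFF lines carrying `ID=` and `Parent=` attributes like the entries quoted earlier in this thread.

```python
def parse_attributes(field):
    """Turn an attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_cds_without_mrna(lines):
    """Return (cds_id, parent_id, parent_type) for every CDS whose parent is not an mRNA.

    Hypothetical helper: assumes each feature line has nine tab-separated
    columns and that parents are referenced by their ID attribute.
    """
    feature_type = {}  # feature ID -> feature type (column 3)
    cds_parents = []   # (CDS ID, Parent ID) pairs in file order

    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip anything that is not a nine-column feature line.
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            feature_type[attrs["ID"]] = cols[2]
        if cols[2] == "CDS":
            cds_parents.append((attrs.get("ID"), attrs.get("Parent")))

    # Resolve parents only after the whole file is read, so that a parent
    # declared after its child is still found.
    return [
        (cds_id, parent_id, feature_type.get(parent_id))
        for cds_id, parent_id in cds_parents
        if feature_type.get(parent_id) != "mRNA"
    ]
```

Passing the lines of a GFF file to `find_cds_without_mrna` would list entries like the COX3 case, where the reported parent type is `gene`. A parent type of `None` would mean the parent ID was not found in the file at all.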

arendsee commented 7 years ago

@lijing28101 The new version of fagin handles this case.