Closed lijing28101 closed 7 years ago
I've looked into this specific problem. The problem is that the GFF is inconsistent. For most of the genes, the CDS is the child of an mRNA, and the mRNA is the child of a gene.
For example:
NC_001133.9 RefSeq gene 1807 2169 . - . ID=PAU8;Parent=-
NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=NM_001180043.1;Parent=PAU8
NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=NP_009332.1;Parent=NM_001180043.1
NC_001133.9 RefSeq gene 2480 2707 . + . ID=YAL067W-A;Parent=-
NC_001133.9 RefSeq mRNA 2480 2707 . + . ID=NM_001184582.1;Parent=YAL067W-A
NC_001133.9 RefSeq CDS 2480 2707 . + 0 ID=NP_878038.1;Parent=NM_001184582.1
However, for a few entries, the GFF lists the parent of the CDS as a gene, for example:
NC_001224.1 RefSeq gene 79213 80022 . + . ID=COX3;Parent=-
NC_001224.1 RefSeq CDS 79213 80022 . + 0 ID=NP_009328.1;Parent=COX3
I can't really fix this problem in a consistent way. Fagin requires mRNA information and these entries don't provide that. There are no exons listed. I think the best option is to simply ignore these genes, which is what Fagin does. I'll work on adding a better error message.
Also, this problem should be caught in the preprocessing stage. I may need to add a new check to gff-parser.py.
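For reference, the kind of check described above could look roughly like the sketch below. It is not the actual gff-parser.py code: the function names (parse_attributes, find_cds_without_mrna) are made up for illustration, and it assumes each CDS has a single Parent and that columns are separated by tabs (falling back to whitespace, as in the pasted lines above).

```python
def parse_attributes(field):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_cds_without_mrna(lines):
    """Return (cds_id, parent_id, parent_type) for every CDS whose parent is not an mRNA.

    Hypothetical helper: parent_type is None when the parent ID never appears in the file.
    """
    feature_type = {}  # feature ID -> feature type (column 3)
    cds_parents = []   # (CDS ID, Parent ID) in file order
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        # GFF is tab-separated; fall back to whitespace for pasted text
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 9:
            continue
        attrs = parse_attributes(fields[8])
        if "ID" in attrs:
            feature_type[attrs["ID"]] = fields[2]
        if fields[2] == "CDS":
            cds_parents.append((attrs.get("ID"), attrs.get("Parent")))

    problems = []
    for cds_id, parent in cds_parents:
        parent_type = feature_type.get(parent)
        if parent_type != "mRNA":
            problems.append((cds_id, parent, parent_type))
    return problems


# The two kinds of entries quoted earlier in this thread
example = [
    "NC_001133.9 RefSeq gene 1807 2169 . - . ID=PAU8;Parent=-",
    "NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=NM_001180043.1;Parent=PAU8",
    "NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=NP_009332.1;Parent=NM_001180043.1",
    "NC_001224.1 RefSeq gene 79213 80022 . + . ID=COX3;Parent=-",
    "NC_001224.1 RefSeq CDS 79213 80022 . + 0 ID=NP_009328.1;Parent=COX3",
]
for cds_id, parent, parent_type in find_cds_without_mrna(example):
    print(f"warning: CDS {cds_id} has parent {parent} of type {parent_type}, expected mRNA")
```

On the example lines, only the COX3 entry is reported, since its CDS points directly at a gene. Feature types are collected in a first pass so the check does not depend on parents appearing before their children in the file.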
@lijing28101 The new version of fagin handles this case.
There are several warning messages in the report about the protein name:
I have checked these proteins. They are all in the GFF file, and each one is the parent of a CDS.
There is no error when I run make load, and all the protein names in the faa files match the parent of a CDS in the GFF file in the input folder. The genes in orphan-list also have names that match the parent of a CDS. I have checked several of them and found no problems.
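A manual check like this can also be scripted. The sketch below is only an illustration of the comparison, not part of fagin: the function names and the name not_in_gff are hypothetical, and it assumes the protein name is the first token of each FASTA header and that each CDS has a single Parent.

```python
def fasta_names(fasta_lines):
    """Collect the first whitespace-delimited token of each FASTA header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def cds_parent_ids(gff_lines):
    """Collect the Parent attribute of every CDS row in a GFF file."""
    parents = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # GFF is tab-separated; fall back to whitespace for pasted text
        fields = line.rstrip("\n").split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        for pair in fields[8].split(";"):
            key, _, value = pair.partition("=")
            if key.strip() == "Parent":
                parents.add(value.strip())
    return parents


def proteins_missing_from_gff(fasta_lines, gff_lines):
    """Return, sorted, the FASTA names that are not the parent of any CDS."""
    return sorted(fasta_names(fasta_lines) - cds_parent_ids(gff_lines))


# Small illustration; "not_in_gff" is a made-up name, the sequences are placeholders
sample_gff = [
    "NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=NP_009332.1;Parent=NM_001180043.1",
    "NC_001224.1 RefSeq CDS 79213 80022 . + 0 ID=NP_009328.1;Parent=COX3",
]
sample_faa = [">NM_001180043.1", "MVKL", ">COX3", "MTHL", ">not_in_gff", "MAAA"]
print(proteins_missing_from_gff(sample_faa, sample_gff))
```

An empty result would mean every protein name in the faa file is the parent of some CDS, which is what the manual check above found.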