All of this should be fine. The problem is that when a gene has no common name (a common name being something like mex-3), the gene column of this GFF gives the transcript name instead of the gene name.
My quick fix for this is to trim the transcript name down to the gene name.
According to WormBase nomenclature:
The CDS transcript name is "derived from the same Sequence Name as their parent Gene object, so the gene F38H4.7 has a CDS called F38H4.7." Isoforms of a transcript are denoted like this:
"The gene bli-4 has 10 known CDS isoforms, called K04F10.4a, K04F10.4b, K04F10.4c, K04F10.4d, K04F10.4e, K04F10.4f, K04F10.4g, K04F10.4h, K04F10.4i, and K04F10.4j."
Here's the fix and some test data:
gene chr strand txstart txend wbgene gene_name biotype type
Y48G1C.12.1 I + 47461 49860 WBGene00044345 Y48G1C.12.1 protein_coding Transcript
Y48G1C.4a.1 I + 49921 54426 WBGene00021677 pgs-1 protein_coding Transcript
Y48G1C.4b.1 I + 52370 54360 WBGene00021677 pgs-1 protein_coding Transcript
Y48G1C.5.1 I - 55293 64066 WBGene00021678 Y48G1C.5.1 protein_coding Transcript
W05F2.4f.1 I - 3375099 3402811 WBGene00021036 W05F2.4f.1 protein_coding Transcript
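A minimal Python sketch of the trimming step, run against the test rows above. This is not necessarily how the original fix is written: the function names (sequence_name, fix_gene_name) and the regular expression are assumptions made for illustration. The pattern assumes names shaped like <clone>.<number>, optionally followed by a single isoform letter and a .<number> transcript suffix; anything that does not fit is left unchanged.

```python
import re

# Assumed name shape: <clone>.<number>[isoform letter][.<transcript number>]
# e.g. Y48G1C.4a.1 -> clone "Y48G1C", number "4", isoform "a", transcript "1"
TRANSCRIPT_RE = re.compile(r"^([A-Za-z0-9_]+\.\d+)[a-z]?(?:\.\d+)?$")


def sequence_name(transcript):
    """Trim a transcript name down to the gene's sequence name.

    Names that do not match the assumed pattern are returned unchanged.
    """
    match = TRANSCRIPT_RE.match(transcript)
    return match.group(1) if match else transcript


def fix_gene_name(transcript, gene_name):
    """Replace gene_name only when it is just a copy of the transcript name."""
    if gene_name == transcript:
        return sequence_name(transcript)
    return gene_name  # a real common name such as pgs-1 is kept as-is


# (gene, gene_name) pairs taken from the test data above
rows = [
    ("Y48G1C.12.1", "Y48G1C.12.1"),
    ("Y48G1C.4a.1", "pgs-1"),
    ("Y48G1C.4b.1", "pgs-1"),
    ("Y48G1C.5.1", "Y48G1C.5.1"),
    ("W05F2.4f.1", "W05F2.4f.1"),
]

fixed = [fix_gene_name(transcript, gene_name) for transcript, gene_name in rows]

for (transcript, old_name), new_name in zip(rows, fixed):
    print(transcript, old_name, new_name, sep="\t")
```

Rows that already carry a common name (pgs-1) pass through untouched, so the trimming only affects rows where the gene_name column is a copy of the transcript name.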
The process
make_flat_file
takes a modified GFF for each species from the species reference dir, e.g. c_elegans.gff. The gene col of that file actually contains the transcript name.
This file is joined to an early version of the flat file.