KamilSJaron closed 5 years ago
Alternatively, or as a complement, we could also get them from the MUMmer output: palindromes at the nucleotide level.
This analysis is done for all the species with an assembly and annotation.
Yet, the outputs need to be parsed.
I found that a lot of the annotation files do not contain gene annotations.
I am not sure why the _genomic.gff.gz file sometimes contains only contigs and sometimes gene annotation. Also, some genomes have a separate ftp with the annotation in a /genomes/.gff3 file. What makes these genomes special, and why is there more than one way to store the annotation? No idea, but I have to curate the files and check that each annotation file actually contains what we need.
Examples:
gff = contigs: Lcla1
gff = contigs + genes: Dcor1
separate gff3: Obir1
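A quick way to see which of these cases a given file falls into is to tally the feature types in column 3. The following is only a sketch: the function name `count_gff_features` is made up for illustration, and it assumes a gzipped, tab-separated GFF where contig-level records appear under a type such as `region` while gene annotation shows up as `gene`.

```shell
# Hypothetical helper (not part of the repository): tally the feature types
# found in column 3 of a gzipped GFF file, one "type<TAB>count" line per type.
count_gff_features() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }      # skip comments, count types
                    END { for (t in n) print t "\t" n[t] }' |
        sort
}
```

Called as `count_gff_features some_genomic.gff.gz`, a file with no `gene` line in the output would need a different annotation source.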
Species with (some sort of) result: Avag1 Dcor1 Fcan1 Lcla1
For the rest: TODO get protein sequences
Pvir1 annotation added; it may be incompatible with the NCBI annotation.
This is a mess. Most of the genomes have no annotation in NCBI, and the scaffold IDs of the NCBI genomes do not correspond to those in the annotations that sit somewhere on ftp servers.
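To find out whether a genome and an annotation use the same scaffold IDs, the two sets of names can be compared directly. This is a minimal sketch, assuming an unzipped FASTA and an unzipped nine-column GFF; the function name `missing_scaffolds` is hypothetical.

```shell
# Hypothetical helper (not part of the repository): print the sequence IDs
# that are used in the GFF (argument 2) but absent from the genome FASTA
# (argument 1). Empty output means every annotated scaffold exists.
missing_scaffolds() {
    genome_ids=$(mktemp)
    gff_ids=$(mktemp)
    # FASTA IDs: header text after ">" up to the first whitespace
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$genome_ids"
    # GFF IDs: column 1 of non-comment feature lines
    awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$2" | sort -u > "$gff_ids"
    # comm -13 keeps only the lines unique to the second file (the GFF IDs)
    comm -13 "$genome_ids" "$gff_ids"
    rm -f "$genome_ids" "$gff_ids"
}
```

For example, `missing_scaffolds genome.fa annotation.gff | head` would show the first few annotated scaffolds that have no sequence in the assembly.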
To finish this task we would have to:
gffread -g $GENOME_UNZIPED -y $PROTEINS $GFF_UNZIPED
where GENOME_UNZIPED is the unzipped genome sequence, GFF_UNZIPED is the unzipped annotation (I suppose it will work with gtf as well), and PROTEINS is the name of the output protein file. If the output file is empty, it means that the scaffold names in the annotation and the genome do not match, or that the annotation does not contain gene annotations.
scf# gene starting_position ending_position
Some examples of how to reformat a gff3 file with awk are in scripts/prepare_data_for_MCScanX.sh. Note that I originally wrote these scripts for all the genomes, but then I figured out that they are not generally usable, so they are more of a "collection of copy-paste commands".
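For illustration, one possible awk reformatting into the four columns named above could look like the sketch below. It is not taken from scripts/prepare_data_for_MCScanX.sh; the function name `gff3_to_four_columns` is hypothetical, and it assumes that genes are stored as `gene` features with an `ID=` attribute in column 9, which does not hold for every annotation in this project.

```shell
# Hypothetical helper (not from the repository): print scaffold, gene ID,
# start, and end (tab-separated) for every "gene" feature of a GFF3 file.
gff3_to_four_columns() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^#/ { next }                        # skip comment and directive lines
        $3 == "gene" {
            id = ""
            n = split($9, attrs, ";")        # column 9 holds key=value pairs
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) id = substr(attrs[i], 4)
            if (id != "") print $1, id, $4, $5
        }' "$1"
}
```

The gene IDs printed here would still have to match the protein names produced by gffread, so the attribute used (`ID`, `Name`, or something else) may need adjusting per genome.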
Missing annotations:
Anan1 - done
Mare3 - does not really matter since we have annotations for Mare1 and 2, but still. This assembly should be nicer; it's the latest.
Dpul1 - sexual reference, annotation is sexual annotation
Aruf1 - annotation does not exist
Anan1 - Philipp gave me this dropbox link to the GFF https://www.dropbox.com/s/nrx6ccq3d3eaepd/Acrobeloides_nanus_v1.gff3.gz?dl=0
Mare3 - he said the annotation should be on WormBase within the next 3 weeks. He said it is better if we get it from there, because they reformat it and so on.
MCScanX is done for all but three mentioned above (Anan1 is done).
TODO: the last one is Mare3
c89fdd0a82d7eb80e58518f766321c674dadcc5a