RobertsLab / resources

https://robertslab.github.io/resources/

Annotate Acropora genome #715

Closed sr320 closed 4 years ago

sr320 commented 5 years ago

Supposedly the GFF at https://www.ncbi.nlm.nih.gov/genome/?term=txid70779%5bOrganism:noexp is broken.

When Mox (or GenSAS?) is available, it would be nice to get an annotation going.

kubu4 commented 5 years ago

GFF is definitely jacked.

Putting this here to find more easily - link to genome FastA file: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/222/465/GCA_000222465.2_Adig_1.1/GCA_000222465.2_Adig_1.1_genomic.fna.gz

kubu4 commented 5 years ago

Turns out, a GFF can be downloaded from that page you linked above, and, more importantly, that GFF is good!

[Screenshot of the NCBI genome page showing the GFF download link]

sr320 commented 5 years ago

The coral people say it is not good / accurate. I can inquire why they say that.

On Sep 17, 2019, 10:00 AM -0700, kubu4 notifications@github.com, wrote:

Turns out, a GFF can be downloaded from that page you linked above, and, more importantly, that GFF is good!

sr320 commented 5 years ago

feedback:

I couldn’t find transposable elements for example. The original publication is from 2011, and although there is a more recent assembly on NCBI, there isn’t a publication linked to it. In addition, the GFF file on NCBI is broken and the authors do not respond to requests to provide an updated file.

kubu4 commented 5 years ago

Regarding the broken link, they're correct if they use the FTP link I posted above. However, if you use the link that's on the page (from the screenshot), it's a proper GFF.

Transposable elements are not present in the GFF. This seems to be standard procedure (based on our experience with MAKER and GenSAS), so it's not surprising. TEs can be found in the RepeatMasker output file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/222/465/GCA_000222465.2_Adig_1.1/GCA_000222465.2_Adig_1.1_rm.out.gz

kubu4 commented 5 years ago

Note, the RepeatMasker output file is not a GFF, but has all the genomic coordinates needed.
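If a GFF of the repeats is needed, the coordinates can be pulled out of the RepeatMasker output with a short script. The following is only a rough sketch, assuming the usual whitespace-delimited `.out` layout (score, three percentages, query sequence, begin, end, left, strand, repeat name, class/family, ...). The function names, the `repeat_region` feature type, and the attribute keys are my own choices for illustration, not something taken from the NCBI files.

```python
import gzip


def rm_out_to_gff3(lines, source="RepeatMasker"):
    """Yield GFF3-style lines from the rows of a RepeatMasker .out file."""
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; this skips header and blank lines.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        score, seqid, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker marks the complement strand with "C".
        strand = "-" if fields[8] == "C" else "+"
        # Attribute keys are illustrative (assumed), not a fixed standard.
        attributes = f"Name={fields[9]};repeat_class={fields[10]}"
        yield "\t".join(
            [seqid, source, "repeat_region", start, end, score, strand, ".", attributes]
        )


def convert_file(rm_out_gz_path):
    """Read a gzipped .out file and return its rows as a list of GFF3 lines."""
    with gzip.open(rm_out_gz_path, "rt") as handle:
        return list(rm_out_to_gff3(handle))
```

Calling `convert_file` on a local copy of the `_rm.out.gz` file should give one line per repeat hit. Repeat names containing characters reserved in GFF3 attributes (`;`, `=`, `,`) would need escaping, which this sketch does not do.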

EDITED: Fixed wording.

sr320 commented 5 years ago

I was reading "broken" as inaccurate, not literally broken link. I will inquire.

kubu4 commented 5 years ago

To summarize my "findings":

  1. GFF from this directory is broken (i.e. has a file, but file is screwed up):

    • ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/222/465/GCA_000222465.2_Adig_1.1/
  2. GFF from screenshot is good:

    • ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/222/465/GCF_000222465.1_Adig_1.1/GCF_000222465.1_Adig_1.1_genomic.gff.gz
  3. Transposable elements are in this file:

    • ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/222/465/GCF_000222465.1_Adig_1.1/GCF_000222465.1_Adig_1.1_rm.out.gz

I haven't taken the time to figure out the difference between the "GCA" and "GCF" versions. However, the assembly_status.txt file in ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/222/465/GCF_000222465.1_Adig_1.1/ has an updated timestamp of 9/16/2019 and indicates this is the "latest" assembly. Additionally, the links on the genome info page point to the "GCF" versions of the files, which suggests that these are the canonical files. So, I'd use them instead of the "GCA" versions.
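For anyone wanting to check a downloaded GFF themselves, a basic sanity check along these lines might help. This is a minimal sketch, assuming "broken" means structural problems (wrong column count, bad coordinates, bad strand); I haven't characterized what is actually wrong with the "GCA" file, and the function name is hypothetical.

```python
def check_gff_line(line):
    """Return a list of problems found in one GFF line (empty list if none)."""
    # Blank lines and comment/directive lines carry no feature to check.
    if not line.strip() or line.startswith("#"):
        return []
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    problems = []
    start, end, strand = cols[3], cols[4], cols[6]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start/end are not integers")
    elif int(start) > int(end):
        problems.append("start is greater than end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"invalid strand {strand!r}")
    return problems
```

Running this over every line of a decompressed GFF and printing any non-empty results would flag malformed rows; it does not say anything about whether the annotations themselves are accurate.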

EDITED: Improve readability.

github-actions[bot] commented 4 years ago

Stale issue message