davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
709 stars 189 forks source link

Sequence IDs in single copy orthologs from RefSeq cds data #935

Open marctollis opened 2 weeks ago

marctollis commented 2 weeks ago

I love Orthofinder, it has been so helpful in our research in so many ways.

I am having one particular issue where I am getting non-informative sequence IDs in all the fastas in the Single_Copy_Orthologue_Sequences output directory. Ideally, especially for coding sequences, these sequence IDs would preserve the species name reflective of the filenames of the input sequences, making it extremely useful for downstream analysis (phylogenetics, tests for selection, etc). My data is downloaded using the ncbi_datasets command-line tool. I am using Orthofinder on proteins and coding sequences in parallel.

I first downloaded the proteomes for a sample of species, and the sequence IDs in the downloaded data look like this:

XP_053544643.1 thymidine kinase 2, mitochondrial isoform X2 [Bombina bombina]

After running primary_transcript.py, the sequence header in the out put files look like this:

XP_053544644.1 copine-7 [Bombina bombina]

And in the Single_Copy_Orthologue_Sequences directory, sequence IDs look like this (this is not the same gene as above):

XP_053545752.1

It preserves the protein ID but not the species name. I suppose I could use a table to switch out all the names for species_genus names, but is there any advice on how to modify the code to include the species name?

I then downloaded the cds data for the same species. The problem is worse for cds data downloaded from NCBI. Here is an example of a sequence ID from the downloaded cds data:

lcl|NC_069499.1_cds_XP_053552928.1_1 [gene=LIN52] [db_xref=GeneID:128644283] [protein=protein lin-52 homolog] [protein_id=XP_053552928.1] [location=complement(join(240756..240811,425704..425787,498199..498265,498514..498551,565189..565263,565378..565516))] [gbkey=CDS]

After running primary_transcript.py, the sequence headers in the output files look like this:

LIN52]

And in the Single_Copy_Orthologue_Sequences directory, sequence IDs look like this (this is not the same gene as above):

CAMK2A] CAMK2A]

You can see all sequence IDs in this single copy ortholog file have the gene name but are noninformative as far as species name. Any advice on how to get these IDs labeled properly (ideally including both gene and species)?

lauriebelch commented 1 week ago

Hi marctollis,

To get species_genus names alongside the gene names in the sequence files I think your best bit is to just pull the species names from the orthogroups file

For cds data I would look into the script that we provide for getting data from NCBI. It is designed to extract primary transcripts from proteomes, but could in theory be modified to do the same for CDS https://github.com/davidemms/OrthoFinder/issues/930