Sequence IDs in single copy orthologs from RefSeq cds data

I love Orthofinder, it has been so helpful in our research in so many ways.

I am having one particular issue where I am getting non-informative sequence IDs in all the fastas in the Single_Copy_Orthologue_Sequences output directory. Ideally, especially for coding sequences, these sequence IDs would preserve the species name reflective of the filenames of the input sequences, making it extremely useful for downstream analysis (phylogenetics, tests for selection, etc). My data is downloaded using the ncbi_datasets command-line tool. I am using Orthofinder on proteins and coding sequences in parallel.

I first downloaded the proteomes for a sample of species, and the sequence IDs in the downloaded data look like this:

XP_053544643.1 thymidine kinase 2, mitochondrial isoform X2 [Bombina bombina]

After running primary_transcript.py, the sequence header in the out put files look like this:

XP_053544644.1 copine-7 [Bombina bombina]

And in the Single_Copy_Orthologue_Sequences directory, sequence IDs look like this (this is not the same gene as above):

XP_053545752.1

It preserves the protein ID but not the species name. I suppose I could use a table to switch out all the names for species_genus names, but is there any advice on how to modify the code to include the species name?

I then downloaded the cds data for the same species. The problem is worse for cds data downloaded from NCBI. Here is an example of a sequence ID from the downloaded cds data:

lcl|NC_069499.1_cds_XP_053552928.1_1 [gene=LIN52] [db_xref=GeneID:128644283] [protein=protein lin-52 homolog] [protein_id=XP_053552928.1] [location=complement(join(240756..240811,425704..425787,498199..498265,498514..498551,565189..565263,565378..565516))] [gbkey=CDS]

After running primary_transcript.py, the sequence headers in the output files look like this:

LIN52]

And in the Single_Copy_Orthologue_Sequences directory, sequence IDs look like this (this is not the same gene as above):

CAMK2A] CAMK2A]

You can see all sequence IDs in this single copy ortholog file have the gene name but are noninformative as far as species name. Any advice on how to get these IDs labeled properly (ideally including both gene and species)?

davidemms / OrthoFinder

Sequence IDs in single copy orthologs from RefSeq cds data #935