davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
683 stars 186 forks source link

primary_transcript.py does not return one isoform per gene #761

Open tsackton opened 1 year ago

tsackton commented 1 year ago

Hello,

I have been investigating some odd results for an OrthoFinder run using NCBI data and new annotations from about 15 species of ducks and other fowl, and I think that I have found a bug in the primary_transcript.py script, specifically the logic used to decide if two proteins are isoforms of the same gene or not in NCBI data.

Looking at Anas platyrhynchos (assembly GCF_003850225.1 and the corresponding NCBI annotation), I see that the primary_transcript.py script returns 18,619 proteins, which should be one isoform per gene, for input to OrthoFinder. However, that particular annotation only has 15,747 genes according to the NCBI annotation report.

Digging further, I can find examples where there are multiple isoforms of the same gene that are not labeled as "isoform X" or "isoform " in the GFF file. Here is an example for the LHFPL2 gene:

NC_040075.1 gene    ID=gene-LHFPL2;
NC_040075.1 mRNA    ID=rna-XM_027447126.1;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_027302927.1;Parent=rna-XM_027447126.1;
NC_040075.1 CDS ID=cds-XP_027302927.1;Parent=rna-XM_027447126.1;
NC_040075.1 mRNA    ID=rna-XM_027447123.1;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_027302924.1;Parent=rna-XM_027447123.1;
NC_040075.1 CDS ID=cds-XP_027302924.1;Parent=rna-XM_027447123.1;
NC_040075.1 mRNA    ID=rna-XM_027447124.1;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_027302925.1;Parent=rna-XM_027447124.1;
NC_040075.1 CDS ID=cds-XP_027302925.1;Parent=rna-XM_027447124.1;
NC_040075.1 mRNA    ID=rna-XM_027447125.1;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_027302926.1;Parent=rna-XM_027447125.1;
NC_040075.1 CDS ID=cds-XP_027302926.1;Parent=rna-XM_027447125.1;
NC_040075.1 mRNA    ID=rna-XM_005008960.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009017.1;Parent=rna-XM_005008960.4;
NC_040075.1 CDS ID=cds-XP_005009017.1;Parent=rna-XM_005008960.4;
NC_040075.1 mRNA    ID=rna-XM_005008968.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009025.1;Parent=rna-XM_005008968.4;
NC_040075.1 CDS ID=cds-XP_005009025.1;Parent=rna-XM_005008968.4;
NC_040075.1 mRNA    ID=rna-XM_013105168.3;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_012960622.1;Parent=rna-XM_013105168.3;
NC_040075.1 CDS ID=cds-XP_012960622.1;Parent=rna-XM_013105168.3;
NC_040075.1 mRNA    ID=rna-XM_005008962.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009019.1;Parent=rna-XM_005008962.4;
NC_040075.1 CDS ID=cds-XP_005009019.1;Parent=rna-XM_005008962.4;
NC_040075.1 mRNA    ID=rna-XM_005008966.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009023.1;Parent=rna-XM_005008966.4;
NC_040075.1 CDS ID=cds-XP_005009023.1;Parent=rna-XM_005008966.4;
NC_040075.1 mRNA    ID=rna-XM_005008961.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009018.1;Parent=rna-XM_005008961.4;
NC_040075.1 CDS ID=cds-XP_005009018.1;Parent=rna-XM_005008961.4;
NC_040075.1 mRNA    ID=rna-XM_005008963.4;Parent=gene-LHFPL2;
NC_040075.1 CDS ID=cds-XP_005009020.1;Parent=rna-XM_005008963.4;
NC_040075.1 CDS ID=cds-XP_005009020.1;Parent=rna-XM_005008963.4;

For clarity I have only included part of the GFF. But if we look up those XP ids, we see that at least some of them are not indicated as isoforms, e.g.:

>XP_005009020.1 LHFPL tetraspan subfamily member 2 protein [Anas platyrhynchos]
>XP_005009018.1 LHFPL tetraspan subfamily member 2 protein [Anas platyrhynchos]

Based on the GFF file, these two proteins are clearly isoforms of the same gene, but the primary_transcript.py script treats them as separate genes, since they do not have 'isoform' in their name.

There does not seem to be an obvious solution to this based only on the information provided in the fasta header for protein files downloaded from NCBI; I suspect GFF parsing is the only reliable approach to accurately identify the longest protein from NCBI annotations.

SheepwormJM commented 1 year ago

Hi,

I too have found a similar thing. I haven't been using a gff file for the primary_transcripts.py though.

My code: for f in *fa ; do python /Orthofinder/OrthoFinder/tools/primary_transcript.py $f ; done

For one species it worked fine. For this species, each protein sequence had the following (or very similar) header: >HCON_00022890-00001 transcript=HCON_00022890-00001 gene=HCON_00022890 Secondary (or more) transcripts were denoted by -00002 etc. But otherwise all headers are the same.

It selected the same number of genes as I would expect to have primary transcripts.

For another species, C. elegans, it didn't work as well. It retained 25803 genes out of 28447 'accesssions', while I found only 19985 genes. The headers for this file were different, and I think they are causing an issue. I have a feeling (only really looked at the first page), that those with the header the same as the first two lines are processed correctly, and the rest may not be.

Initial file: `

2L52.1a wormpep=CE32090 gene=WBGene00007063 status=Confirmed uniprot=A4F336 insdc=CCD61130.1 2L52.1b wormpep=CE50569 gene=WBGene00007063 status=Confirmed uniprot=A0A0K3AWR5 insdc=CTQ86426.1 2RSSE.1a wormpep=CE32785 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A4F337 insdc=CCD61138.1 product="Rho-GAP domain-containing protein" 2RSSE.1b wormpep=CE48524 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=U4PAK0 insdc=CDH92922.1 product="Rho-GAP domain-containing protein" 2RSSE.1c wormpep=CE52653 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A0A2X0T1Z3 insdc=SPS41576.1 product="Rho-GAP domain-containing protein" 3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed uniprot=G5EFG7 insdc=CAA21777.2 product="POT1PC domain-containing protein" `

Primary transcript file: `

WBGene00007063 2RSSE.1a wormpep=CE32785 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A4F337 insdc=CCD61138.1 product="Rho-GAP domain-containing protein" 2RSSE.1b wormpep=CE48524 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=U4PAK0 insdc=CDH92922.1 product="Rho-GAP domain-containing protein" 2RSSE.1c wormpep=CE52653 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A0A2X0T1Z3 insdc=SPS41576.1 product="Rho-GAP domain-containing protein" 3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed uniprot=G5EFG7 insdc=CAA21777.2 product="POT1PC domain-containing protein" ` Does this help?

Version is the latest (2.5.4).

All the best, Jenni

SheepwormJM commented 1 year ago

Today I changed the headers and re-ran the primary_transcripts.py. I now have the correct number of genes I would expect from each species. I just used sed to remove anything after the gene name in the header. Not all files had the same kind of header, but within each file all had the transcript ID and the gene=xxx, and all headers within a file were in the same format. Hope this helps.

mduchoslav commented 1 year ago

Hi, I am dealing with the same issue. In my case, I have problem with Arabidopsis lyrata files (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/255/GCF_000004255.2_v.1.0/). The script is leaving there isoforms for some genes. It seems that it is not possible to get the information that the proteins belong to the same gene just from the fasta header and it is needed to use the GFF annotation file. It would be useful to at least warn the users that the primary_transcript.py might not work properly for NCBI data. Best regards, Milos