davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Using scripts to process protein data from NCBI #930

Open wang748 opened 6 days ago

wang748 commented 6 days ago

Dear developer, I was trying to download the pep.all.fa files for the species “Conger conger” and “Gymnothorax javanicus” from the Ensembl database, but Ensembl has no files for these two species. NCBI does have reference files for both, so I want to get their protein sequence files from NCBI instead. However, the annotations in the NCBI protein files are not very good. Is there a script that can convert the NCBI protein sequence files to the Ensembl annotation format, i.e. replace their headers, so that I can use primary_transcript.py to extract the longest transcript for each gene?

lauriebelch commented 4 days ago

Hi wang748,

We do have an experimental script for getting primary transcripts from NCBI data - if you provide me with links to the genomes you want on NCBI I can take a look

Thanks,

OrthoLaurie

wang748 commented 4 days ago

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/963/514/075/GCF_963514075.1_fConCon1.1/GCF_963514075.1_fConCon1.1_protein.faa.gz

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/555/375/GCF_018555375.3_ASM1855537v3/GCF_018555375.3_ASM1855537v3_protein.faa.gz

The two links above are the protein sequence files from which I need to extract the primary transcripts. Please check them out, thank you very much!

lauriebelch commented 4 days ago

I'll take a look now. The script uses the GFF files to extract the longest transcript per gene (similar to the primary transcripts script for Ensembl). It will be published and available with the next version of OrthoFinder.

lauriebelch commented 4 days ago

primary_transcripts.zip

Hopefully this has worked! I would definitely check that the number of genes is what you are expecting for each species.

wang748 commented 4 days ago

Thank you for your help! I would also like to know how your script chooses the primary transcript. For each row in the GFF file whose third column is mRNA, does it subtract the start position from the end position, compare these values across the mRNAs belonging to the same gene, and take the mRNA with the largest value as the primary transcript? Also, if I don't extract the primary transcripts, will that have a bad effect on the results generated by OrthoFinder?

lauriebelch commented 3 days ago

It works by mapping each protein ID in the protein fasta .faa file to a gene in the .gff file. For each gene we then have a set of protein IDs. We then simply take the longest protein (sequence length) for each gene as the primary transcript. I can send you the script if you want?

If we ran OrthoFinder on the raw files (without selecting only the primary transcripts) it would take 10x longer than necessary and could lower the accuracy.
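For illustration only, here is a minimal Python sketch of the approach described above: map each protein ID to a gene via the GFF, then keep the longest protein per gene. This is not the actual OrthoFinder script. The function names are hypothetical, and it assumes that the CDS rows of the NCBI GFF3 file carry both a `protein_id` and a `gene` attribute, and that the `gene` value identifies each gene uniquely.

```python
import gzip
from collections import defaultdict


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def read_fasta(lines):
    """Return {protein_id: sequence}; the ID is the first word of each header."""
    seqs, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = []
        elif current is not None and line:
            seqs[current].append(line)
    return {pid: "".join(parts) for pid, parts in seqs.items()}


def protein_to_gene(gff_lines):
    """Map protein IDs to genes using the CDS rows of a GFF3 file.

    Assumption: each CDS row has 'protein_id=' and 'gene=' attributes.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "protein_id" in attrs and "gene" in attrs:
            mapping[attrs["protein_id"]] = attrs["gene"]
    return mapping


def longest_per_gene(seqs, mapping):
    """Return {gene: protein_id}, keeping the longest protein for each gene."""
    by_gene = defaultdict(list)
    for pid in seqs:
        if pid in mapping:
            by_gene[mapping[pid]].append(pid)
    return {g: max(pids, key=lambda p: len(seqs[p])) for g, pids in by_gene.items()}
```

To use it, pass `open_text(...)` handles for the `.faa` and `.gff` files to `read_fasta` and `protein_to_gene`, then compare the number of genes returned by `longest_per_gene` with the gene count you expect for the species. Proteins that have no match in the GFF are silently dropped in this sketch, so that check matters.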

wang748 commented 3 days ago

I think I need this script, thanks a lot! Here is my email address you can send the script to: 1296214047@qq.com