Using scripts to process protein data from NCBI

wang748 commented 1 month ago

Dear developer, I was trying to download the pep.all.fa files for the species “Conger conger” and “Gymnothorax javanicus” from the Ensembl database, but Ensembl has no files for these two species. NCBI does have reference files for them, so I want to get their protein sequence files from NCBI instead. However, the annotations in the NCBI protein files are not very good. Is there a script that can convert the NCBI protein sequence files to the Ensembl annotation format, i.e. replace their headers, so that I can use primary_transcript.py to extract the longest transcript for each gene?

lauriebelch commented 1 month ago

Hi wang748,

We do have an experimental script for getting primary transcripts from NCBI data. If you provide me with links to the genomes you want on NCBI, I can take a look.

Thanks,

OrthoLaurie

wang748 commented 1 month ago

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/963/514/075/GCF_963514075.1_fConCon1.1/GCF_963514075.1_fConCon1.1_protein.faa.gz

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/555/375/GCF_018555375.3_ASM1855537v3/GCF_018555375.3_ASM1855537v3_protein.faa.gz

The two links above are the protein sequence files from which I need to extract the primary transcripts. Please check them out, thank you very much!

lauriebelch commented 1 month ago

I'll take a look now. The script uses the GFF files to extract the longest transcript per gene (similar to the primary transcripts script for Ensembl). It will be published and available with the next version of OrthoFinder.

lauriebelch commented 1 month ago

primary_transcripts.zip

Hopefully this has worked! I would definitely check that the number of genes is what you are expecting for each species.
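One rough way to run that check is to count the FASTA headers in each output file and compare the count with the number of protein-coding genes you expect for the species. The sketch below is only an illustration: the function name `count_fasta_records` is hypothetical and is not part of the attached script or of OrthoFinder.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>').

    Works on plain-text files and on gzip-compressed files ending in '.gz'.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Running this on both the raw NCBI protein file and the primary-transcript file should show the second count dropping to roughly one sequence per protein-coding gene.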

wang748 commented 1 month ago

Thank you for your help! I would also like to know how your script works. For each row whose third column is mRNA in the GFF file, does it subtract the start position from the end position, compare that value among the mRNAs belonging to the same gene, and take the mRNA with the largest value as the primary transcript? Also, if I don't extract the primary transcripts, will that have a bad effect on the results generated by OrthoFinder?

lauriebelch commented 1 month ago

It works by mapping each protein ID in the protein fasta .faa file to a gene in the .gff file. For each gene we then have a set of protein IDs. We then simply take the longest protein (sequence length) for each gene as the primary transcript. I can send you the script if you want?
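A minimal sketch of the approach described above, for readers who want to see the idea in code. It is not the actual ncbi_primary_transcripts.py. It assumes an NCBI-style GFF3 file in which CDS rows carry `protein_id` and `gene` attributes in column 9, and all function names here are hypothetical.

```python
def parse_fasta(lines):
    """Yield (protein_id, sequence) pairs; the ID is the first word of each header."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def protein_to_gene(gff_lines):
    """Map protein IDs to gene names using the CDS rows of a GFF3 file.

    Assumption: column 9 contains 'protein_id=...' and 'gene=...' attributes.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "protein_id" in attrs and "gene" in attrs:
            mapping[attrs["protein_id"]] = attrs["gene"]
    return mapping


def longest_protein_per_gene(fasta_lines, gff_lines):
    """Return {gene: (protein_id, sequence)}, keeping the longest protein per gene.

    Proteins with no gene in the GFF are skipped; ties keep the first protein seen.
    """
    gene_of = protein_to_gene(gff_lines)
    best = {}
    for protein_id, seq in parse_fasta(fasta_lines):
        gene = gene_of.get(protein_id)
        if gene is None:
            continue
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (protein_id, seq)
    return best
```

Both functions accept any iterable of lines, so open file handles for the .faa and .gff files can be passed in directly. Keying on the `gene` attribute is a simplification; a real script might prefer a stable gene identifier if gene names are not unique.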

If we ran OrthoFinder on the raw files (without selecting only the primary transcripts) it would take 10x longer than necessary and could lower the accuracy.

wang748 commented 1 month ago

I think I need this script, thanks a lot! Here is my email address you can send the script to: 1296214047@qq.com

ferrojm commented 4 weeks ago

Hi! I would like that script as well, where can I find it? thanks!

lauriebelch commented 3 weeks ago

Getting data from NCBI.pdf
ncbi_primary_transcripts.py.zip

Here is the script, and a brief PDF explaining how to get data to use it. Please let me know if it is helpful / what might make it more helpful!

wang748 commented 2 weeks ago

Hi, I did a little test using the protein sequence file and GFF file of the above-mentioned species Anguilla rostrata, but the resulting primary_transcripts folder is empty. Am I not using it correctly? I put ncbi_primary_transcripts.py in the same folder as the protein sequence file and the GFF file and ran “python ncbi_primary_transcripts.py”, as shown in the attached picture. Thank you very much for helping me with this problem.

lauriebelch commented 2 weeks ago

I think the script is looking for a zip file that you downloaded from NCBI (not the un-zipped .faa and .gff files)

wang748 commented 2 weeks ago

Hi, I am using zip files. Your script creates a folder, extracts the zip file into it, and then reports two path errors: “Could not find required files in test/GCF_018555375.3ASM1855537v3 genomic.gff.zip” and “test/GCF_018555375.3_ASM1855537v3_protein.faa.zip”. I'm using a Linux operating system, so the cause may be something I did or a problem with the files. Since all my data is already decompressed in bulk, I decided to modify your script slightly: I only removed the decompression step, so that the script reads the decompressed files directly and then processes them according to your logic. After testing, the resulting file is identical to the one you sent me. I'm worried that there might be something wrong with my script, so I'd like to ask you to take a look at it. Thank you very much for your guidance!

ncbi_primary_transcripts_w.zip
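For anyone attempting a similar modification, here is a rough sketch of locating already-decompressed inputs in a folder. It is not taken from either version of the script. The function name `find_inputs` and the filename patterns are assumptions based on the NCBI file names quoted in this thread.

```python
from pathlib import Path


def find_inputs(folder):
    """Return (protein_fasta, gff) paths found in `folder`.

    Assumption: files follow the NCBI naming seen above, ending in
    'protein.faa' and 'genomic.gff', optionally with a '.gz' suffix.
    Uncompressed files are preferred when both forms are present.
    """
    folder = Path(folder)

    def first_match(patterns):
        for pattern in patterns:
            hits = sorted(folder.glob(pattern))
            if hits:
                return hits[0]
        return None

    faa = first_match(["*protein.faa", "*protein.faa.gz"])
    gff = first_match(["*genomic.gff", "*genomic.gff.gz"])
    if faa is None or gff is None:
        raise FileNotFoundError(f"Could not find required files in {folder}")
    return faa, gff
```

Raising an error that names the folder makes it easier to tell whether the problem is a wrong path or an unexpected file name.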

davidemms / OrthoFinder

Using scripts to process protein data from NCBI #930