Any suggestions for an easy way to generate files containing CDS sequences for ortholog groups?

esforsythe commented 9 years ago

Hello, I used orthofinder to cluster orthologous groups from the proteomes of 7 plants and it worked great! I plan to infer gene trees for ~6000 of these OGs. I used trees_for_orthogroups.py to generate a separate file for each OG (output in the Sequences/ directory). My problem is that I would like to use DNA sequences for downstream analyses (rather than the protein sequence). SO, I would like to generate a separate file for each OG that contains the CDS (DNA) sequences of the orthologs. I have CDS fasta files for each of the species but I am faced with the challenge of extracting the appropriate sequences and putting them into the appropriate files. I don't imagine it would be too difficult to code but I figured I'd see if anyone already has a solution so that I don't have to re-invent the wheel! Thank you very much!

tarunaaggarwal commented 8 years ago

Hi! I have a python script that does exactly that. Do you still need to pull out corresponding CDS seqs for your orthogroups?

drelo commented 8 years ago

Yes, I am still trying to use this approach ahead. I recently received more data that I must incorporate in the workflow, the script would be very useful. Thanks in advance.

El sáb., 23 de ene. de 2016 a la(s) 10:07, Taruna notifications@github.com escribió:

Hi! I have a python script that does exactly that. Do you still need to pull out corresponding CDS seqs for your orthogroups?

— Reply to this email directly or view it on GitHub https://github.com/davidemms/OrthoFinder/issues/8#issuecomment-174185204 .

Postdoctoral Associate

Instituto de Cs Ambientales y Evolutivas Universidad Austral de Chile Campus Isla Teja s/n Valdivia, Chile

tarunaaggarwal commented 8 years ago

This script is specific to my genomes (file paths, the identifiers Orthofinder chooses from the headers) Will you be able to change a few things in it if I can specify exactly how? Here it is...

https://github.com/macmanes-lab/GeosmithiaComparativeGenomics/blob/master/proteinOrthos2dnaOrthos_TA.py

tarunaaggarwal commented 8 years ago

@drelo Hi! I'm sorry I should asked before sending you this link to the script...You have to name your files in a specific manner before running Orthofinder and I'm guessing you have already run Orthofinder, right?

drelo commented 8 years ago

Yeah I already did, let me know how to run or prepare the run, thanks again. I also didn't get this " identifiers Orthofinder chooses from the headers" but I guess it's related to the stuff above.

El sáb., 23 de ene. de 2016 a la(s) 10:20, Taruna notifications@github.com escribió:

@drelo https://github.com/drelo Hi! I'm sorry I should asked before sending you this link to the script...You have to name your files in a specific manner before running Orthofinder and I'm guessing you have already run Orthofinder, right?

— Reply to this email directly or view it on GitHub https://github.com/davidemms/OrthoFinder/issues/8#issuecomment-174186104 .

Postdoctoral Associate

Instituto de Cs Ambientales y Evolutivas Universidad Austral de Chile Campus Isla Teja s/n Valdivia, Chile

tarunaaggarwal commented 8 years ago

Ah I see! Well, that makes things a bit complicated and here's why. I will start by explaining why I mean by identifiers. Also, you should know that this script was used for orthogroup files that contain single-copy orthologs. But it should work for any orthogroup file(s) so as long as you don't have duplicate headers in those files. I don't think you will but you may want to double check some of the headers and ensure they are unique (their sequences being paralogs won't affect anything). Okay, back to identifiers. The identifiers are the part of the headers from pep files that Orthofinder chooses to tack on to the headers in the orthogroup fasta files.

For instance:

You pep files might be named as such

Zymoseptoria_tritici.MG2.30.pep.all.fa
Fusarium_solani.v2.0.30.pep.all.all.fa
Grosmannia_clavigera_kw1407.GCA_000143105.2.30.pep.all.fa
Talaromyces_aculeatus.ATCC_10409.v1.0.pep.all.fa

Let's look at some headers of Zymoseptoria_tritici.MG2.30.pep.all.fa

>Mycgr3P32211 pep:known chromosome:MG2:1:83873:84103:-1 gene:Mycgr3G32211 transcript:Mycgr3T32211 description:"Putative uncharacterized protein"
>Mycgr3P88584 pep:known chromosome:MG2:1:110070:111032:-1 gene:Mycgr3G88584 transcript:Mycgr3T88584 description:"Putative uncharacterized protein"
>Mycgr3P88585 pep:known chromosome:MG2:1:112306:117149:1 gene:Mycgr3G88585 transcript:Mycgr3T88585 description:"Putative uncharacterized protein"
>Mycgr3P88586 pep:known chromosome:MG2:1:117450:118798:-1 gene:Mycgr3G88586 transcript:Mycgr3T88586 description:"Putative uncharacterized protein"
>Mycgr3P51803 pep:known chromosome:MG2:1:120032:120827:1 gene:Mycgr3G51803 transcript:Mycgr3T51803 description:"Putative uncharacterized protein"

Now, if we look at the headers from an alignment file generated by Orthofinder's trees_for_orthogroups.py of only Zymoseptoria_tritici.MG2.30.pep.all.fa, you will notice everything before all is the original file name and everything after is some part of the headers from the pep file. I don't know how Orthofinder decides which part of the header to insert into the orthogroup file. I had to manually find the pattern for my pep files and orthogroup files and apply that pattern to my CDS files.

.
.
.
>Zymoseptoria_tritici.MG2.29.pep.all_Mycgr3P28532
.
.
.

Depending on how one's peptide files are named and how the headers in those peptide files are organized, it can be a pain to find corresponding CDSs. So it really helps if all the files are named in a same and specific manner before Orthofinder is run. I believe Orthofinder drops the last three characters in the file name (such as .fa), adds an _ and then adds what I call the identifier (some part of the header from the pep files of each genome). Another thing that makes pulling out CDSs more difficult is how headers in the CDS files are named. For example, in my Zymoseptoria_tritici.MG2.30 CDS file, Mycgr3P28532 is really Mycgr3T28532 - P is replaced with T. So the script has to take that into account as well.

So, if you can, please rename your files and rerun Orthofinder (using as many threads as you can). Then my script will be super easy to use. Here are all the steps you will need to follow to get you CDSs.

A. Rename your pep file. You want to use a unique word in the new names for the script to work. Given your files end in `.fa`, here is how to rename the files quickly.

rename -v 's/.fa/_CHILE4DRELO.fa/' *

B. Then, you can rerun Orthofinder.

C. Concatenate all your CDS files into one gaint file. You can use wildcards to cat them. Please make sure you have ONLY the CDS files when you're cat'ing them.

cat *.cds.fa > ALLcds.fa

D. Change the following lines in the `proteinOrthos2dnaOrthos_TA.py`

Line 31 and 32

default_name = currentLine.split('CHILE4DRELO_')[1]
default_name = currentLine.split('CHILE4DRELO_')[0]

Line 38 through 41 based on the identifiers in your Orthogroup files

E. Make the script executable

chmod +x proteinOrthos2dnaOrthos_TA.py

F. Run the script. Please make you keep the order to the arguments intact

./proteinOrthos2dnaOrthos_TA.py PATH_TO_ORTHOGROUPS_DIRECTORY PATH_TO_ALLcds.fa

This will generate a bunch of files with CDS as the prefix followed by the orthogroup file name. The headers in these files contain three things: species name + orthogroup ID + entire header from the CDS file.

I hope this script helps. I would check the output files and make they have the same number of sequences as their corresponding protein orthogroup files. Please let me know if something does not make sense. Good luck!

drelo commented 8 years ago

Thanks for you thorough explanation I will try this. This walkthrough seems really good way to handle this as far as I can grasp it.

El sáb., 23 de ene. de 2016 a la(s) 12:11, Taruna notifications@github.com escribió:

Ah I see! Well, that makes things a bit complicated and here's why. I will start by explaining why I mean by identifiers. Also, you should know that this script was used for orthogroup files that contain single-copy orthologs. But it should work for any orthogroup file(s) so as long as you don't have duplicate headers in those files. I don't think you will but you may want to double check some of the headers and ensure they are unique (their sequences being paralogs won't affect anything). Okay, back to identifiers. The identifiers are the part of the headers from pep files that Orthofinder chooses to tack on to the headers in the orthogroup fasta files.

For instance:

You pep files might be named as such

Zymoseptoria_tritici.MG2.30.pep.all.fa Fusarium_solani.v2.0.30.pep.all.all.fa Grosmannia_clavigera_kw1407.GCA_000143105.2.30.pep.all.fa Talaromyces_aculeatus.ATCC_10409.v1.0.pep.all.fa

Let's look at some headers of Zymoseptoria_tritici.MG2.30.pep.all.fa

Mycgr3P32211 pep:known chromosome:MG2:1:83873:84103:-1 gene:Mycgr3G32211 transcript:Mycgr3T32211 description:"Putative uncharacterized protein" Mycgr3P88584 pep:known chromosome:MG2:1:110070:111032:-1 gene:Mycgr3G88584 transcript:Mycgr3T88584 description:"Putative uncharacterized protein" Mycgr3P88585 pep:known chromosome:MG2:1:112306:117149:1 gene:Mycgr3G88585 transcript:Mycgr3T88585 description:"Putative uncharacterized protein" Mycgr3P88586 pep:known chromosome:MG2:1:117450:118798:-1 gene:Mycgr3G88586 transcript:Mycgr3T88586 description:"Putative uncharacterized protein" Mycgr3P51803 pep:known chromosome:MG2:1:120032:120827:1 gene:Mycgr3G51803 transcript:Mycgr3T51803 description:"Putative uncharacterized protein"

Now, if we look at the headers from an alignment file generated by Orthofinder's trees_for_orthogroups.py of only Zymoseptoria_tritici.MG2.30.pep.all.fa, you will notice everything before all is the original file name and everything after is some part of the headers from the pep file. I don't know how Orthofinder decides which part of the header to insert into the orthogroup file. I had to manually find the pattern for my pep files and orthogroup files and apply that pattern to my CDS files.

. . .

Zymoseptoria_tritici.MG2.29.pep.all_Mycgr3P28532 . . .

Depending on how one's peptide files are named and how the headers in those peptide files are organized, it can be a pain to find corresponding CDSs. So it really helps if all the files are named in a same and specific manner before Orthofinder is run. I believe Orthofinder drops the last three characters in the file name (such as .fa), adds an _ and then adds what I call the identifier (some part of the header from the pep files of each genome). Another thing that makes pulling out CDSs more difficult is how headers in the CDS files are named. For example, in my Zymoseptoria_tritici.MG2.30 CDS file, Mycgr3P28532 is really Mycgr3T28532

P is replaced with T. So the script has to take that into account as well.

So, if you can, please rename your files and rerun Orthofinder (using as many threads as you can). Then my script will be super easy to use. Here are all the steps you will need to follow to get you CDSs. A. Rename your pep file. You want to use a unique word in the new names for the script to work. Given your files end in .fa, here is how to rename the files quickly.

rename -v 's/.fa/_CHILE4DRELO.fa/' *

B. Then, you can rerun Orthofinder. C. Concatenate all your CDS files into one gaint file. You can use wildcards to cat them. Please make sure you have ONLY the CDS files when you're cat'ing them.

cat *.cds.fa > ALLcds.fa

D. Change the following lines in the proteinOrthos2dnaOrthos_TA.py Line 31 and 32

defaultname = currentLine.split('CHILE4DRELO')[1] defaultname = currentLine.split('CHILE4DRELO')[0]

Line 38 through 41 based on the identifiers in your Orthogroup files E. Make the script executable

chmod +x proteinOrthos2dnaOrthos_TA.py

F. Run the script. Please make you keep the order to the arguments intact

./proteinOrthos2dnaOrthos_TA.py PATH_TO_ORTHOGROUPS_DIRECTORY PATH_TO_ALLcds.fa

This will generate a bunch of files with CDS as the prefix followed by the orthogroup file name. The headers in these files contain three things: species name + orthogroup ID + entire header from the CDS file.

I hope this script helps. I would check the output files and make they have the same number of sequences as their corresponding protein orthogroup files. Please let me know if something does not make sense. Good luck!

— Reply to this email directly or view it on GitHub https://github.com/davidemms/OrthoFinder/issues/8#issuecomment-174192685 .

Postdoctoral Associate

Instituto de Cs Ambientales y Evolutivas Universidad Austral de Chile Campus Isla Teja s/n Valdivia, Chile

tarunaaggarwal commented 8 years ago

Hi Drelo. I just wanted to let you know that I found an error in my script today and am working on fixing it. Just wanted to give you a heads up in case you are still planning on using it... I can put a comment after I update the script. Sorry and thanks!

drelo commented 8 years ago

Thanks for the update

El mar., 26 de ene. de 2016 a la(s) 21:29, Taruna notifications@github.com escribió:

Hi Drelo. I just wanted to let you know that I found an error in my script today and am working on fixing it. Just wanted to give you a heads up in case you are still planning on using it... I can put a comment after I update the script. Sorry and thanks!

— Reply to this email directly or view it on GitHub https://github.com/davidemms/OrthoFinder/issues/8#issuecomment-175312119 .

Postdoctoral Associate

Instituto de Cs Ambientales y Evolutivas Universidad Austral de Chile Campus Isla Teja s/n Valdivia, Chile

davidemms / OrthoFinder