Open MrTomRod opened 4 years ago
Hi
This looks nice, could you talk me through this a little more? From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name. Do you know how general purpose it is? What sources of fasta files does it give meaningful gene names for?
I think it's a nice idea. In designing OrthoFinder everything has to be general purpose so I've avoided making assumptions about the description lines in FASTA files--I only work with the sequence data, which I know will be in the form I expect. So a consideration for me would be what proportion of users would it work for and how much support I'd be called on to give if it didn't work for someone's dataset. I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.
I think another step I'd probably take if I included it in the tools directory would be to remove the dependencies on biopython, fire and pandas. I love using those libraries myself but for an OrthoFinder userbase I I'd generally try to reduce the dependencies as much as possible.
Let me know your thoughts on this.
Thanks! David
From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name.
Exactly! Then, it checks how frequent each name is. It's very fast and little can go wrong, since the fasta format is quite simple.
Do you know how general purpose it is?
No, but it was one of the first things I did with your output. It gives one a quick-and-dirty description of what the OGs role might be.
What sources of fasta files does it give meaningful gene names for?
I'm working with prokaryotes, and both Prokka as well as NCBI-annotated files (PGAP pipeline) give compatible output. These are the two most common ones, I think.
I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.
I think a list of links to tools is the best option because it makes clear that these scripts aren't your responsibility and you don't have to do be involved if the scripts are updated.
I'd generally try to reduce the dependencies as much as possible.
If you want me to, I'd be happy to remove the dependencies.
Btw: If you looked at my repo, you will have noticed that I also made a script that mimicks roary_plots.py
. roary_plots.py
was also not made by the roary-devs but a guy called Marco Galardini.
Hi
I had a play with the script using Ensembl proteomes but I couldn't get it to work. If it could support those then I think that could increase its usefulness further. But, as an added complication, with these I advise users to run a script which extracts only the longest transcript variants for input into OrthoFinder: https://davidemms.github.io/orthofinder_tutorials/running-an-example-orthofinder-analysis.html
As a result the OrthoFinder results use only the gene name to refer to each gene, e.g. >ENSDARG00000098423.2
instead of
>ENSDARP00000136500.1 pep chromosome:GRCz11:2:31885581:31886019:-1 gene:ENSDARG00000098423.2 transcript:ENSDART00000170804.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:trgv2 description:T cell receptor gamma variable 2 [Source:ZFIN;Acc:ZDB-GENE-051115-14]
Obviously your script could be provided with these original fasta files with the full description to get the gene names but it would need to know how to take either "ENSDARG00000098423.2" or "ENSDARP00000136500.1" and find the full description in the fasta file.
The other source I use regularly myself is Phytozome, I don't know how that ranks in terms of overall popularity.
I'll try and find some time in the coming weeks to find out about and put together a list of tools people have developed for OthoFinder and will then put it up on the webpage.
All the best David
I don't fully understand...
The input fastas only contain the gene name (ENSDARG00000098423.2
)?
Could you provide me with such a file?
How would you design the solution? Add an ensembl-mode which translates the gene name to a description using their API? (Or automatically detect ensembl-ids since they follow a simple pattern.)
Putting aside the primary_transcipts.py script for a minute. If I download a few proteomes from Ensembl and run OrthoFinder on them and then the orthogroup_to_gene_name.py script on the results the lines I get look like this:
Orthogroup Best Gene Name Gene Name Occurrences
...
...
OG0024176 pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527] {'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040959.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1}
OG0024177 pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857] {'pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:8:16024975:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049687.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1}
so it's not pulling out just the gene names for Ensembl file, I think it's giving the full line? I think getting it to work with these files is the main challenge.
The primary transcripts script is only a minor complication. If the user ran the primary_transcripts.py script their input to OrthoFinder would only contain the gene names (ENSDARG00000098423.2
) but they would still have a folder of the original fasta files*. And so this folder would be the directory they provide to your script rather than the directory of processed fasta files provided to OrthoFinder as input. So then your script would just have to behave more or less as it currently does and extract the info from the fasta files by looking up the identifier ENSDARG00000098423.2.
*E.g Ensembl fasta file: ftp://ftp.ensembl.org/pub/release-100/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz
This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..
This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..
Glad someone else finds it useful. If your fasta files contain descriptions, it should work. I used it recently on the output of the most recent OrthoFinder without an issue. If you have a problem, create an issue on my repo...
Also, I wrote a tiny how-to in README.md, in case you missed it.
I'll take some time today to run @davidemms tutorial and get my script working with that output.
It'd be great to see it working on the tutorial data, Ensembl is widely used and I'd love to get to try it on some of my analyses!
I had an accident and thought I'd spend some of my free time working on non-stressful projects like this, but I just learned that this (screen time) may slow my recovery or even incur long-time consequences.
So I won't be working on it until for the immediate future, sorry!
@davidemms Btw, I've been incorporating OrthoFinder into a PhD sub-project, a tool I named OpenGenomeBrowser. If you're interested, test out an early prototype here! This is where OrthoFinder comes into play: Trees view.
Safe recovery @MrTomRod ! Maybe I can adapt the scripts myself (also the roary plots, I meant). I'f I do I'll write here!
@MrTomRod I'm getting an error:
KeyError: "['OG' 'Gene Tree Parent Clade'] not found in axis"
Does this ring any bells?
It does! I forgot to document the recent changes I made to the script. Sorry. It should work now if you follow the manual:
https://github.com/MrTomRod/orthofinder-tools/blob/master/README.md
Awesome, I'm going to check it out now.
Am I using the wrong file?
In [2]: df = pd.read_csv("orthofinder_output/Orthogroups/Orthogroups.tsv", sep="\t")
df
In [3]: df.columns
Out[3]:
Index(['Orthogroup', 'GCA_002571385.1', 'GCA_003704095.1', 'GCF_000222465.1',
'GCF_002042975.1', 'GCF_002571385.1', 'GCF_003704095.1',
'GCF_004143615.1', 'Mussismilia_hispida'],
dtype='object')
Also, where do the annotations come from as I'm not seeing them in the dataframe?
Thanks!
Yes, you have to use the new N0 file.
OrthogroupToGeneName(
n0_tsv='orthofinder_output/Phylogenetic_Hierarchical_Orthogroups/N0.tsv',
index_column='HOG'
).run(
fasta_dir=PATH_TO_ORTHOFINDER_FASTAS,
file_endings='faa',
out='path/to/output.tsv'
)
The annotations come from the fasta files. (PATH_TO_ORTHOFINDER_FASTAS
)
Ok cool, I just got it to run! So basically, we provide annotations as the description? I have a de-novo genome so I'll need to add my own annotations for some of these.
btw, thank you for creating this script!
Ok cool, I just got it to run! So basically, we provide annotations as the description?
for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.
I have a de-novo genome so I'll need to add my own annotations for some of these.
keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.
btw, thank you for creating this script!
no worries, i made it quick and dirty for myself and decided to share it.
Hello MrTomRod,
Ok cool, I just got it to run! So basically, we provide annotations as the description?
for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.
I have a de-novo genome so I'll need to add my own annotations for some of these.
keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.
btw, thank you for creating this script!
no worries, i made it quick and dirty for myself and decided to share it.
I used you scripts for ploting. I can get the results with the following warning. /home/AAFC-AAC/fuf/miniconda3/envs/orthofinder/lib/python3.8/site-packages/fire/core.py:672: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89) have mixed types.Specify dtype option on import or set low_memory=False. component = fn(*varargs, **kwargs) But the results look good. Thanks, Fuyou
Hi!
I wrote a quick-and-dirty script (
orthogroup_to_gene_name.py
) that takesOrthogroups.tsv
and turns it intoOrthogroup_BestNames.tsv
. This file looks like this:Orthogroup Best Gene Name Gene Name Occurrences OG0000000 ABC transporter ATP-binding protein {'Fluoroquinolones export ATP-binding protein/MT2762': 2, 'Sulfate/thiosulfate import ATP-binding protein CysA': 2, ...} OG0000001 IS30 family transposase, partial {'IS30 family transposase': 144, 'IS30 family transposase, partial': 518, 'Integrase core domain protein': 4, 'hypothetical protein': 6} OG0000002 IS5/IS1182 family transposase, partial {'IS5 family transposase': 106, 'IS5/IS1182 family transposase, partial': 326, 'IS5 family transposase, partial': 81, ...} I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!
Best, MrTomRod
Hello MrTomRod,
I got the results from OrthoFinder, I did not find the file directory about Phylogenetic_Hierarchical_Orthogroups/N0.tsv.
I tried a few times. I still did not get this. I am a little confused.
Could you tell me how I can get this?
Thanks,
Fuyou
You can't find the folder Phylogenetic_Hierarchical_Orthogroups
?
Are you using the latest version of Orthofinder? Because older versions don't create it.
Here are the instructions for my script.
You can't find the folder
Phylogenetic_Hierarchical_Orthogroups
?Are you using the latest version of Orthofinder? Because older versions don't create it.
Here are the instructions for my script.
I got this question. Because I did not update my orthofinder. I just want to tell you about this. Thanks, Fuyou
You can't find the folder
Phylogenetic_Hierarchical_Orthogroups
?Are you using the latest version of Orthofinder? Because older versions don't create it.
Here are the instructions for my script.
Hello MrTomRod, I tried this script. But I get the result is as following. I can not find the functional category. HOG Best Gene Name Gene Name Occurrences N0.HOG0000000 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1}) N0.HOG0000001 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1}) N0.HOG0000002 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1}) N0.HOG0000003 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1}) N0.HOG0000004 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1}) N0.HOG0000005 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1}) N0.HOG0000006 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1}) N0.HOG0000007 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1}) N0.HOG0000008 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1}) N0.HOG0000009 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1}) N0.HOG0000010 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1}) N0.HOG0000011 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1}) N0.HOG0000012 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1}) N0.HOG0000013 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1}) N0.HOG0000014 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1}) N0.HOG0000015 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1}) N0.HOG0000016 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1}) N0.HOG0000017 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1}) N0.HOG0000018 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1}) If I want to get the functional catergory, what can I do? I read the new tutorials, I did not find any new. My all protein sequences are de novo annotation. So I should download some sequences from database with functional catergory? I am much appreciated for your helping. Best, Fuyou
Please show me the first 5 lines of one of your fasta files.
Please show me the first 5 lines of one of your fasta files.
ps/N0.tsv N0.HOG0000001 OG0000000 n9 L022_001102-T1 LREF_000140-T3 N0.HOG0000002 OG0000000 n6 L022_001102-T3 LREF_000140-T6, LREF_000140-T1 N0.HOG0000003 OG0000000 n8 L022_001102-T5 LREF_000140-T2 N0.HOG0000004 OG0000001 n1 L022_006069-T1 LREF_005195-T4 N0.HOG0000005 OG0000001 n9 L022_006069-T3 LREF_005195-T1 N0.HOG0000006 OG0000001 n8 L022_006069-T5 LREF_005195-T3 Thanks, Fuyou
no, the first 5 lines of one of your FASTA files.
L022_000001-T1 L022_000001 MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTSA AVWNIAGTYIKAKSWLPGMELESDTIQYVKRISSIPIPEIIFSWVDVEWDRTFLVLKALKGRTLDQVWESLSANRRMEIA DKVAQFCRTLALSTSEMLMTANGNAALEPFLTSLPPASEPSWKPRNLGPFSSCQLRSYLSKPTVLDDIDLFHFYHADLGP SNVIVTDDGSIVGIIDWESAAFYPKFWLGTKPLVSVGFTVRGAEKRAWAVLLASSLERVGFSPDLDKYEAWKKAIGK L022_000002-T1 L022_000002 MDARPVLKVFSVLRCLDSLRSSRPPLQFCPSALSPTMPSTFTVLSAGLVAASSGAFAQKVPETAQVIDQKLFNVLDVVPP PSVFNGSTLFTWPGVTAESLTAKPFHIYDEEFYDVIGSDPTLTLIASSESDPIFHEAVTWHPPRDEVFFVQNAGAPAAGT GLNKSSIVQKISLAEAEAVRTGKLDAVTVTVVNTSNPQVINPNGGTNYKGQIIFAGEGQGADITSALYLMNPEPPYNTTT LLNNYFGRQFNSLNDLVVHPKNGDVYFTDTLYGFLQDFRPPPGLRNQVYRLNVETGAVTVVADDFTLPNGITINPEGDRA YVTDTGIALGYYGRNLSSPASVYSFDVNEDGTWENRKTFAYTPAFVPDGVHTDSEGRVYTGCGDGVHVFNTSGKLIGKIY TGMSAANFQFAGKGRMIITGQTKLFYVTLAAEGAPLT L022_000003-T1 L022_000003 MMYRVKAVYDYSSPHEDDLDFKAGQIISVTEEEGDDWYVGEYIDDTGTKRDGLFPRNFVEKYEPEPPPRPNRASRYKTTE QPAVQAAPPIPDTPQPQEPDQQPEPEPEPEPEPEPAPEPAPEPEPVRHEEPEPPRAQPPPLEVPAAAKTVPLPMSPMSPA SASTRRAPELPAQAPLQQESAAKPAPASKKPPPPIAAKSNAFRDRIAAFNAPAAAPIQPFKPAGAPTNFIKKPFVAPPPS RNAYVPPVREAPPVQTYRREEDPEIAERQAQDNDAAEKAGLVAHDAPAQAATEEESQPKVSLKERIALLQKQQQEQAIRA AAMHKEKPRRPPVKKRTESIDAHPEDSEDTGLERVASGASKERISMDQARPPRVSHDIKSPDEHQRHREILSDANDADQS AAGDTEDAGGESTSVEGDEERNNVQQTQPSLPIRAAAAPTQEPDVGDEQDVEPEEGEEEEEEDEMDAETRRKLELRERMA KMSGGMGMPGMFGGIPMGGLPPPKKKKPILESKRVDGDEEPSLPQQRVAMLPMPGMPTVKSPEQEDRQVLQVEKEDETAP PITSAHRPDEVPDVEDVQPHLTDRTPTGEQPPPVPMEKRPLAPPVPSATRPAPPAPPQVLSPGPGSESGDEMTDAPSALS PNTPSAPFASPTKRNSYFNSDNMRQAPSIPLQAARPSSPAVSRPPPPPPPTQAPPSRHGEEPPRRLDRNEGETDYEGDYD TDIASGATHKDALKSHAREASMDGSTTEEPSPARSPTAPHGLPPFPPHLPRAVPPPPPHQPPSRSSIDVPRGAPPLPPVR PTTNDVQLDEDDDEYDPYRYAGPPVASPPPMPRAVPPPPQSQPPQPHNKPPPPMPPSMPPAVHRQAPESDEDDLYSAPQP RKSHDRPRPPPPPQAAPHQPAPPPAHQHAPPPPPRQFAPPLPPQERPAPPPPPDRSTMAAPGGEPQPRMPIGRKSLDANR ALGSRPSMDQSRPGLGADFIARDIDLGMSSHWWTQPKMLPPSLQGRKDILVDMQESTSGNTVEKLVKVLHLDYSQTTVSA RFDLSNVSDVELEQRNDPPPPTQRPDQLEAAYEQYGRAIAKAVELKQNTVVGAGTPWSLVDELLKPLADALPPVSTRAFG ALVYANLANASTQSFDEIRPGDIITFRNARFQGKTGPMHAKYAVDVGKPDHVGVVVEWDGSKKKVRVWEQGREHKKVKQE SFRMGDLRSGEVRVWRVVGRSWVGW L022_000004-T1 L022_000004 MPPKSILKNSTTTVSTTDIPPGRKPANPRHLDVAIHHANILEDRKRVEAAVLDAIIMLMDFPLTPSANAKRPSPSDAQLF IDKLNIFQPADYDALIEERNLTDKCGYALCSNPKQKARSTARKQFIDTDHGVEIVDRKVLEVWCSSDCARRALYVKVQLK EEPAWLRQGGFADKIELMVDNTEEHDNALPLPLKKEEDQSTNAQDDDVSAAWAAQEEAMAELALERGEQPGQFTKANPGL VKEDIMERATSNAPPKPPTLVAQDMDNTTMAIEGHVPRIDRKRNDDDEDEEDAQDWDKHLPG L022_000005-T1 L022_000005 MSLSRQSLSPTANLLRNSRLFSLPNPLPKPPVGESFRSGITKASDTATLPYPTQQAIATTKSSLARGDWGLKRPLPQRSY LVQVSDPVLRVNQLDTIEHVTDFDSAADHVRTRQKWEAMGVPMMKGMTQMRDKDLSGTPPAGAFEIRDDTTSYDTELGLD ESGLYLKALKDNVKTQADATKKAFDTLWMNKRQAQEAEWDAQGASGDGRVEKIKAFEQEQAQARKQLNKQLTRLRKSDFT PFTPPPVDAAVHNTKRWKHDGPWLPGMSADDFLAYLNKEITKRKPEFHRYLVEFVKNEIYTTRRLAASQAGEAPPMDMAE AEAWHARQEKQWRNITPDQIDAGIKSLRKETANNPLGSKLVTKLILPFLRLPAIKFKYTSYAEDASTRAIDQYQFDQETA PLSTHPSAGLGYLRTRSYITNHPILGPQAQPTPVTARVIQPRTAGTSKQVYARLGVGGFVANDQYRSTDVSSGNRAGVNN ARDVETIDIDTPGGKKLTVQPLFGSVTNDGRIHIKVKRSQGPEVQVAKGALDDRPPVREFARVERIERFASGKESGVKEL DEFSGRAKMLSGFLEGLGEGQGKGE
Thanks Fuyou
The reason is that your fastas are not functionally annotated.
My script only works if your fastas look something like this:
>L022000001-T1 L022_000001 Description Gene 1
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
>L022000001-T1 L022_000001 Description Gene 2
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
I got it. Thanks, Fuyou
On Thu, Aug 27, 2020 at 8:14 AM Thomas Roder notifications@github.com wrote:
The reason is that your fastas are not functionally annotated.
My script only works if your fastas look something like this:
L022000001-T1 L022_000001 Description Gene 1 MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS L022000001-T1 L022_000001 Description Gene 2 MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/davidemms/OrthoFinder/issues/362#issuecomment-681976045, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF3JCKCHUEJSBVI6NFZ4DNTSCZS5LANCNFSM4LSPLHOA .
-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA
Any such luck for Proteomes downloaded from uniprot?
Any such luck for Proteomes downloaded from uniprot?
This is an example header:
>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein OS=Lactiplantibacillus plantarum OX=1590 GN=AYO51_00560 PE=4 SV=1
If you change these headers by removing everything after OS=
, it should work fine.
Example:
header.rsplit(' OS=', maxsplit=1)[0]
Result:
'>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein'
Hello MrTomRod, I tried the script orthofider_plot.py (based on roary plot), but I I'm getting an error. I think it´s relate to library fire
I run this commnad python3 orthofinder_plots.py TREE ~/Species_Tree/SpeciesTree_rooted.txt ORTHOGROUPS_TSV ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv OUT output Could you help me how I can get this?
Follow the error.
Traceback (most recent call last):
File "orthofinder_plots.py", line 206, in
Thanks! All the best
Graci
Hey Garci
Sorry, my script is not great as I did not invest too much time into it. It's a quick-and-dirty rewrite of Marco Galardinis script.
The error you're seeing is not related to the Fire library. You have to read Python stack traces from the bottom up. Maybe this is useful to you: Understanding the Python Traceback.
The relevant line is assert os.path.isdir(out)
. The problem is that the variable out
does not point to an existing folder. Usage of my script is explained here.
python3 orthofinder_plots.py --tree ~/Species_Tree/SpeciesTree_rooted.txt --orthogroups_tsv ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv --out output
Maybe all you have to do is mkdir output
?
No errors! But I just wanted to say THANK YOU!
Just to know, how does it calculate the core, soft-core, shell, and cloud genomes?
Glad you find it useful.
I just copied the logic from Marco Galardini / roary_plots:
CORE, SOFT, SHELL = (n_strains * f for f in [.99, .95, .15])
core = ((og_count >= CORE) & (og_count <= n_strains)).sum()
softcore = ((og_count >= SOFT) & (og_count < CORE)).sum()
shell = ((og_count >= SHELL) & (og_count < SOFT)).sum()
cloud = (og_count < SHELL).sum()
Hope that's clear enough.
I think it's pretty clear.
Last thing, Is there a proper way to cite your script? I found this (https://guides.libraries.uc.edu/citing/code) but I would need your data.
Thanks, that's very nice.
So:
Roder, T (2022) MrTomRod/orthofinder-tools computer program (Version 0.0.2). https://github.com/MrTomRod/orthofinder-tools
Hi!
I wrote a quick-and-dirty script (
orthogroup_to_gene_name.py
) that takesOrthogroups.tsv
and turns it intoOrthogroup_BestNames.tsv
. This file looks like this:I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!
Best, MrTomRod