drostlab / orthologr

Genome wide orthology inference and dNdS estimation
https://drostlab.github.io/orthologr/
GNU General Public License v2.0
88 stars 27 forks source link

h(simpleErrpr(msg, call)) error #41

Open DanielleMStevens opened 1 year ago

DanielleMStevens commented 1 year ago

Hi! I just first wanted to say thank you so much for developing this package. It is quite handy and overall greta resource for so many. I've been using the blasts comparison to confirm some previous classification for grouping a protein family. So far it has worked wonderfully. Though for one fasta file, I ran into the follow:

Rhodo_T1_T2 <- as.data.frame(orthologr::blast(query_file = '~/Documents/Mining_MAMPs/Mining_Known_MAMPs/Analyses/Catagorizing_CSPs/CSP_types/Typing_fasta_files/Rhodococcus_Type1.fasta', subject_file = '~/Documents/Mining_MAMPs/Mining_Known_MAMPs/Analyses/Catagorizing_CSPs/CSP_types/Typing_fasta_files/Rhodococcus_Type2.fasta', seq_type = 'protein'))

Running blastp: 2.11.0+ ...

Building a new DB, current time: 06/16/2023 17:03:59 New DB name: /tmp/RtmpvS42vk/_blast_db/blastdb_Rhodococcus_Type2.fasta_protein.fasta New DB title: blastdb_Rhodococcus_Type2.fasta_protein.fasta Sequence type: Protein Deleted existing Protein BLAST database named /tmp/RtmpvS42vk/_blast_db/blastdb_Rhodococcus_Type2.fasta_protein.fasta Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 5 sequences in 0.000640154 seconds.

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': File blastresult_Rhodococcus_Type1.fasta.csv could not be read correctly. Please check the correct path to blastresult_Rhodococcus_Type1.fasta.csv or whether BLAST did write the resulting hit table correctly.

Type1 fast file is fine when comparing against itself or the other two types. Type 2 errors against Type1 and Type3 but runs ok comparing against itself. I've tried reuploading the fasta files but that did nothing. Any ideas? Thanks in advance!

HajkD commented 1 year ago

Hi @DanielleMStevens

Thank you very much for reaching out.

This seems quite a strange behavior indeed.

Would it be possible to paste here the top 5 sequences (with header) from the Type1 and Type2 fasta files, so that I can reproduce the error and start trouble shooting?

How many sequences are roughly in these files to begin with?

The error seems to stem from the fact that in this Type1 vs Type2 search constellation no blastresult_* file is being produced (blastresult_Rhodococcus_Type1.fasta.csv could not be read correctly. )

With very best wishes, Hajk

DanielleMStevens commented 1 year ago

Hi Hajk! Wow, thank you so much for responding so quickly! So for more context, I'm trying to group the gene families to fill out an all by all comparison of each. From this I should be able to estimate (along with some synteny analyzes of the gene structure) to improve ortholog grouping across bacterial genera. From this, I am planning on running selection tests and potentially population genetic stats if I am able to. All of this was to prevent comparing too diverged paralogs were I could be inflating/deflating the calculated values. By looping all the files, I have been able to skip past the problematic combinations.

For example, the below diagram shows 0 values were some of the files failed (ex. Type1 vs. Type2 Rhodocococcus types).

Screen Shot 2023-06-18 at 2 28 07 PM

To answer your questions, maybe, though all type files are named consistently in terms of style. Here's the top 5 sequences from Rhodococcus_TYpe1.fasta:

>WP_032368157.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_A3b|2690
MLSTGKVIRFDEFKGYGFVAPDEGGEDVFIHVNDLDFDKRLLAPGVRVEYVAVDGDRGLKATEVQIIDGPTSTALVTRHKPDVPFHNTAPVSGHQRHDDAGADVLSEADFTTELTEQLLTSNGTLTAAQILDVRSAVVRLAHDHRWIQG
>WP_032402570.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_04-516|2697
MLSTGKVIRFDEFKGYGFVAPDEGGEDVFIHVNDLDFDKRLLAPGVRVEYVAVEGDRGLKATQVQIIDGPASTALVTRRKPDVPFHDIPVSEHQQLEDVICDVLSEGDFTAELTEALLTANGTLSATQILDIRSAILRLAHDHKWIEN
>WP_032368157.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_A78|2712
MLSTGKVIRFDEFKGYGFVAPDEGGEDVFIHVNDLDFDKRLLAPGVRVEYVAVDGDRGLKATEVQIIDGPTSTALVTRHKPDVPFHNTAPVSGHQRHDDAGADVLSEADFTTELTEQLLTSNGTLTAAQILDVRSAVVRLAHDHRWIQG
>WP_032368157.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_GIC26|2718
MLSTGKVIRFDEFKGYGFVAPDEGGEDVFIHVNDLDFDKRLLAPGVRVEYVAVDGDRGLKATEVQIIDGPTSTALVTRHKPDVPFHNTAPVSGHQRHDDAGADVLSEADFTTELTEQLLTSNGTLTAAQILDVRSAVVRLAHDHRWIQG
>WP_032368157.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_GIC36|2724
MLSTGKVIRFDEFKGYGFVAPDEGGEDVFIHVNDLDFDKRLLAPGVRVEYVAVDGDRGLKATEVQIIDGPTSTALVTRHKPDVPFHNTAPVSGHQRHDDAGADVLSEADFTTELTEQLLTSNGTLTAAQILDVRSAVVRLAHDHRWIQG

And here are the first 5 sequences from Rhodococcus_Type2.fasta:

>WP_114332301.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_A21d2|31078
MNSNYVAIDGPDIPGPRDAASTGTNCSITTLSMDMIDSTSQRYSKSIMSSNTEFMKMYDVWSKHLDAAPFKFVEKFSGDGLHITADSMHAGEVINLAMAMLHEFEKLGATAGGQPMGEIDFQVRAGIATGPAFRFRPKHGGNVDHLSSVCDRSARLCSAASAQALFVDPHTVNAANMTLVTSPLGDMLSRTPDDYLGDKQSVQLKGIPEPVPYFELLWSKGRFGVRSSVVTEISSAPTPPPPGSPSPAAARSSTGSDQLIGTTKVWMSDRGYGFVTGHDGEDFHVSSRALLYGEDAEHLRTGTRVVFLPADAAEAGKSRRAFSVVLVGQEAEGRVSFVHAEKPFGFIAVADDHGRELRFHMVVDAALRTALKAGDEVGFQVALDASGTRARAINVGKLGEGDLAVA
>WP_114332301.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_spp_14-2496-1d|31535
MNSNYVAIDGPDIPGPRDAASTGTNCSITTLSMDMIDSTSQRYSKSIMSSNTEFMKMYDVWSKHLDAAPFKFVEKFSGDGLHITADSMHAGEVINLAMAMLHEFEKLGATAGGQPMGEIDFQVRAGIATGPAFRFRPKHGGNVDHLSSVCDRSARLCSAASAQALFVDPHTVNAANMTLVTSPLGDMLSRTPDDYLGDKQSVQLKGIPEPVPYFELLWSKGRFGVRSSVVTEISSAPTPPPPGSPSPAAARSSTGSDQLIGTTKVWMSDRGYGFVTGHDGEDFHVSSRALLYGEDAEHLRTGTRVVFLPADAAEAGKSRRAFSVVLVGQEAEGRVSFVHAEKPFGFIAVADDHGRELRFHMVVDAALRTALKAGDEVGFQVALDASGTRARAINVGKLGEGDLAVA
>WP_196248497.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_A3b|31075
MNSNYVAIDGPDIPGPRDAASTGTNCSITTLSMDMTDSTPQRYSKSIMSSNTEFMKMYDVWSKHLDAAPFKFVEKFSGDGLHITADSMNAGEVINLAMAMLHEFEKLGATVGGQPMGEIDFQVRAGIATGPAFRFRPKHGGNVDHLSSVCDRSARLCSVASAQALFVDPHTVNAANMTQVTSPLGDMLNRAPDDYLGDKQSVQLKGIPEPVPYFELLWSKGRFGVRSSVVTEISSTPTPPPGAPSPSAARSSAGSDQLVGTTKVWMSDRGYGFVTGHDGEDFHVSSRALLYGEDAEHLRTGARVVFLPADAAEAGKSRRAFSVVLVGQEAEGRVSFIHAEKPFGFIAVGDDHGRELRFHMVVDAALRTALKAGDEVGFQVALDASGTRARAINVGKLGEGDLAVA
>WP_196248497.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_04-516|31076
MNSNYVAIDGPDIPGPRDAASTGTNCSITTLSMDMTDSTPQRYSKSIMSSNTEFMKMYDVWSKHLDAAPFKFVEKFSGDGLHITADSMNAGEVINLAMAMLHEFEKLGATVGGQPMGEIDFQVRAGIATGPAFRFRPKHGGNVDHLSSVCDRSARLCSVASAQALFVDPHTVNAANMTQVTSPLGDMLNRAPDDYLGDKQSVQLKGIPEPVPYFELLWSKGRFGVRSSVVTEISSTPTPPPGAPSPSAARSSAGSDQLVGTTKVWMSDRGYGFVTGHDGEDFHVSSRALLYGEDAEHLRTGARVVFLPADAAEAGKSRRAFSVVLVGQEAEGRVSFIHAEKPFGFIAVGDDHGRELRFHMVVDAALRTALKAGDEVGFQVALDASGTRARAINVGKLGEGDLAVA
>WP_196249598.1|csp22_consensus|Full_Seq|Rhodococcus|Rhodococcus_fascians_GIC36|31077
MNSNYVAIDGPDIPGPRDAASTGTNCSITTLSMDMTDSTPQRYSKSIMSSNTEFMKMYDVWSKHLDAAPFKFVEKFSGDGLHITADSMNAGEVINLAMAMLHEFEKLGATVGGQPMGEIDFQVRAGIATGPAFRFRPKHGGNVDHLSSVCDRSARLCSVASAQALFVDPHTVNAANMTQVTSPLGDMLNRAPDDYLGDKQSVQLKGIPEPVPYFELLWSKGRFGVRSSVVTEISSTPTPPPGAPSPSAARSSAGSDQLVGTTKVWMSDRGYGFVTGHDGEDFHVSSRALLYGEDAEHLRTGARVVFLPADAAEAGKSRRAFSVVLVGQEAEGRVSFIHAEKPFGFIAVGDDHGRELRFHMVVDAALRTALKAGDEVGFQVDLDASGTRARAINVGKLGEGDLAVA

The number of files for each heavily varies on the type, with the lowest being three sequences and the most being 3000K+. I don't think it is a size constraint entirely as Xanthomonas_Type1.fasta is OK with 3790 sequences. Let me know if there is any other info I can provide that is of use.

Thank you so much!! Danielle

HajkD commented 1 year ago

Hi Danielle,

Thank you very much for your detailed description!

Just to rule out the obvious case: Whenever you have comparisons with lower-bound sequences (towards 3 seqs), could it be that there are simply no hits found in these pairwise cases (given the threshold settings you choose for blasting)? If so, I think I don't have an error-capture routine implemented if no hits were found, because in such cases simply no blastresult_*.csv file will be generated.

If it's not this obvious case, then I will start digging now.

With many thanks and very best wishes, Hajk

DanielleMStevens commented 1 year ago

Apologies for the slow response. After further investigation, you are right. The low pairwise hits returned no file due to the default parameters requiring an E-value of 1e-5. Once I lower them (to 1e-1), I was able to return a file (which I wanted even if the values were low).

Moving forward, I think a simple fix to help future users would be to add to the original error a warning that lack of file may be due to low/no hits and to check if the default parameters are appropriate. Or something along those lines as the current error focuses on file path and write out functionality, which the latter it is but the way it is written, I don't think I would have thought to check the default parameters.

Thanks again for your quick help on this!!