b'BLAST options error: Please provide a database name using -out\n'

AnthonyRish12 commented 8 months ago

Hello, I am trying to use GCsnap to analyze a clustered .clans file that I prepared using MMseqs2 and CLANS from the MPI Bioinformatics Toolkit. The example test worked fine, but now I get stuck on step 2: Finding protein families (may take some time depending on the number of flanking sequences taken)

Command line: GCsnap -targets 4120184_1.clans -user_email hidden -ncbi_api_key hidden -get_taxonomy True -operon_cluster_advanced True -get_pdb True -get_functional_annotations True -interactive True

... Doing all against all searches with psiblast
... ... Making BLAST database
b'BLAST options error: Please provide a database name using -out\n'
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/GCsnap/bin/GCsnap", line 8, in <module>
    sys.exit(main())
  File "/Users/user/anaconda3/envs/GCsnap/lib/python3.8/site-packages/gcsnap/GCsnap.py", line 4319, in main
    all_syntenies, protein_families_summary = find_and_add_protein_families(all_syntenies, out_label = out_label, num_threads = n_cpus, num_alignments = num_alignments, max_evalue = max_evalue, num_iterations = num_iterations, blast = blast, mmseqs = mmseqs, min_coverage = min_coverage, default_base = default_base, tmp_folder = tmp_folder, method = method)
  File "/Users/user/anaconda3/envs/GCsnap/lib/python3.8/site-packages/gcsnap/GCsnap.py", line 3745, in find_and_add_protein_families
    distance_matrix, ordered_ncbi_codes = compute_all_agains_all_distance_matrix(in_syntenies, out_label = out_label, num_threads = num_threads, num_alignments = num_alignments, max_evalue = max_evalue, num_iterations = num_iterations, min_coverage = min_coverage, method = method, mmseqs = mmseqs, blast = blast, default_base = default_base, tmp_folder = tmp_folder)
  File "/Users/user/anaconda3/envs/GCsnap/lib/python3.8/site-packages/gcsnap/GCsnap.py", line 566, in compute_all_agains_all_distance_matrix
    sequences_database = make_blast_database_from_fasta(flanking_fasta, blast = blast)
  File "/Users/user/anaconda3/envs/GCsnap/lib/python3.8/site-packages/gcsnap/GCsnap.py", line 433, in make_blast_database_from_fasta
    if 'BLAST engine error' not in stderr:
TypeError: a bytes-like object is required, not 'str'

It seems as though it wants me to change the -out file of the psiblast command when I run GCsnap. I have tried to update all of the -out_labels\format settings for GCsnap but this doesn't fix the issue.

I have installed all of the packages listed on the main page as well as ncbi-blast-2.15.0+ for blastp and psiblast.

Any help would be appreciated since I am not an experienced python user.
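
A note on the last line of the traceback, for readers who hit the same thing: the `TypeError` is separate from the BLAST message itself. The `b'...'` prefix in the printed message shows that the captured stderr is a `bytes` object, and Python 3 refuses to test a `str` for membership in `bytes`. The sketch below reproduces that behaviour and shows one common way around it (decoding before comparing). It is not GCsnap's code: `run_and_capture` and `fake_cmd` are hypothetical names, and a child Python process stands in for the real BLAST call so the example runs without BLAST installed.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return (stdout, stderr) decoded to str."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return (proc.stdout.decode(errors="replace"),
            proc.stderr.decode(errors="replace"))


# Hypothetical stand-in for the database-building step: a child process
# that writes the message from this thread to stderr.
fake_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write("
    "'BLAST options error: Please provide a database name using -out\\n')",
]

stdout, stderr = run_and_capture(fake_cmd)

# With decoded text, a str-in-str membership test works as intended.
engine_error = "BLAST engine error" in stderr
options_error = "BLAST options error" in stderr

# The same test against raw bytes raises the TypeError seen in the traceback.
raw_stderr = stderr.encode()
try:
    "BLAST engine error" in raw_stderr
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

Decoding only fixes the comparison; the underlying "Please provide a database name using -out" message would still need its own explanation.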

JoanaMPereira commented 8 months ago

Dear Anthony,

Thank you for using GCsnap and reporting this issue.

Unfortunately, I was not able to reproduce it. So that we can find why you get this error, I have some questions:

Did you set up GCsnap as described in the Installation section of the README? This should create a python environment where all required modules (including blast) are compatible with the GCsnap workflow. If you installed the modules over a pre-existing python environment, this could be the reason. If that is the case, I would suggest setting up a GCsnap-specific environment as described in the README.
Did you try to use the mmseqs method (using the -all-against-all_method option)? If you use mmseqs, do you get a similar/related error?

Best wishes, Joana

AnthonyRish12 commented 8 months ago

Hi Joana,

I got it to work by adding my clans file directly to the directory where GCsnap was installed on my computer. Previously I was trying to grab the file from my Documents directory. I did not use the mmseqs method, only the default psiblast so far. However, now I have new issues. The first isn't really that important; I just want to know if it can be fixed.

Error 1 (for a 1000+ sequence alignment/clustered file):

Making operon/genomic_context blocks figure

... Images not created due to minor errors (likely they are too big)

Error 2:

Making interactive html output file "... Making summary page

... Making per operon type page
... ... GC Type -0001
... ... GC Type 00000
... ... GC Type 00001
... ... GC Type 00002
... ... GC Type 00003
... ... GC Type 00004
... ... GC Type 00005
... ... GC Type 00006
... ... GC Type 00007
... ... GC Type 00008
... ... GC Type 00009
... ... GC Type 00010
... ... GC Type 00011

Finished 4216020: Writting summary table"

Group 0000 represents the main cluster with over 700 sequences and includes my proteins of interest, however, the html file just leads to a blank webpage. All of the other html files work properly and the summary html file works and shows that Group 0000 has conserved domains. I have deleted all of the files to rerun the command multiple times and every time the Group 0000 html file (14.5 MB) leads to an empty webpage. Also, even when I use "-print_color_summary True" there is no legend or color summary like in some of the example images and movie. Is that no longer a feature?

JoanaMPereira commented 8 months ago

Hi Anthony,

I am happy that is solved :)

Regarding the other 2 issues:

Error 1: GCsnap prints this message when the input set is large. In that case it skips the static figure and generates only the interactive output. This is not a bug but a feature: very large input sets led to very large images with too many layers, which are both too heavy and hard to visualise.

Error 2: This is a problem we are aware of and are working on ways to fix. It also happens because the data set is too large. You may see the same with the main summary html file.

In the meantime, while we are working on fixing it, my suggestion is to either (1) use a reduced input set or (2) run GCsnap for each cluster in your clans map individually.
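
For suggestion (1), a minimal sketch of one way to build a reduced input set, assuming the targets are available as a flat list of identifiers. The helper name `subsample_targets`, the placeholder identifiers and the cutoff of 200 are all made up for illustration; nothing here is part of GCsnap.

```python
import random


def subsample_targets(ids, max_targets, seed=0):
    """Return at most max_targets identifiers, chosen reproducibly.

    Duplicates are dropped and the original order is preserved.
    """
    unique = list(dict.fromkeys(ids))  # de-duplicate, keep first occurrence
    if len(unique) <= max_targets:
        return unique
    rng = random.Random(seed)  # fixed seed so reruns pick the same subset
    chosen = set(rng.sample(range(len(unique)), max_targets))
    return [name for i, name in enumerate(unique) if i in chosen]


# Placeholder identifiers standing in for a real target list.
all_ids = [f"HYPOTHETICAL_{i:04d}" for i in range(1000)]
reduced = subsample_targets(all_ids, 200)
```

A random subsample may drop the specific proteins of interest, so it can be worth adding those back to the reduced list by hand before rerunning.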

AnthonyRish12 commented 8 months ago

Hi Joana,

Thank you for clarifying. I had a feeling that was the issue. Here's to hoping you guys can fix this issue in the future. As a biochemist/structural biologist, I think this is a really cool program that gives non-bioinformatics or non-genetics people the chance to do genome analysis!

Best wishes, Anthony

AnthonyRish12 commented 8 months ago

In case other people have this issue in the future, you can open larger html files in Safari more easily than in some other common web browsers, such as Google Chrome. This is especially true for html files that require more than 1GB of memory to visualize.

JoanaMPereira / GCsnap

b'BLAST options error: Please provide a database name using -out\n' #5