mherold1 opened 2 years ago
Hi @mherold1! Which version of MMseqs2 are you using? I could try to reproduce your error if you share the clu_seqDB (or a subset of it, if it is too big).
All the best,
Chiara
Hi and thanks for the quick reply.
I tried to follow the installation script and I'm using the following version of MMseqs2:
MMseqs Version: 2f1db01c5109b07db23dc06df9d232e82b1b4b99-MPI
I attached my mmseqs_clustering
directory:
mmseqs_clustering.tar.gz
I was using the test dataset: https://ndownloader.figshare.com/files/25473332
Best regards, Malte
Hi, I found the problem. No threads threshold was set for the clu_seqDB creation (now fixed: https://github.com/functional-dark-side/agnostos-wf/blob/b649044d359b9a43a2b4194e6d77661206000549/db_creation/rules/mmseqs_clustering_results.smk#L45). The number of threads determines how many DB files MMseqs creates, and therefore how many files then have to be concatenated into a single DB. On our testing cloud, the maximum number of threads used by MMseqs happened to equal the default number of threads specified in the rule that concatenates the files; in your case it did not, causing some of the clu_seqDB files to be left out of the final DB.
To avoid rerunning the entire rule, you can recreate the clu_seqDB:
and re-concatenate the files:
This will not affect the results and the other rules.
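The two commands referred to above were not captured in this thread. A minimal sketch of what they might look like, assuming 28 threads, `DB.<i>`-style split-file naming, and placeholder DB names (the `createseqfiledb` call mirrors the one from the rule; `concat_splits` is a hypothetical helper, and the workflow's own merge step may also need to handle the matching .index files):

```shell
#!/usr/bin/env bash
set -euo pipefail

THREADS=28  # assumption: must match the thread count used when the splits were created

# 1) Recreate the per-thread splits of clu_seqDB (commented out here because it
#    needs mmseqs on PATH and the real DB paths):
# mmseqs createseqfiledb seqDB cluDB clu_seqDB --threads "${THREADS}"

# 2) Re-concatenate ALL splits into one file. The bug was that only the first
#    <default-threads> splits were merged; looping over every suffix 0..N-1
#    avoids leaving any split out.
concat_splits() {
  local db="$1" n="$2" out="$3"
  local i
  : > "${out}"                      # truncate/create the output file
  for ((i = 0; i < n; i++)); do
    cat "${db}.${i}" >> "${out}"    # append each split in order
  done
}

# Example (placeholder names):
# concat_splits clu_seqDB "${THREADS}" clu_seqDB_full
```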
Let me know if it works!
Best regards,
Chiara
Thanks!
Changing L45 to:
{params.mmseqs_bin} createseqfiledb {params.seqdb} {params.cludb} {params.cluseqdb} --threads {threads} 2>{log.err}
solved this for me.
Actually, I had a very similar issue before, when the number of threads for the rule mmseqs_clustering (set to 28 in the rule) was not the same as the number of threads in mmseqs_clustering_results (set in config.yaml).
In general, I am a bit confused about how many resources to provide for each step, and what the hierarchy is when, for certain steps, the number of threads is defined both in config/cluster.yaml and in the rule itself. I set all the thread parameters and values in the cluster config file back to the defaults, which helped somewhat, until the cluster_classification step, which either takes very long or fails at this point:
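On the hierarchy question: agnostos-wf rules are Snakemake rules, and standard Snakemake behavior resolves thread counts roughly as sketched below (a hedged summary, not anything specific to this workflow; the invocation at the end is a placeholder):

```shell
# Rough precedence of thread settings in a Snakemake workflow:
#
# 1. `threads:` inside a rule is that rule's request; at run time Snakemake
#    caps it at the core budget given via --cores/-j, and that capped value
#    is what {threads} expands to in the rule's shell command.
# 2. config/cluster.yaml entries are substituted into the --cluster
#    submission command (e.g. the CPU count requested from the scheduler);
#    they do NOT change {threads} inside the rule.
# 3. If the two disagree, a job can launch with more --threads than the
#    scheduler allocated (oversubscribing the node) or fewer (wasting the
#    allocation), so the two should be kept aligned.
#
# Example invocation keeping them aligned (placeholder values):
# snakemake -j 28 --cluster-config config/cluster.yaml \
#   --cluster "sbatch -c {cluster.threads}" ...
```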
scripts/mmseqs_double_search.sh --search /project/home/p200005/agnostos-wf/bin/mmseqs --mpi_runner 'srun --mpi=pspmi' --ltmp /project/scratch/p200005/tmp --cons /project/scratch/p200005/agnostos_test/db_creation3_default/cluster_classification/refined_not_annotated_cluster_cons.fasta --db_target /project/home/p200005/agnostos-wf/databases/uniref90.db --db_info /project/home/p200005/agnostos-wf/databases/uniref90.proteins.tsv.gz --evalue_filter scripts/evalue_filter.awk --evalue_threshold 0.6 --hypo_threshold 1.0 --hypo_patterns scripts/hypothetical_grep.tsv --grep rg --output /project/scratch/p200005/agnostos_test/db_creation3_default/cluster_classification/noannot_vs_uniref90.tsv --outdir /project/scratch/p200005/agnostos_test/db_creation3_default/cluster_classification --threads 28
Hello,
I have been trying to get the db_creation workflow to run, and I am stuck at the step cluster_compositional_validation. In the log file (logs/cval_stderr.err) it reads:

The size of "request offset" varied between the different test runs that I made. I can run mmseqs view on the database clu_seqDB without error. In the output folder compositional_validation there are several tmp index files, most ending in .0, except comp_valDB_tmp_0_tmp_0.index, and the one mentioned in the log file is missing:

I tried to look into the script compositional_validation.sh, but I cannot find what could be going wrong at this step. Any help would be appreciated.
Best regards.
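Since the eventual diagnosis in this thread turned out to be split files left out of a merged DB, one quick sanity check is whether a DB's numeric split suffixes are contiguous. A hypothetical helper (the prefix and count in the usage line are placeholders, not the workflow's actual names):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check: given a file prefix and a split count N, report which
# numeric suffixes 0..N-1 are missing on disk.
missing_splits() {
  local prefix="$1" n="$2"
  local i
  for ((i = 0; i < n; i++)); do
    [ -e "${prefix}${i}" ] || echo "missing: ${prefix}${i}"
  done
}

# Usage (placeholder path and count):
# missing_splits compositional_validation/comp_valDB_tmp_0_tmp_ 28
```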