The total number of locks exceeds the lock table size at /software/orthomclSoftware-v2.0.9/bin/orthomclPairs line 70

wuxiaopei0509 commented 4 years ago

Hi @apetkau, sorry to bother you again. The pipeline stopped at the orthomclPairs stage. I have 13 species, each with about 25,000 to 50,000 proteins. How should I set innodb_buffer_pool_size? It is currently 128M. Thank you.

apetkau commented 4 years ago

Hello @wuxiaopei0509.

The innodb_buffer_pool_size is set in your MySQL/MariaDB/database configuration (https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size).

Unfortunately, I cannot tell you what size to set it to, beyond saying that it should be increased. Finding a value that works will likely take some trial and error.

wuxiaopei0509 commented 4 years ago

Thank you. I have changed the size and am running it again now.

wuxiaopei0509 commented 4 years ago

Hi @apetkau, I changed innodb_buffer_pool_size to 2G and the run completed successfully. Now I have another question about the blast parameters in orthomcl-pipeline.conf: blast: F: 'm S' b: '100000' e: '1e-5' v: '100000'. What does the F mean? Thank you very much!

wuxiaopei0509 commented 4 years ago

Hi @apetkau, sorry to bother you. I have another question. The result of the pipeline is groups.txt. I want to find the orthomclSingletons, i.e. the genes that are not contained in any group. How can I obtain them? Thank you!

apetkau commented 4 years ago

Hello @wuxiaopei0509. Sorry, I must have missed these questions.

F: 'm S' will run blast with -F 'm S', which enables masking/filtering:

Also known as filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence. https://www.ncbi.nlm.nih.gov/books/NBK62051/#masking

Some of these parameters originally came from recommendations by the authors of OrthoMCL:

Run all‐versus‐all BLASTP with goodProteins.fasta as the BLAST database and subject sequences.

Input:

goodProteins.fasta

Output:

your_blast_results_in_tab_format

This document does not provide assistance on the details of running BLASTP. For large datasets you should consider gaining access to a compute cluster. When you do so, you will need to: (1) use NCBI BLAST; (2) run with the -m 8 option to provide tab delimited output required by step 8.

Use these options:

-F 'm S' -v 10000 -b 10000 -z db_size -e 1e-5 -m 8

where: -F 'm S' signifies “mask with Seg”; -v 10000 is a “don't care” value; -b 10000 is a “don't care” value; -z db_size is the number of proteins in the set (see “Incrementally add a genome” below); and -e 1e-5 is the recommended e-value. https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/0471250953.bi0612s35
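To make the mapping from the blast section of orthomcl-pipeline.conf to command-line flags concrete, here is a minimal Python sketch. It only illustrates the idea that each key becomes a single-letter flag of the legacy blastall program; the helper name blast_flags is hypothetical, the way the pipeline actually assembles its command is an assumption, and the database (-d) and query (-i) arguments are left out.

```python
import shlex


def blast_flags(params):
    """Turn a mapping such as {'F': 'm S'} into ['-F', 'm S'] style arguments.

    Hypothetical helper for illustration only; it is not taken from
    orthomcl-pipeline.
    """
    args = []
    for key in sorted(params):
        args.extend([f"-{key}", str(params[key])])
    return args


# Values as quoted in the question above.
params = {"F": "m S", "b": "100000", "e": "1e-5", "v": "100000"}
flags = blast_flags(params)

# -p blastp selects protein BLAST and -m 8 requests tabular output.
# shlex.join (Python 3.8+) quotes 'm S' so it stays a single argument.
print(shlex.join(["blastall", "-p", "blastp", "-m", "8"] + flags))
```

Quoting matters here because the value of -F contains a space: 'm S' has to reach blast as one argument.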

For your second question, you could try using grep to compare the gene identifiers in the original input FASTA files against groups.txt: any identifier that does not appear in groups.txt is a singleton. Starting from the input identifiers also makes sure nothing is missed. I do not have any more specific advice on how to get this working, though.
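As an alternative to grep, the same comparison can be sketched in Python. This is only a sketch under some assumptions: each line of groups.txt is taken to look like "group_id: id1 id2 ...", and the first word of each FASTA header is taken to match the identifiers used in groups.txt exactly (for example taxon|gene). The function names are hypothetical, and if your input FASTA headers differ from the identifiers in groups.txt you would need to adjust read_fasta_ids.

```python
def read_grouped_ids(groups_path):
    """Collect every protein identifier that appears in any group."""
    grouped = set()
    with open(groups_path) as handle:
        for line in handle:
            # Assumed line format: "group_id: id1 id2 id3 ..."
            _, _, members = line.partition(":")
            grouped.update(members.split())
    return grouped


def read_fasta_ids(fasta_path):
    """Return the first word of each FASTA header line, in file order."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.append(fields[0])
    return ids


def find_singletons(fasta_paths, groups_path):
    """List identifiers present in the FASTA files but in no group."""
    grouped = read_grouped_ids(groups_path)
    return [
        protein_id
        for path in fasta_paths
        for protein_id in read_fasta_ids(path)
        if protein_id not in grouped
    ]
```

For example, find_singletons(["goodProteins.fasta"], "groups.txt") would return the identifiers from that FASTA file that are not listed in any group, in the order they appear in the file.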

apetkau / orthomcl-pipeline

The total number of locks exceeds the lock table size at /software/orthomclSoftware-v2.0.9/bin/orthomclPairs line 70 #38