carolzhou / multiPhATE

Throughput PhATE processing of draft or finished phage genomes
14 stars 7 forks source link

default parameters seem to persist over user specified ones for blastn #15

Closed Calvinblue closed 4 years ago

Calvinblue commented 4 years ago

Hi Carol,

So I had a look in phate_sequenceAnnotation_main.log, and I was wondering why there was such high e values and identity min/select for blastn, as the identity was not the one specified in the .config file and the e values are very high. On the other hand blastp values seemed normal, as you can see in the following extract of the log file

Creating a blast object 2020-01-25 21:38:52.799866 Preparing to run blastp at the following settings: Parameters: blast flavor blastp identity min 50 evalue min 0.01 identity select: 70 evalue select: 0.01 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 Creating an hmm object 2020-01-25 21:38:52.800219 Preparing to run blastn at the following settings: Parameters: blast flavor blastn identity min 20 evalue min 10.0 identity select: 20 evalue select: 10.0 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 2020-01-25 21:38:52.800251 Preparing to run blastn at the following settings: Parameters: blast flavor blastn identity min 20 evalue min 10.0 identity select: 20 evalue select: 10.0 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 Running Blast against phage genome database(s)...

I had a look into phate_sequenceAnnotation_main.py and noticed there where some discrepancies in the way the different flavors of blast are called :

genome blast parameters are :

'identityMin' : GENOME_IDENTITY_MIN, # Should not be hard coded, but should be a low-stringency setting 'identitySelect' : GENOME_IDENTITY_SELECT, # this one too 'evalueMin' : EVALUE_MIN, 'evalueSelect' : EVALUE_SELECT,

gene blast parameters are : myParamSet = { 'identityMin' : blastpIdentity, # Setting as per blastp, for now 'identitySelect' : blastpIdentity, # this one too 'evalueMin' : 10, 'evalueSelect' : 10,

whereas protein blast parameters are :

myParamSet = { 'identityMin' : int(blastpIdentity), 'identitySelect' : int(blastpIdentity), 'evalueMin' : EVALUE_MIN, 'evalueSelect' : EVALUE_SELECT,

So I edited the file by using the same parameters for blastn and blastp, and it worked for the % identity but not for the evalue, as you can see in the following log file

Creating a blast object 2020-01-25 22:46:30.244293 Preparing to run blastp at the following settings: Parameters: blast flavor blastp identity min 50 evalue min 0.01 identity select: 70 evalue select: 0.01 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 Creating an hmm object 2020-01-25 22:46:30.244639 Preparing to run blastn at the following settings: Parameters: blast flavor blastn identity min 60 evalue min 10.0 identity select: 60 evalue select: 10.0 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 2020-01-25 22:46:30.244672 Preparing to run blastn at the following settings: Parameters: blast flavor blastn identity min 60 evalue min 10.0 identity select: 60 evalue select: 10.0 topHitCount: 3 scoreEdge: 0.1 overhang 0.1 outputFormat: 5 Running Blast against phage genome database(s)...

In the end I manually changed the evalue min and select to 0.01 as a temporary solution (which seemed to work). I do not really understand python code but I believe this part of the code is not working as you intended it to work. Could you please give me advice on this ?

Thanks in advance!

Best,

Antoine

carolzhou commented 4 years ago

This issue has been repaired. The user's value for percent identity specified in the multiphate config file is carried through and used to filter low-identity blastn (gene) and blastp (protein) hits. The genome hits are set at a minimum of 20% and not configurable. The phate_sequenceAnnotation_main.log file now reports blast object settings at the correct time, thus reporting the values that have been set just prior to running blast.

Calvinblue commented 4 years ago

Thank you for this update :)