Error running runHawk script

leoisl commented 6 years ago

Hello,

I tried running Hawk, but I got stuck when running the runHawk script. I reran the runHawk script with the -eux bash flags (bash -eux ./runHawk) in order to check what is happening in each step and where is the error. The log file is here: HAWK_error.txt

The command-line that fails is Rscript ~/HAWK/hawk-0.9.8-beta/log_reg_case.R. However, although ~/HAWK/hawk-0.9.8-beta/hawk 47 70 seems to work, I could verify that case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are both empty, and it seems this is the reason that the Rscript fails.

Could I have any help on this?

Thank you in advance. Kind regards.

atifrahman commented 6 years ago

If hawk ran correctly and case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are empty, that probably means none of the k-mers had low enough p-value to pass Bonferroni correction even before correcting for population structure. Is that possible?

atifrahman commented 6 years ago

If it's an organism with small genome, it may be worth trying again with smaller value of CUTOFF in line 3 in kmer.h https://github.com/atifrahman/HAWK/blob/master/kmer.h

leoisl commented 6 years ago

Hello,

thanks for the support on solving this. I was able to verify that even case_out_wo_bonf.kmerDiffand control_out_wo_bonf.kmerDiff are empty, so consequently case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are empty. However, gwas_eigenstratX.geno and gwas_eigenstratX.snp contain each 260k lines. Due to this, I am suspecting my execution is not entering this if: https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L646 . I will try to debug this, if you have any idea, I would appreciate it!

Thanks a lot!

leoisl commented 6 years ago

Hello again,

I got some debug values:

sigLevel = 0.05
CUTOFF = 100000000
pValThreshold = 5e-10
ht->kmers[i][j]->pVal = 0.993234
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = y
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.288889
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.731537
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.527426
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = y
ht->kmers[i][j]->significanceType = p
ht->kmers[i][j]->pVal = 0.731537
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.288889
pValThreshold = 5e-10

So, sometimes ht->kmers[i][j]->forPval == 'n', in which case we don't enter the if. I don't know why this happens. But I think the most worrying is that it seems all the p-values are way higher than pValThreshold, so all kmers are probably being filtered out by the this variable. pValThreshold is very, very small because we divide sigLevel by the CUTOFFat https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L724 . Maybe this is related to your last answer... I am working with bacterial genomes, so they don't go over 10Mbp. In addition, I am using contigs, not raw reads (I remove the part in jellyfish command-line that removes low-frequency k-mers due to this). Do you think that setting CUTOFF = 1, thus letting pValThreshold = 0.05 is an option? Does this make sense in HAWK?

Thanks a lot!

atifrahman commented 6 years ago

In this case, you can try setting CUTOFF=100000. 1 is too low since Bonferroni correction is performed anyway. If it still doesn't work, we're probably underpowered (the Poisson assumption is not really true for contig data).

leoisl commented 6 years ago

Hello,

I tried CUTOFF=100000 but no success... There were 15 k-mers in case_out_wo_bonf.kmerDiff but 0 in control_out_wo_bonf.kmerDiff, so the pipeline still bugged when executing the R script. I am not sure if it helps, but I was able to successfully run HAWK v0.8.3-beta on ~10 bacterial datasets, but it is kind of hard to pinpoint what should I change in HAWK v0.9.8-beta to be able to run it as well, since there were lots of changes: https://github.com/atifrahman/HAWK/compare/0.8.3...master. Do you have any other suggestions? Maybe my data isn't fit for HAWK v0.9.8-beta (as you said, we're probably underpowered (the Poisson assumption is not really true for contig data)), and I should run HAWK v0.8.3-beta instead? Does HAWK v0.8.3-beta also assume Poisson assumption?

Thanks again!

atifrahman commented 6 years ago

There actually isn't a whole lot of difference between the hawk.cpp in versions 0.9.8 and 0.8.3 with regard to case_out_wo_bonf.kmerDiff and control_out_wo_bonf.kmerDiff. HAWK v0.8.3-beta also assumes Poisson distribution. We hadn't thought of k-mer data from contigs previously but we will look into it.

leoisl commented 6 years ago

Dear Atif Rahman,

I took a detailed look at the diff of hawk.cpp between the two versions, while debugging the code on the datasets I couldn't make HAWK v0.9.8-beta work, and I reached the conclusion that the main reason for HAWK v0.9.8-beta being unable to output significant kmers in these specific datasets is because many of them were being filtered out on the following condition: https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L503-L510 . By removing this condition (i.e. ht->kmers[i][j]->forPval='y'; is always executed), I am able to retrieve significant kmers.

However, I am not sure if removing this condition makes sense for the method. Is it somehow related to the minimum allele frequency (maf)? We usually use a maf of 1% when running similar tools on our datasets, and would this translate to replacing the condition on line https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L503 to if(presentRatio>=0.01 && presentRatio<=0.99) or not? If not, could you please explain what this condition does and which values would you propose to make it less stringent?

Thank you again!

Kind regards.

atifrahman / HAWK

Error running runHawk script #5