Open leoisl opened 6 years ago
If hawk ran correctly and case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are empty, that probably means none of the k-mers had low enough p-value to pass Bonferroni correction even before correcting for population structure. Is that possible?
If it's an organism with small genome, it may be worth trying again with smaller value of CUTOFF in line 3 in kmer.h https://github.com/atifrahman/HAWK/blob/master/kmer.h
Hello,
thanks for the support on solving this. I was able to verify that even case_out_wo_bonf.kmerDiff
and control_out_wo_bonf.kmerDiff
are empty, so consequently case_out_w_bonf.kmerDiff
and control_out_w_bonf.kmerDiff
are empty. However, gwas_eigenstratX.geno
and gwas_eigenstratX.snp
contain each 260k lines. Due to this, I am suspecting my execution is not entering this if
: https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L646 . I will try to debug this, if you have any idea, I would appreciate it!
Thanks a lot!
Hello again,
I got some debug values:
sigLevel = 0.05
CUTOFF = 100000000
pValThreshold = 5e-10
ht->kmers[i][j]->pVal = 0.993234
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = y
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.288889
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.731537
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.527426
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = y
ht->kmers[i][j]->significanceType = p
ht->kmers[i][j]->pVal = 0.731537
pValThreshold = 5e-10
ht->kmers[i][j]->forPval = n
ht->kmers[i][j]->significanceType = a
ht->kmers[i][j]->pVal = 0.288889
pValThreshold = 5e-10
So, sometimes ht->kmers[i][j]->forPval == 'n'
, in which case we don't enter the if
. I don't know why this happens. But I think the most worrying is that it seems all the p-values are way higher than pValThreshold
, so all kmers are probably being filtered out by the this variable. pValThreshold
is very, very small because we divide sigLevel
by the CUTOFF
at https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L724 . Maybe this is related to your last answer... I am working with bacterial genomes, so they don't go over 10Mbp. In addition, I am using contigs, not raw reads (I remove the part in jellyfish command-line that removes low-frequency k-mers due to this). Do you think that setting CUTOFF = 1
, thus letting pValThreshold = 0.05
is an option? Does this make sense in HAWK?
Thanks a lot!
In this case, you can try setting CUTOFF=100000
. 1
is too low since Bonferroni correction is performed anyway. If it still doesn't work, we're probably underpowered (the Poisson assumption is not really true for contig data).
Hello,
I tried CUTOFF=100000
but no success... There were 15 k-mers in case_out_wo_bonf.kmerDiff
but 0 in control_out_wo_bonf.kmerDiff
, so the pipeline still bugged when executing the R
script. I am not sure if it helps, but I was able to successfully run HAWK v0.8.3-beta on ~10 bacterial datasets, but it is kind of hard to pinpoint what should I change in HAWK v0.9.8-beta to be able to run it as well, since there were lots of changes: https://github.com/atifrahman/HAWK/compare/0.8.3...master. Do you have any other suggestions? Maybe my data isn't fit for HAWK v0.9.8-beta (as you said, we're probably underpowered (the Poisson assumption is not really true for contig data)), and I should run HAWK v0.8.3-beta instead? Does HAWK v0.8.3-beta also assume Poisson assumption?
Thanks again!
There actually isn't a whole lot of difference between the hawk.cpp
in versions 0.9.8 and 0.8.3 with regard to case_out_wo_bonf.kmerDiff
and control_out_wo_bonf.kmerDiff
. HAWK v0.8.3-beta also assumes Poisson distribution. We hadn't thought of k-mer data from contigs previously but we will look into it.
Dear Atif Rahman,
I took a detailed look at the diff of hawk.cpp
between the two versions, while debugging the code on the datasets I couldn't make HAWK v0.9.8-beta
work, and I reached the conclusion that the main reason for HAWK v0.9.8-beta
being unable to output significant kmers in these specific datasets is because many of them were being filtered out on the following condition: https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L503-L510 . By removing this condition (i.e. ht->kmers[i][j]->forPval='y';
is always executed), I am able to retrieve significant kmers.
However, I am not sure if removing this condition makes sense for the method. Is it somehow related to the minimum allele frequency (maf)? We usually use a maf of 1% when running similar tools on our datasets, and would this translate to replacing the condition on line https://github.com/atifrahman/HAWK/blob/master/hawk.cpp#L503 to if(presentRatio>=0.01 && presentRatio<=0.99)
or not? If not, could you please explain what this condition does and which values would you propose to make it less stringent?
Thank you again!
Kind regards.
Hello,
I tried running Hawk, but I got stuck when running the
runHawk
script. I reran therunHawk
script with the-eux
bash flags (bash -eux ./runHawk
) in order to check what is happening in each step and where is the error. The log file is here: HAWK_error.txtThe command-line that fails is
Rscript ~/HAWK/hawk-0.9.8-beta/log_reg_case.R
. However, although~/HAWK/hawk-0.9.8-beta/hawk 47 70
seems to work, I could verify thatcase_out_w_bonf.kmerDiff
andcontrol_out_w_bonf.kmerDiff
are both empty, and it seems this is the reason that the Rscript fails.Could I have any help on this?
Thank you in advance. Kind regards.