atifrahman / HAWK

Hitting associations with k-mers
GNU General Public License v3.0
46 stars 20 forks source link

log_reg_case.R & log_reg_control.R execution halted #6

Closed kfletcher88 closed 5 years ago

kfletcher88 commented 5 years ago

Hi, Thanks for this exciting repository.

I was using the runHAWK script on some of my own data and the provided e.coli examples. I run in to the same error on both data sets at the loop https://github.com/atifrahman/HAWK/blob/master/log_reg_case.R#L66-L109

It always exits on

Error in read.table(con, nrow = CHUNK_SIZE) : no lines available in input
Execution halted

However the files pvals_case_top.txt and pvals_control_top.txt do have 200,000 lines each, so I assume I am ok to continue with this error message?

wc -l pvals_control_top.txt
200000 pvals_control_top.txt

wc -l pvals_case_top.txt
200000 pvals_case_top.txt

If so, I have another query. I am not sure if the execution of converToFasta is correct. After running runHAWK case_kmers.fasta contained 5664 k-mers while control_kmers.fasta contained 0. Is this correct? I didn't recieve any error messages suggesting that is terminated prematurely and the rm commands at the end of runHAWK were executed.

grep -c '^>' case_kmers.fasta
5664
grep -c '^>' control_kmers.fasta
0

These numbers don't seem in line with what is significant in the two pvals*top_merged_sorted.txt files.

awk '$1 < 0.05' pvals_case_top_merged_sorted.txt | wc -l
181402
awk '$1 < 0.05' pvals_control_top_merged_sorted.txt | wc -l
144012

Finally, is there a reason that only the top 200,000 kmers from each *out_w_bonf_sorted.kmerDiff are used? Would there ever be a situation where you think this should be increased or decreased?

Thanks in advance for you help!

atifrahman commented 5 years ago

Yes, it's fine to continue with the message.

It's likely that results of convertToFasta are correct. They are after Bonferroni correction and not just the ones smaller than 0.05. No entries in control_kmers.fasta could be due to small number of control samples or the phenotype due to presence of a gene in the cases.

No particular reason. If number of k-mers found significant after the R script is very high (close to 200,000) then 200,000 should be increased. It can be decreased to save time.

kfletcher88 commented 5 years ago

Thanks for the clarification!