sarapita commented 2 years ago

Dear John,

I’m using unitig-counter/1.1.0 and then feeding the output into pyseer for LMM-based association analyses. However, I’ve noticed I get different number of significant unitigs when using different values for —nb-cores when running unitig-counter. I then noticed the number of lines in the output of the unitigs file differs in row number (e.g. when using —nb-cores 1 vs 4). Is this occurrence reproducible on your end? If so, is the issue due to parralelization? Is it safest to stick to —nb-cores 1?

Thanks in advance, Sara

johnlees commented 2 years ago

It might be more informative to check exactly which sequences differ between the two rather than just the counts.

But I am not sure I'm afraid, but perhaps sticking to one core might be safest in that case.

sarapita commented 2 years ago

Thanks for your prompt reply!

So I looked a bit more into it and I realise what I posted before isn’t quite right; sorry about this. So I decided to give a bit more context.

I executed 4 runs of unitig-counter (I encounter the same issue when using unitig-caller instead) with —nb-cores 1,2,3 & 4, respectively. I get the same number of rows in the unitigs output file (contrary to what I said in my first post) and same number of unique unitigs (see below), but the unique unitigs are different across the 4 runs.

total unique unitigs

for threads in {1..4}; do zcat count_unitigsthreads${threads}/unitigs.txt.gz | awk '{print $1}' | sort | uniq | wc -l;done 353548 353548 353548 353548

For example, between the run with 1 thread and the run with 2 threads, 311607 unitigs were common in both runs and 41941 reads were reported on the 1-thread run that were not found in the 2-thread run and vice-versa.

anti_join(thread1,thread2) %>% dim Joining, by = "unitigs" [1] 41941 1

inner_join(thread1,thread2) %>% dim Joining, by = "unitigs" [1] 311607 1

I didn't go on to assess how "different" these unitigs were yet, as you alluded to in your previous post.

When inputting unitigs from these different runs into pyseer for a population structure adjusted llm (by employing a similarity matrix), this translated into identical bonferroni significance thresholds, but 4 different similarity matrices, and number of significant hits, especially when going from a 1-thread run to a 2-threaded run (see table below).

wc -l count_unitigsthreads*/significant_unitigs.txt 860 count_unitigs_threads_1/significant_unitigs.txt 53 count_unitigs_threads_2/significant_unitigs.txt 98 count_unitigs_threads_3/significant_unitigs.txt 103 count_unitigs_threads_4/significant_unitigs.txt 1114 total

This is a bit concerning given, for example, only 6 significant unitigs of the 52 found in a 2-thread run are successfully grepped from the significant unitigs in the 1-thread run.

If you have any insight into why this might be happening, I'd appreciate it. Initially, I thought this was a threading issue where not everything was being merged back into the final table, but I wonder now if I might be missing something about the method itself.

I initially ran unitig counting on 2 cores, so in the meantime will see if the annotation hits are similar if I switch to using 1-core.

bacpop / unitig-counter

nb-cores tuning different total row numbers #17

total unique unitigs