BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Multithread Usage Bug #12

Closed Austin-s-h closed 4 years ago

Austin-s-h commented 5 years ago

Hello! Thanks for publishing this tool; I look forward to comparing it with the other clustering methods I have tried. I ran into two difficulties while using it.

First of all, when using more than one processor (e.g. -np 8), I run into a piping error (screenshot attached). The pipeline hangs at 80% completion of seed clusters production for 5+ minutes, at which point I killed it (second screenshot). However, running clust on the exact same data with -np 1 fixes this issue. I am testing it on an Ubuntu 18.04 server; let me know if you would like me to run any further diagnostics.

Additionally, I don't know if it is worth reporting as an actual bug, but when all of the data is contained within one file...

clust_input1    HH6 HH6_pGFP_2, HH6_pGFP_3
clust_input1    HH8 HH8_pGFP_3,HH8_pGFP_1
clust_input1    HH10    HH10_pGFP_A1, HH10_pGFP_3, HH10_pGFP_2
clust_input1    HH12    HH12_pGFP_A1, HH12_pGFP_2, HH12_pGFP_1
clust_input1    HH14    HH14_pGFP_A1, HH14_pGFP_3, HH14_pGFP_2
clust_input1    HH16    HH16_pGFP_1, HH16_pGFP_A12, HH16_pGFP_A1

specifying -d as anything other than 1 results in an error. Would it be possible to apply this filter at the replicate level instead of the dataset level, or is this parameter only meant for multi-dataset analysis?

Thanks, Austin

BaselAbujamous commented 5 years ago

Hi Austin

Thanks for using clust and for reporting these two issues.

  1. Multi-threading: Although the current version of clust is fast and multi-threading might not be essential, I am keen to fix this bug. I appreciate your offer to do further diagnostics, Austin :) How was your total memory usage? Did you try running it again and get the same problem? I am just wondering whether this is a one-off deadlock or a consistent bug.

  2. The -d parameter: This is only relevant for multiple datasets, so it must be set to 1 for a single dataset (it is meaningless otherwise). This parameter filters out genes that do not EXIST in at least -d datasets. It does not look at the genes' expression values; it just checks whether they exist in each dataset or not.
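To make that concrete, here is a minimal sketch of what I mean (an illustration only, not clust's actual code; the gene and dataset names are invented):

```python
# A gene is kept only if its ID appears in at least d of the input datasets,
# regardless of its expression values.
datasets = {
    "dataset1": {"geneA", "geneB", "geneC"},
    "dataset2": {"geneA", "geneC"},
    "dataset3": {"geneA"},
}

def passes_d_filter(gene, datasets, d=1):
    """True if the gene exists (by ID, values ignored) in at least d datasets."""
    return sum(gene in genes for genes in datasets.values()) >= d

# With -d 2, geneB would be dropped (it exists in only one dataset);
# with a single dataset, only -d 1 can ever be satisfied.
print([g for g in ("geneA", "geneB", "geneC") if passes_d_filter(g, datasets, d=2)])
```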

If you are interested in filtering out genes with low expression at the level of conditions, consider using other parameters. For example:

-fil-v 10 -fil-c 2 -fil-d 1

This means: filter out genes that do not have an expression value of at least 10, in at least 2 conditions, in at least 1 dataset. Note that -fil-c filters at the condition level, not the replicate level (you have 6 conditions and 16 replicates as far as I can see). Indeed, if there is a single dataset, -fil-d will be 1 by default.
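As a rough illustration of that reading of the three parameters (again, not clust's implementation; the numbers below are invented condition-level values):

```python
import numpy as np

# Keep a gene if, in at least fil_d datasets, at least fil_c of its
# per-condition values are >= fil_v.
def keep_gene(per_dataset_values, fil_v=10.0, fil_c=2, fil_d=1):
    """per_dataset_values: one 1-D array of condition-level values per dataset."""
    datasets_passing = sum(
        int(np.sum(np.asarray(vals) >= fil_v) >= fil_c)
        for vals in per_dataset_values
    )
    return datasets_passing >= fil_d

# Single dataset, six conditions (as in your file):
print(keep_gene([[3.0, 12.5, 0.0, 11.0, 4.2, 7.9]]))  # True: two conditions reach 10
print(keep_gene([[3.0, 12.5, 0.0, 9.0, 4.2, 7.9]]))   # False: only one condition reaches 10
```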

You can also use this alternative way (which I personally like):

--fil-perc -fil-v 25 -fil-c 2 -fil-d 1

The --fil-perc option means: read the value of -fil-v as a percentile rather than as an absolute expression value. In other words, this example keeps genes whose expression values are greater than or equal to the 25th percentile of all expression values in the data, in at least 2 conditions, in at least 1 dataset. This way, the expression threshold is calculated from the data itself using percentiles rather than being given manually.
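A toy sketch of that idea (not clust's code; random data, and with a single dataset -fil-d 1 is trivially satisfied):

```python
import numpy as np

# With --fil-perc, the -fil-v 25 threshold is the 25th percentile of all
# expression values in the data, so it is derived from the data itself.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=5.0, size=(1000, 6))  # toy matrix: 1000 genes x 6 conditions

threshold = np.percentile(data, 25)              # -fil-v 25 read as a percentile
kept = (data >= threshold).sum(axis=1) >= 2      # -fil-c 2: at least 2 conditions per gene
print(f"threshold = {threshold:.2f}, genes kept = {int(kept.sum())} / {data.shape[0]}")
```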

Please let me know if this works or if you need any extra help!

All the best, Basel

Austin-s-h commented 5 years ago

Thanks for the quick reply, Basel! I am running this on a 16-core Xeon W-2145 with 64GB of DDR4. The memory usage was stable (and low): the processes were taking around 120 MB of RAM each, and that stayed constant once the program hung at 80%. I had this problem with -np 8 and -np 16 multiple times. I can try running updates and restarting to see if that fixes anything.

When running with -np 1 I had slightly higher RAM usage (160 MB), but the script didn't hang and completed in under 2 minutes.

BaselAbujamous commented 5 years ago

Hi Austin

I thought about the multithreading issue that you reported. I am unsure what caused it. However, I don't expect a big gain in time from a large number of cores, especially since the dataset seems to be small (you said one core did it in under 2 minutes). Did you try -np 2 or -np 4? Or did you try it when submitting multiple datasets? Hanging at 80% sounds like a problem in cross-thread communication in a small job.

Please don't spend your time on further experiments if you haven't already done them. I already appreciate the time you took to write up this issue.

Thanks Basel

alexharkess commented 5 years ago

Just a quick note that I experience the same issue with multithreading (any value that isn't -np 1), with a hang at 80% at the seed cluster stage. Changing to -np 1 fixes it.

SchwarzEM commented 5 years ago

I've just re-encountered this same bug -- namely, when trying to run clust with "-n 4" or "-n 8", it hangs forever at the "80%" stage of seed clusters production.

I experienced this on two different Linux systems: one was Linux Mint 18 Sarah (GNU/Linux 4.4.0-21-generic x86_64), the other Linux CentOS (3.10.0-693.21.1.el7.x86_64 #1 SMP). In both cases, I was working with the "Python package version 1.10.0 (2018)" version of clust.

And in both cases, when I checked the issues page, found this bug report, and tried "-n 1" instead, the jobs completed easily in minutes!

This delayed my getting the results by half a day. Not a huge deal -- but since this seems to be a recurring bug, and since it may delay other users of clust, could some kind of warning about it be made prominent? Either on the main documentation page for clust, or in the help message one gets with "clust" alone or "clust --help", or both?

BaselAbujamous commented 5 years ago

Hi all,

Thanks a lot for reporting this bug repeatedly. As Clust has become a lot faster since its first release, and as running Clust on multiple threads seems to cause this problem (whose cause I am still not clear on), I have gone with @SchwarzEM 's suggestion: as of version 1.10.7, the -np option is forced to 1, and a warning will appear if it is set to anything else.

Thanks again Basel