Closed Austin-s-h closed 4 years ago
Hi Austin
Thanks for using clust and for reporting these two issues.
Multi-threading: Although the current version of clust is fast and multi-threading may not be essential, I am eager to fix this bug. I appreciate your offer to do further diagnostics, Austin :) How was your total memory usage? Did you try running it again and get the same problem? I am just wondering whether it is a one-off deadlock or a consistent bug.
The -d parameter: This is only relevant for multiple datasets, so it must be set to 1 for a single dataset (it is meaningless otherwise). This parameter filters out genes that do not exist in at least -d datasets. It does not look at the genes' expression values; it only checks whether or not they exist in the dataset.
If you are interested in filtering out genes with low expression at the level of conditions, consider using other parameters. For example:
-fil-v 10 -fil-c 2 -fil-d 1
This means: filter out genes that do not have an expression value of at least 10, in at least 2 conditions, in at least 1 dataset. Note that -fil-c filters at the condition level, not the replicate level (you have 6 conditions and 16 replicates, as I can see). Indeed, if there is a single dataset, -fil-d will be 1 by default.
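The absolute-value filtering rule above can be sketched in a few lines of Python. This is a hypothetical illustration of the logic, not clust's actual implementation; the function name and the gene-per-row layout are assumptions:

```python
import numpy as np

def keep_gene(expr, fil_v=10.0, fil_c=2):
    """Hypothetical sketch of -fil-v / -fil-c filtering for one gene.

    expr: the gene's expression values, one per condition
    (a single dataset, so -fil-d 1 is trivially satisfied).
    Keep the gene if it reaches fil_v in at least fil_c conditions.
    """
    return int(np.count_nonzero(np.asarray(expr, dtype=float) >= fil_v)) >= fil_c

# The gene reaches 10 in two conditions (15.0 and 42.0), so it is kept.
print(keep_gene([0.0, 3.2, 15.0, 9.9, 1.1, 42.0]))  # True
```

A gene with, say, only one condition at or above 10 would be filtered out under the same settings.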
You can also use this alternative way (which I personally like):
--fil-perc -fil-v 25 -fil-c 2 -fil-d 1
The --fil-perc option means: read the value of -fil-v as a percentile rather than as an absolute expression value. In other words, this example keeps genes that have expression values greater than or equal to the 25th percentile of all expression values in the data, in at least 2 conditions, in at least 1 dataset. This way, the expression threshold is calculated from the data itself using percentiles rather than being given manually.
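The percentile-based variant can be sketched similarly. Again a hypothetical illustration, not clust's source; the function name and the genes-by-conditions matrix layout are assumptions:

```python
import numpy as np

def keep_genes_percentile(data, fil_v=25.0, fil_c=2):
    """Hypothetical sketch of --fil-perc filtering.

    data: 2-D array, rows = genes, columns = conditions.
    The threshold is the fil_v-th percentile of ALL expression
    values in the dataset, so it is derived from the data itself.
    Returns a boolean keep-mask over genes.
    """
    data = np.asarray(data, dtype=float)
    threshold = np.percentile(data, fil_v)
    return (data >= threshold).sum(axis=1) >= fil_c

data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
# 25th percentile of all nine values is 3.0; gene 0 reaches it in
# only one condition, so it is filtered out.
print(keep_genes_percentile(data).tolist())  # [False, True, True]
```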
Please let me know if this works or if you need any extra help!
All the best, Basel
Thanks for the quick reply, Basel! I am running this on a 16-core Xeon W-2145 with 64 GB of DDR4. The memory usage was stable (and low); the processes were each taking around 120 MB of RAM, and that stayed constant once the program hung at 80%. I had this problem with -np 8 and -np 16 multiple times. I can try running updates and restarting to see if that fixes anything.
When running with -np 1, RAM usage was slightly higher (160 MB), but the script didn't hang and completed in under 2 minutes.
Hi Austin
I thought about the multithreading issue you reported. I am unsure what would have caused it. However, I don't expect a big gain in time from a large number of cores, especially since the dataset seems to be small (you said one core did it in under 2 minutes). Did you try -np 2 or -np 4? Or did you try it when submitting multiple datasets? Hanging at 80% sounds like a problem in cross-thread communication on a small job.
Please don't spend your time on further experiments if you haven't already done them. I already appreciate the time you took to write up this issue.
Thanks Basel
Just a quick note that I experience the same issue with multithreading (any value other than -np 1), with a hang at 80% at the seed clusters production stage. Changing to -np 1 fixes it.
I've just re-encountered this same bug -- namely, when trying to run clust with "-n 4" or "-n 8", it hangs forever at the "80%" stage of seed clusters production.
I experienced this on two different Linux systems: one was Linux Mint 18 Sarah (GNU/Linux 4.4.0-21-generic x86_64), the other Linux CentOS (3.10.0-693.21.1.el7.x86_64 #1 SMP). In both cases, I was working with the "Python package version 1.10.0 (2018)" version of clust.
And in both cases, when I checked the issues page, found this bug report, and tried "-n 1" instead, the jobs completed easily in minutes!
This delayed my getting the results by half a day. Not a huge deal -- but since this seems to be a recurring bug, and since it may delay other users of clust, could some kind of warning about it be made prominent? Either on the main documentation page for clust, or the help-arguments message one gets with "clust" alone or "clust --help", or both?
Hi all,
Thanks a lot for repeatedly reporting this bug. Since Clust has become a lot faster since its first release, and since running Clust on multiple threads seems to cause this problem, whose cause I am still not clear on, I have gone with @SchwarzEM's suggestion to force -np 1 for now. As of version 1.10.7, the -np option is forced to 1, and a warning appears if it is set otherwise.
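A guard like the one described, forcing -np to 1 with a warning, might look roughly like this. This is a hypothetical sketch, not the actual clust source:

```python
import argparse
import warnings

parser = argparse.ArgumentParser(prog="clust")
parser.add_argument("-np", type=int, default=1,
                    help="number of parallel processes (currently forced to 1)")

# Simulate a user passing -np 8 on the command line.
args = parser.parse_args(["-np", "8"])

if args.np != 1:
    # Multiprocessing is disabled for now due to the reported hang at the
    # seed clusters production stage; fall back to a single process.
    warnings.warn("-np is forced to 1 in this version; "
                  "ignoring -np {}".format(args.np))
    args.np = 1

print(args.np)  # 1
```

The warning keeps the old flag accepted (so existing scripts don't break) while making the single-process fallback visible to the user.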
Thanks again Basel
Hello! Thanks for publishing this tool, I look forward to comparing it with the other clustering methods I have tried. I ran into two difficulties while using it.
First of all, when using more than 1 processor (e.g. -np 8), I run into what looks like a piping error. The pipeline hangs at 80% completion of seed clusters production for 5+ minutes (I killed it afterward). However, running clust on the exact same data with -np 1 fixes this issue. I am testing it on an Ubuntu 18.04 server; let me know if you would like me to do any further diagnostics.
Additionally, I don't know if it is worth reporting as an actual bug, but when all of the data is contained within one file, specifying -d as anything but 1 results in an error. Would it be possible to make this detect at the replicate level instead of the dataset level? Or is this parameter meant more for multi-dataset analysis?
Thanks, Austin