bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Straglr seems to use more threads than specified #13

Closed. Jesson-mark closed this issue 8 months ago.

Jesson-mark commented 2 years ago

Hi @readmanchiu,

I have been using Straglr to detect VNTRs in my ONT data. After running multiple samples with 1 or 3 CPUs (specified via Straglr's --nproc parameter), I found that our computing nodes were running busy, which is not normal: the load indicates the program is using more threads (or CPUs) than specified.

Although this busy period only lasts a short while (about 20 minutes), when multiple programs run on the same computing node the combined load becomes extremely high, which may harm the node.

After some tests, I found that the node starts running busy right after the TRF temp files are generated, but I couldn't figure out which step after that is using the extra threads. Could you give me some suggestions or directions on how to deal with this problem?

Looking forward to your reply. Thanks.

Jesson-mark

readmanchiu commented 2 years ago

Hi @Jesson-mark, thanks for trying Straglr first of all. It shouldn't use more than the specified number of CPUs; I will take a look to make sure of that. But there is an inherent problem right now with using BLASTN to compare motifs from different stages of the program: a lot of small BLASTN jobs get fired off, although they still shouldn't use more than the specified number of CPUs. This is especially problematic for long VNTRs, as there are many permutations of the motif. Right now I don't have a simple solution, as any solution could significantly alter the results, but this issue needs to be addressed and I will try to figure out a solution ASAP. In the meantime, is it feasible to request a single node exclusively for each sample? Then you could avoid multiple runs on the same node.

Jesson-mark commented 2 years ago

Hi @readmanchiu , thanks for your prompt reply!

Our computing nodes have multiple CPUs (typically 28) and the computing resources are shared with others, so I cannot request a node exclusively for each sample. Besides, I observed that even when using only one CPU, our computing node still runs busy for about 10 minutes. What I am trying now is to request more threads from the scheduler than I specify in Straglr (currently three times as many). Even so, the phenomenon still occurs, but the node load is smaller, which may be workable in the short term.

I am very glad that this issue is going to be addressed and I will do my best to help you with this.

Thanks.

Jesson-mark

Jesson-mark commented 2 years ago

Hi @readmanchiu, I have two questions that are confusing me.

The first question is about Straglr's genome scan mode. I was using the genotype mode of Straglr to genotype VNTRs listed in simpleRepeat.txt (from UCSC) in my ONT data. Since you mentioned that many BLASTN jobs are deployed to compare motifs between the different phases of the genome scan, I am a little confused about how the genome scan works in genotyping mode.

The other question is about the BLASTN jobs. If a lot of small BLASTN jobs run at the same time, would it be possible to add some sleep steps or condition tests to the motif-comparison loop, so that it waits for these small BLASTN jobs to complete before launching more? That way there would not be many tasks running at the same time, and our computing node might not become so busy. If this is a feasible workaround, could you give me some suggestions on which step the sleeps or condition tests could be added to?
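
Something like this throttling loop is what I have in mind (a sketch of the idea only; Straglr's actual job-launching code will look different):

import subprocess
import time

def run_throttled(commands, max_jobs=3, poll_secs=0.5):
    # launch shell commands, but keep at most max_jobs running at once
    running = []
    for cmd in commands:
        # condition test + sleep: wait until a slot frees up
        while len([p for p in running if p.poll() is None]) >= max_jobs:
            time.sleep(poll_secs)
        running.append(subprocess.Popen(cmd, shell=True))
    for p in running:
        p.wait()

run_throttled(["sleep 1"] * 10, max_jobs=3)  # toy demonstration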

Thanks a lot.

Jesson-mark

readmanchiu commented 2 years ago

Yes, I forgot BLASTN is also used in the genotype mode. The genotyping mode is basically the second step of the genome scan mode; the only difference is that in genotyping mode the motif for each locus is provided by the user instead of being detected in the first step of the genome scan.

So in the genotyping step, BLASTN is used to compare the intended motif against the detected one. This is necessary because sometimes the sequence within the specified locus boundaries is made up of non-repeat sequence (e.g. a mobile element), and Straglr only reports the size of the repeat in these cases. Often motifs do not start at the same base or contain differences, so the comparison cannot be done by a simple regex and has to be handled by alignment.

A BLASTN alignment should be run at most once per locus, so I don't believe multiple alignment jobs are spawned from a single process in multi-processing. The issues I do see are that a lot of temporary files are generated (there are a lot of loci), and that a single long alignment can sometimes hold up the completion of a processing job. Sorry, I don't quite get how adding sleeps or your suggested solution would help.
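
To illustrate why a regex is not enough: two motifs can be rotations of each other, and on top of that they can carry mismatches or indels, which is what pushes the comparison to alignment. A toy rotation check, just to show the idea (not the actual Straglr code):

def is_rotation(motif_a, motif_b):
    # "AGC" and "GCA" describe the same repeat, just phased differently:
    # every rotation of motif_a appears as a substring of motif_a doubled
    return len(motif_a) == len(motif_b) and motif_b in motif_a + motif_a

print(is_rotation("AGC", "GCA"))  # True: same motif, different start base
print(is_rotation("AGC", "AGT"))  # False: a single mismatch defeats exact matching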

kmnip commented 2 years ago

@Jesson-mark How do you track the CPU usage? Were you watching from top? I wonder if the "busy" status of your machine has to do with high I/O as opposed to high CPU usage.

Jesson-mark commented 2 years ago

@readmanchiu Thanks for your detailed explanation; now I have a better understanding of how Straglr works. If there is at most one BLASTN job at a time but many temporary files, I wonder if it would be possible to use more memory in exchange for fewer I/O operations?

@kmnip Thanks for your comments. My institution uses the Portable Batch System (PBS) for job scheduling, so I cannot check the CPU usage of my jobs with top myself. However, our PBS administrator showed me top output in which one of my programs was using 1600% CPU (i.e. 16 threads, according to the administrator) while I had specified only 1 or 3 threads in Straglr. I am not sure whether high I/O would make a PBS computing node busy; previous similar cases were caused by jobs using more threads than they requested.

I will try to track the CPU and I/O usage of Straglr to find the source of this issue.

kmnip commented 2 years ago

I wonder if explicitly specifying -num_threads 1 would make a difference at: https://github.com/bcgsc/straglr/blob/5d02f3e1bc000c1866861f1c148f892c78fc6baa/src/tre.py#L543-L549
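
i.e. something along these lines (a hypothetical paraphrase with placeholder arguments, not the actual code at the link):

import subprocess

def run_blastn(query_fa, db):
    # query_fa / db are placeholders for whatever tre.py passes in
    cmd = ["blastn",
           "-query", query_fa,
           "-db", db,
           "-outfmt", "6",
           "-num_threads", "1"]  # explicit, even though 1 is the documented default
    return subprocess.run(cmd, capture_output=True, text=True).stdout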

Jesson-mark commented 2 years ago

Hi @kmnip, I checked the blastn help, which shows that the default for -num_threads is 1, so I suspect explicitly specifying -num_threads 1 would make no difference. But I will give it a try to see.

Following is the output of blastn related to num_threads:

 *** Miscellaneous options
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote

Jesson-mark commented 2 years ago

Hi @readmanchiu, @kmnip, I added -num_threads 1 at line 547 of tre.py and reran Straglr. The load on the computing node is the same as when it is not specified; in both cases it is higher than the number of threads specified in Straglr.

Besides, I added some info output to tre.py and found that the computing node's load becomes too high at the end of find_similar_long_patterns_gt or at the beginning of grouping alleles by locus (code linked below). I also noticed that the more VNTRs I supply through --loci, the higher the node load: when I supplied about 33000 VNTRs to genotype, the load was ten times the number of specified threads, while with 10000 VNTRs it was six times. I am not sure which step causes the high load; could you give me some suggestions? A sketch of a helper for narrowing this down is after the code link.

https://github.com/bcgsc/straglr/blob/5d02f3e1bc000c1866861f1c148f892c78fc6baa/src/tre.py#L582-L589
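
For what it's worth, this is the kind of instrumentation one could drop around the suspect steps (a sketch; psutil is an extra dependency, not something Straglr uses):

import os
import time
import psutil  # assumption: installed separately, not a Straglr dependency

def cpu_snapshot(tag, window=1.0):
    # approximate %CPU of this process plus its children over `window` seconds
    me = psutil.Process(os.getpid())
    procs = [me] + me.children(recursive=True)
    for p in procs:
        p.cpu_percent(interval=None)  # prime the counters
    time.sleep(window)
    total = 0.0
    for p in procs:
        try:
            total += p.cpu_percent(interval=None)
        except psutil.NoSuchProcess:
            pass  # a child exited during the window
    print("[%s] %s: ~%.0f%% CPU" % (time.strftime("%H:%M:%S"), tag, total))

# e.g. call cpu_snapshot("after find_similar_long_patterns_gt") around suspect steps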

readmanchiu commented 2 years ago

OK, it seems like it might be feasible to combine the blastn calls into one per process. I will give it a shot this week. Stay tuned.

Jesson-mark commented 2 years ago

Thanks for your attention to this matter. I will keep following up.

Besides, I have two suggestions that may be useful for debugging. The first is to output some key information at the different stages of Straglr, such as telling users how many loci are loaded and will be genotyped, when the TRF input fasta is being constructed, and when the GMM is run to genotype VNTR alleles. The second is to add out_prefix to the temporary file names, which would help with checking and debugging when running multiple samples at the same time.

These two suggestions may be trivial, but they let users see which step Straglr is running and help them debug problems more conveniently. I have modified some of Straglr's code to achieve this; the code is in my repository: https://github.com/Jesson-mark/straglr_modified. I hope it helps.
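
A minimal sketch of the kind of stage message I mean (simplified, not the exact code in my fork):

import sys
import time

def stage(msg):
    # timestamped progress line, flushed so it shows up promptly under PBS
    sys.stderr.write("[%s] %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg))
    sys.stderr.flush()

stage("loaded 31500 loci for genotyping")
stage("constructing TRF input fasta")
stage("running GMM to genotype alleles")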

Best wishes, Jesson-mark.

readmanchiu commented 1 year ago

Hi @Jesson-mark, I've finally done something about the blastn jobs. Now there is only one blastn job per process, which significantly reduces the number of alignments and temporary files. Also relating to the blastn tasks, I added some filtering criteria to the commands before executing them; this might help eliminate some hanging blastn jobs that take forever to finish. As a result, the genome scan should be faster than before. There are other changes I introduced as I tinkered with the code, so the results will not be identical to v1.2's, but they should be largely similar. Please give it a try; I look forward to your feedback. (The code is updated in the main repo; I haven't tagged it for an official release yet.)

Jesson-mark commented 1 year ago

Hi @readmanchiu, sorry for the late reply. I will try the latest Straglr and give you feedback soon.

Jesson-mark commented 1 year ago

Hi @readmanchiu, I've used the latest Straglr to genotype about 31500 VNTRs for a sample sequenced with ONT and compared the results to those from Straglr v1.2. While v1.2 genotyped 28049 VNTRs, the latest Straglr genotyped 27731, of which 27698 were also genotyped by v1.2. Of those 27698 shared VNTRs, 92.3% (25574/27698) have the same position and detected motif, and of those 25574, 93.8% (23999/25574) have the same allele size. Overall, the copy numbers and motifs of the majority of VNTRs are identical.

Then I tested whether this version of Straglr still makes our computing system busy. Unfortunately, like v1.2, the latest Straglr also makes it busy, with a similar load. Next I tried to find out which step is the source of the problem, and after some debugging I finally found the reason: it is sklearn's GaussianMixture that can use more than one CPU. Here is some test code to reproduce the problem.

from time import sleep
import numpy as np
from sklearn.mixture import GaussianMixture

# simulate two alleles with 200 supporting reads each (sizes 3 and 1)
allele1 = np.zeros(200) + 3
allele2 = np.ones(200)
sizes = np.concatenate((allele1, allele2)).reshape((400, 1))
print("shape of sizes: %s" % (str(sizes.shape)))
sleep(5)

max_num_clusters = 2
X = sizes
N = np.arange(1, max_num_clusters + 1)

for trial in range(100):
    print("------ Try %s ------" % (trial))

    print("Run GaussianMixture")
    # note: init_params="k-means++" needs sklearn >= 1.1; with the default
    # init_params="kmeans" the CPU spike described above is reproduced
    models = [GaussianMixture(n_components=n, init_params="k-means++").fit(X)
              for n in N]
    sleep(0.5)

    # pick the model with the lowest AIC (BIC computed for reference only)
    AIC = [m.aic(X) for m in models]
    BIC = [m.bic(X) for m in models]
    M_best = models[np.argmin(AIC)]

    labels = M_best.predict(X)

    # group the data points by cluster label
    clusters = [[] for _ in range(M_best.n_components)]
    for label, x in zip(labels, X):
        clusters[label].append(x[0])

    # sort clusters
    print("sort clusters")
    for cluster in clusters:
        cluster.sort()
    clusters.sort(key=lambda c: c[0])
    for cluster in clusters:
        m = np.mean(cluster)
        s = np.std(cluster)
        print("cluster mean %.2f, std %.2f" % (m, s))

When executing this code, the %CPU of the program as reported by top is greater than 100%, which means it is using more than one thread. When running the corresponding step of Straglr with only one process, I found that %CPU is greater than 1000%! So I think the GMM is the source of the problem.

Next I checked the usage of sklearn's GaussianMixture and found that the default initialization method (k-means) is more computationally expensive than the others. I tried the k-means++ method, as it is much quicker than k-means, and found that it uses only one thread. The modification I made to Straglr is adding init_params="k-means++" to GaussianMixture at line 26 of cluster.py. Then I tested the difference between the two initialization methods. For 1000 randomly selected VNTRs, only one VNTR differs in copy number, and the difference is small. Besides, I randomly selected two samples and genotyped about 31500 VNTRs for each (sample1_b14, sample2_b27). Below are the summary statistics:

sample | sample1_b14 | sample2_b27
-- | -- | --
Num VNTRs (v1.2) | 28096 | 28287
Num VNTRs (latest) | 27785 | 27867
Not in v1.2 | 47 | 62
Not in latest | 358 | 482
Num intersection | 27737 | 27804
same pos and motif | 0.929 (25768/27737) | 0.911 (25329/27804)
abs_size_diff ≤ 2 | 0.954 (26450/27737) | 0.946 (26313/27804)
abs_size_diff ≤ 6 | 0.985 (27312/27737) | 0.980 (27241/27804)

Num intersection is the number of VNTR positions reported by both versions of Straglr. same pos and motif is the number of VNTRs with exactly the same position and detected motif. abs_size_diff ≤ 2 and abs_size_diff ≤ 6 are the fractions of VNTRs whose allele-size difference is at most 2 and at most 6, respectively. Overall, the difference between the results of the two versions is relatively small, so this modification may be a solution to the problem.
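
For reference, the change amounts to a one-line tweak, roughly like this (a paraphrase, not the exact cluster.py source):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate((np.zeros(200) + 3, np.ones(200))).reshape(-1, 1)

# before (paraphrased): default init_params="kmeans", which can use many CPUs
# gmm = GaussianMixture(n_components=2).fit(X)

# after: k-means++ initialization, which stayed on one thread in my tests
# (requires sklearn >= 1.1)
gmm = GaussianMixture(n_components=2, init_params="k-means++").fit(X)
print(gmm.means_.ravel())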

In addition, I found another, minor problem relating to long TRF or blastn runtimes. While most of my jobs finish in about 2 hours, some run for a very long time, many of them more than 12 hours; they hang at the TRF or blastn step. I'm not sure what situation causes this. Is there any way to prevent it? Looking forward to your reply.

readmanchiu commented 1 year ago

@Jesson-mark, thanks for your detailed analysis! For the 28049-27731=318 loci, I wonder if you have randomly picked some to see why they have no results in the new version; it doesn't necessarily mean the v1.2 results for those loci are reliable. And I also suspect there may be a small number of loci for which the new version produced results but v1.2 did not?

Regarding GMM multi-threading, I found this ticket: https://stackoverflow.com/questions/19257070/unintended-multithreading-in-python-scikit-learn. Maybe you can give it a try as well? You compared the results of kmeans and kmeans++, but are you sure the CPU usage is fine after switching to kmeans++? Anyway, I'm willing to give kmeans++ a try to see what kind of results I get on my end.

And for the long TRF/blastn times, I wonder what the largest span in your VNTR list is. Often it's just one or two troublemakers that bog down the whole process.

Jesson-mark commented 1 year ago

Yes, there are indeed a small number of loci produced by the new version but not by v1.2, as shown in the table above. I will inspect some loci to see why there is a difference between the two versions.

For the kmeans++ method, I captured the CPU usage, and it stays at a low level (a little below 50%, if I remember correctly). Besides, I will try the solutions in the ticket you supplied to see if they make a difference.

I've calculated the largest span in my VNTR lists. There is a mildly positive correlation between the largest span in a list and Straglr's runtime. If I use many CPUs for those lists, the run finishes in an acceptable time, but I wonder if there is a better way to prevent the problem?
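
For reference, this is roughly how I computed the largest span (a sketch assuming a BED-like loci file with chrom, start, end columns):

# find the largest locus span in a BED-like loci file
max_span, worst = 0, None
with open("loci.bed") as f:  # filename is a placeholder
    for line in f:
        cols = line.rstrip("\n").split("\t")
        chrom, start, end = cols[0], int(cols[1]), int(cols[2])
        if end - start > max_span:
            max_span, worst = end - start, (chrom, start, end)
print("largest span: %d bp at %s:%d-%d" % ((max_span,) + worst))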

Jesson-mark commented 1 year ago

I've tried the solutions in that stackoverflow ticket, but they made no difference compared with the original settings. I tested all four environment variables (MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, NUMEXPR_NUM_THREADS, OMP_NUM_THREADS) and also tried setting the thread count via the mkl library in the Python scripts (in cluster.py and my test.py). But the CPU usage is still too high, even greater than 1000%.

I noticed that with these environment variables set, numpy uses only one CPU for matrix calculations. So why does sklearn still use many CPUs?
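
For completeness, this is the shape of what I tried, plus threadpoolctl as one more option (untested here on my side, just an idea):

import os

# must be set before numpy/sklearn are imported to have any effect
for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np
from sklearn.mixture import GaussianMixture
from threadpoolctl import threadpool_limits  # extra dependency

X = np.concatenate((np.zeros(200) + 3, np.ones(200))).reshape(-1, 1)
with threadpool_limits(limits=1):  # caps BLAS/OpenMP thread pools at runtime
    GaussianMixture(n_components=2).fit(X)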

Jesson-mark commented 1 year ago

@readmanchiu, I've checked the loci genotyped by only one of the two versions. For the sample I selected and regenotyped, v1.2 genotyped 28049 VNTRs and the new version 27715; 366 loci were genotyped only by v1.2 and 32 only by the new version. I then used these two sets as input loci for the opposite version of Straglr (loci only in v1.2 were given to the new version, and loci only in the new Straglr to v1.2). Interestingly, these loci can now be genotyped, and the total number of regenotyped loci is close to the number that was previously missing. For the 366 loci genotyped only by v1.2, 335 have the same position and 227 of those also have the same motif; the allele-size difference is less than 6 for about 90% (302/335) of the loci.

I also tested sample1_b14 and got the same results: loci not genotyped by the new version are recovered when regenotyped with the new version. I'm a little confused by this strange behaviour. Could you give any suggestions? Thanks a lot.

readmanchiu commented 1 year ago

@Jesson-mark, so I've tested the k-means++ initialization for the GMM. It requires the latest sklearn (v1.1), so I just need to update the requirements. The results seem mostly the same; I don't think changing it would drastically alter the results. If this fixes your multi-threading issue and the other methods (the numpy ways) don't work, I'm willing to make the change.

It seems that, for now, fringe cases such as borderline expansions or borderline support levels contribute most of the differences between runs. I think it's due to the annoying stochasticity of GMM clustering. I don't have a solution right now, so you can probably expect to still see this in the next release.

Jesson-mark commented 1 year ago

@readmanchiu, yes, it requires the latest sklearn. For now I think only this change can address my multi-threading problem, and I would really appreciate the modification to Straglr.

I've checked the VNTRs whose results differ between the two versions and found that most of them are located in complex regions (the allele sizes of the reads supporting each locus are inconsistent and can differ a lot). So in my opinion the differences should not be attributed to Straglr or the GMM.

Thanks for your consideration of this matter; looking forward to the next release of Straglr.

readmanchiu commented 1 year ago

I've changed cluster.py to use k-means++ initialization now; hopefully this relieves the stress on your cluster's CPUs.

Jesson-mark commented 1 year ago

Thanks a lot for that! That will help me relieve stress on our cluster CPUs.

Besides, I'm now encountering another problem, related to Straglr's memory usage. I'm calling VNTRs from some high-coverage samples (mean depth of 90x or higher) and noticed that some runs use about 400 GB of memory while others use about 100 GB. Because I have many samples to process, it's important to allocate memory for each job appropriately. Is there a simple way to approximately predict the maximum memory Straglr will use?
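
In case it's useful, this is a way to at least measure the peak after a run (a sketch; ru_maxrss is in kilobytes on Linux, and RUSAGE_CHILDREN only covers children that have already exited):

import resource

def peak_rss_gb():
    # peak resident set size of this process and of its reaped children, in GB
    self_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    child_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return self_kb / 1024.0 / 1024.0, child_kb / 1024.0 / 1024.0

print("peak RSS (self, children): %.1f GB, %.1f GB" % peak_rss_gb())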

Best regards.

readmanchiu commented 1 year ago

Wow, are they whole-genome or exome samples? I guess you're doing genotyping, not a genome scan? The only solution I have is divide-and-conquer: make smaller batches for your cluster jobs (or, if you're doing a genome scan, run one job per chromosome); see the sketch below. One limitation of Python applications is that the parallelism is not "real" multi-threading but multi-processing, where each child job copies all the storage of the parent job. If you know of any solution, let me know.
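
Something like this for the batching (a sketch only; it assumes a plain-text loci file, one locus per line):

def split_loci(path, batch_size=10000, prefix="batch"):
    # write every batch_size lines of the loci file to its own batch file
    batch, n = [], 0
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                with open("%s_%03d.txt" % (prefix, n), "w") as out:
                    out.writelines(batch)
                batch, n = [], n + 1
    if batch:  # remainder
        with open("%s_%03d.txt" % (prefix, n), "w") as out:
            out.writelines(batch)

split_loci("simpleRepeat.loci.txt")  # filename is a placeholder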

Jesson-mark commented 1 year ago

Yes, I'm genotyping whole-genome VNTRs. The loci come from UCSC's simpleRepeat.txt, which has about 960000 loci; I split it into 30 batches of about 31500 loci each, so each sample has 30 batches to genotype. For most low-coverage samples the memory used by one batch is less than 20 GB, but for some high-coverage samples it is much higher. I'm new to multi-processing in Python, but I'd like to give it a try. Besides, would it be possible for Straglr to write the TRF input to file in multiple chunks, which might reduce memory usage?