MrOlm / inStrain

Bioinformatics program inStrain
MIT License
134 stars 33 forks source link

Speed up inStrain compare by bypassing the extensive pandas.DataFrame subset extraction processes #181

Closed zhenjiaofenjie closed 2 months ago

zhenjiaofenjie commented 2 months ago

Greatings!

I've made some optimizations and fixes that I believe will significantly improve performance and clarify error messaging. Here's a summary of the changes:

TLDR

These changes have been tested thoroughly and have shown significant improvements in both efficiency and usability. I believe they will enhance the overall user experience and streamline the analysis process.

In detail

I'm recently comparing 49 deep-sequencing metagenomes (~100 GB per sample), the compare process takes almost forever at Step 2 running group 1 of 220. I think your suggestions of comparing one genome each time in #148 is a good start and really did the trick (~3 hours per genome).

But after I added more debug loggings in the source code, the finding is very surprising: In the one-genome case it took just 30 minutes to load each profile but took another 40 minutes before the parallelized comparing workers are actually running (note the time lap between the last cumulative_snv_table and the first WorkerLog).

24-05-03 15:47:26 DEBUG    Checkpoint Compare multiprocessing start 448057344
24-05-03 15:47:26 INFO     Running group 1 of 1
24-05-03 15:47:26 DEBUG    Loading KT_MT_tpm..FDZ127__total_bwa.bam
24-05-03 15:47:36 DEBUG    covT 10.300886
24-05-03 15:47:51 DEBUG    cumulative_snv_table 15.016690
24-05-03 15:47:51 DEBUG    Loading KT_MT_tpm..FDZ128__total_bwa.bam
24-05-03 15:47:58 DEBUG    covT 7.444016
24-05-03 15:48:19 DEBUG    cumulative_snv_table 20.384073
24-05-03 15:48:19 DEBUG    Loading KT_MT_tpm..FDZ129__total_bwa.bam
24-05-03 15:48:26 DEBUG    covT 7.212746
24-05-03 15:48:35 DEBUG    cumulative_snv_table 8.734889
...
24-05-03 16:14:07 DEBUG    Loading KT_MT_tpm..TY1-18__total_bwa.bam
24-05-03 16:14:17 DEBUG    covT 9.986339
24-05-03 16:14:32 DEBUG    cumulative_snv_table 15.182714
24-05-03 16:51:40 DEBUG
WorkerLog Compare FDZ134|k141_1831661 start 201064448 1714726293.2355466 1334316

After further digging, it turns out that the function inStrain.compare_utils.subset_SNP_table is called for every scaffold sample pair. In my case, it's 69 49 = 3381 calls, and scales up very quickly if you have more scaffolds and samples. Although each call on this function takes less than 1 sec, the total running time can easily be magnitudes higher than all the other steps.

To overcome this, I added a new function inStrain.compare_utils.hash_SNP_table to split the SNP tables based on scaffolds and store them using dictionary. This function needs only to be called once per sample, and for the rest of the for loop you only need to get the subset SNP tables from the dictionary, which costs almost no time.

After this small modification, comparing 155750 scaffolds among 49 samples took only 2 min. after loading all covTs and SNP tables and before the compare worker is actually running.

24-05-04 21:37:58 DEBUG    Loading KT_MT_tpm..FDZ150__total_bwa.bam
24-05-04 21:38:10 DEBUG    covT 12.349708
24-05-04 21:38:39 DEBUG    cumulative_snv_table 28.679268
24-05-04 21:39:19 DEBUG    Load cache done.
24-05-04 21:41:28 DEBUG
WorkerLog Compare FDZ147|k141_383270 start 341340160 1714830028.4265416 1282417

I also did another small fix on the error message of plotting functions 1 and 3. When running with --dadabase_mode, they says:

"you don't have all required information. You need to run inStrain genome_wide first"

But the actual reason is:

"Plot cannot be created when run with --database_mode or --skip_mm_profiling".

I welcome your feedback and suggestions for further improvements.

Thank you!

Best regards, Jing

MrOlm commented 2 months ago

Hi @zhenjiaofenjie - Thanks a million for this! What a delightful surprise to find this weekend, and thank you very much for submitting this change as a pull request.

I ran a couple tests, though, and I'm hitting an issue when one of the profiles being compared doesn't have any called SNVs. Running the following command on the attached files should reveal the error:

inStrain compare -i /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.100.0.subset.sorted.bam.IS /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.99.9.subset.sorted.bam.IS -o /Users/mattolm/Programs/inStrain/test/test_backend/testdir/testR --store_mismatch_locations

I tried doing a simple modification (below), but then the resulting dataframe didn't find the SNV locations. When I run that command on the master branch, the resulting testR_pairwise_SNP_locations.tsv file should have 11 SNVs called.

I'm sorry I couldn't fix this error myself, but I'd love to merge this into the master if we can get this working.

Thanks, Matt

    """
    Split SNP table according to scaffold
    """
    if len(db) == 0:
        print('empty')
        return {}

    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    dbhash = {}
    for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
        dbhash[key] = db.iloc[idx1:idx2]

    return dbhash

share.zip

zhenjiaofenjie commented 2 months ago

Hi Matt,

I'm terribly sorry about the stupid mistake in taking care of the dataframe index. It by accident skips the last scaffold in every SNP table, which is hard to notice when you have more than one scaffold to compare. Lucky you catch it in time because I've used the hacked version to run several jobs 😅 The correct version should be:

def hash_SNP_table(db):
    """
    Split SNP table according to scaffold
    """
    if len(db) == 0:
        print('empty')
        return {}

    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    index = np.append(index, len(db))  # add the index of the last row in db
    dbhash = {}
    for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
        dbhash[key] = db.iloc[idx1:idx2]

    return dbhash

Let me know if this works.

Best, Jing

MrOlm commented 2 months ago

Wonderful- this change addressed the issue. I've merged your changes in the master branch of inStrain, and assigned it version 1.9.0. Thanks again for contributing!