Closed zhenjiaofenjie closed 2 months ago
Hi @zhenjiaofenjie - Thanks a million for this! What a delightful surprise to find this weekend, and thank you very much for submitting this change as a pull request.
I ran a couple tests, though, and I'm hitting an issue when one of the profiles being compared doesn't have any called SNVs. Running the following command on the attached files should reveal the error:
inStrain compare -i /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.100.0.subset.sorted.bam.IS /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.99.9.subset.sorted.bam.IS -o /Users/mattolm/Programs/inStrain/test/test_backend/testdir/testR --store_mismatch_locations
I tried a simple modification (below), but then the resulting dataframe didn't contain the SNV locations. When I run that command on the master branch, the resulting `testR_pairwise_SNP_locations.tsv` file should have 11 SNVs called.
I'm sorry I couldn't fix this error myself, but I'd love to merge this into the master if we can get this working.
Thanks, Matt
"""
Split SNP table according to scaffold
"""
if len(db) == 0:
print('empty')
return {}
db = db.sort_values(['scaffold', 'mm'])
ukeys, index = np.unique(db['scaffold'], True)
dbhash = {}
for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
dbhash[key] = db.iloc[idx1:idx2]
return dbhash
Hi Matt,
I'm terribly sorry about the careless mistake in handling the dataframe index. It accidentally skips the last scaffold in every SNP table, which is hard to notice when you have more than one scaffold to compare. Lucky you caught it in time, because I've used the hacked version to run several jobs 😅 The correct version should be:
```python
def hash_SNP_table(db):
    """
    Split SNP table according to scaffold
    """
    if len(db) == 0:
        print('empty')
        return {}
    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    index = np.append(index, len(db))  # add the index of the last row in db
    dbhash = {}
    for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
        dbhash[key] = db.iloc[idx1:idx2]
    return dbhash
```
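For a quick sanity check, here's a minimal sketch (using a toy SNP table, not real inStrain data) showing that the appended boundary index keeps the final scaffold, which the earlier version silently dropped:

```python
import numpy as np
import pandas as pd

def hash_SNP_table(db):
    """Split SNP table according to scaffold (corrected version)."""
    if len(db) == 0:
        return {}
    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    index = np.append(index, len(db))  # include the final row boundary
    return {key: db.iloc[i:j] for key, i, j in zip(ukeys, index[:-1], index[1:])}

# Toy table: two SNVs on scaffold 's1', one on 's2'
toy = pd.DataFrame({'scaffold': ['s2', 's1', 's1'], 'mm': [0, 1, 0]})
hashed = hash_SNP_table(toy)
print(sorted(hashed))                        # ['s1', 's2'] -- 's2' is no longer dropped
print(len(hashed['s1']), len(hashed['s2']))  # 2 1
```

Without the `np.append`, `zip(ukeys, index[:-1], index[1:])` pairs the last scaffold with no closing boundary, so it never makes it into the dictionary.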
Let me know if this works.
Best, Jing
Wonderful - this change addressed the issue. I've merged your changes into the master branch of inStrain and assigned it version 1.9.0. Thanks again for contributing!
Greetings!
I've made some optimizations and fixes that I believe will significantly improve performance and clarify error messaging. Here's a summary of the changes:
TLDR

- Added a new function `inStrain.compare_utils.hash_SNP_table` to pre-process SNP tables based on scaffolds, significantly reducing redundant calculations.
- Reduced the number of calls to `inStrain.compare_utils.subset_SNP_table` in the comparison process from `<number of scaffolds> * <number of samples>` times to just once per sample, resulting in a drastic reduction in overall processing time.

These changes have been tested thoroughly and have shown significant improvements in both efficiency and usability. I believe they will enhance the overall user experience and streamline the analysis process.
In detail
I've recently been comparing 49 deep-sequencing metagenomes (~100 GB per sample), and the compare process takes almost forever at `Step 2: running group 1 of 220`. I think your suggestion in #148 of comparing one genome at a time is a good start and really did the trick (~3 hours per genome).

But after I added more debug logging to the source code, the finding was surprising: in the one-genome case it took just 30 minutes to load each profile, but another 40 minutes passed before the parallelized compare workers actually started running (note the time lapse between the last `cumulative_snv_table` log line and the first `WorkerLog`).
After further digging, it turns out that the function `inStrain.compare_utils.subset_SNP_table` is called for every scaffold-sample pair. In my case, that's 69 * 49 = 3381 calls, and it scales up very quickly if you have more scaffolds and samples. Although each call to this function takes less than 1 sec, the total running time can easily be magnitudes higher than all the other steps.

To overcome this, I added a new function `inStrain.compare_utils.hash_SNP_table` to split the SNP tables based on scaffold and store the pieces in a dictionary. This function only needs to be called once per sample, and for the rest of the for loop you just fetch the subset SNP tables from the dictionary, which costs almost no time. After this small modification, comparing 155750 scaffolds among 49 samples took only 2 min between loading all covTs and SNP tables and the compare workers actually running.
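The refactor described above can be sketched like this (the function bodies and variable names here are simplified stand-ins, not the exact inStrain internals); the point is that per-pair filtering becomes a single hash pass per sample followed by cheap dictionary lookups:

```python
import numpy as np
import pandas as pd

def subset_SNP_table(db, scaffold):
    # Old approach: scan the whole table on every (scaffold, sample) pair.
    return db[db['scaffold'] == scaffold]

def hash_SNP_table(db):
    # New approach: one sort + split per sample, then O(1) dict lookups.
    if len(db) == 0:
        return {}
    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    index = np.append(index, len(db))
    return {k: db.iloc[i:j] for k, i, j in zip(ukeys, index[:-1], index[1:])}

db = pd.DataFrame({'scaffold': ['s1', 's2', 's1'], 'mm': [0, 0, 1]})
dbhash = hash_SNP_table(db)  # called once per sample
for scaffold in ['s1', 's2']:
    old = subset_SNP_table(db, scaffold)  # repeated full-table scans
    new = dbhash[scaffold]                # cheap lookup
    assert old.sort_values(['scaffold', 'mm']).equals(new)
```

With S scaffolds and N samples, the old pattern does S * N table scans, while the new one does N hash passes plus S * N dictionary lookups.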
I also made another small fix to the error messages of plotting functions 1 and 3. When running with `--database_mode`, they say:

But the actual reason is:
I welcome your feedback and suggestions for further improvements.
Thank you!
Best regards, Jing