MrOlm / inStrain

Bioinformatics program inStrain
MIT License

Best practice to compare hundreds of samples #116

Closed zhaoc1 closed 1 year ago

zhaoc1 commented 2 years ago

Hi Matt,

I have a question about inStrain compare. I want to compare hundreds of inStrain profiles, and the current method is to list the profiles and BAM files (via --bams) directly on the command line. However, there is an upper limit on the length of a bash command line. Can you recommend a best practice for comparing the inStrain profiles of hundreds of samples? Thank you.

Chunyu

MrOlm commented 2 years ago

Hi Chunyu,

Adding text-file input for profiles and bams is near the top of my inStrain "To Do" list for this reason, but it's not available yet. A stupid work-around I've used is creating short paths / filenames for each input file / directory using symlinks, since it's the total character count of the command that causes this problem.
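
The symlink work-around could be scripted roughly as follows. This is only a sketch: `make_short_links`, the `s` prefix, and the link directory are hypothetical names chosen for illustration, not part of inStrain, and it assumes a POSIX filesystem where symlinks are allowed.

```python
from pathlib import Path


def make_short_links(paths, link_dir, prefix="s"):
    """Create short symlinks (s0, s1, ...) in link_dir, one per input path.

    Hypothetical helper, not part of inStrain. Returns the link paths as
    strings, in the same order as the inputs.
    """
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for i, target in enumerate(paths):
        link = link_dir / f"{prefix}{i}"
        # Replace a stale link left over from an earlier run.
        if link.is_symlink():
            link.unlink()
        # Point the link at the absolute location of the original.
        link.symlink_to(Path(target).resolve())
        links.append(str(link))
    return links
```

The returned short paths can then be given to inStrain compare in place of the originals. Keeping the link directory itself short (and running from inside it) saves the most characters. If BAM files are linked this way, any index file that sits next to the BAM would presumably need its own matching link.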

I will preemptively note, however, that running more than 80 samples starts to take a really long time. I've run up to 120 samples before, which takes a few days, but after that it takes a really really long time. That's because inStrain compare uses pair-wise comparisons, so the run time grows roughly quadratically with the number of samples (n samples means n(n-1)/2 pairs). If at all possible I would suggest breaking up your samples into smaller (potentially overlapping) groups of <80 samples to avoid this. I do also have plans to implement a greedy clustering strategy to try and avoid this scaling problem, but that won't be ready for a long time.
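
To put numbers on the pair-wise cost and on the grouping suggestion, here is a small sketch. `n_pairs` and `overlapping_groups` are hypothetical helpers, and the group size of 79 and overlap of 10 are illustrative assumptions, not values recommended by inStrain.

```python
from math import comb


def n_pairs(n):
    """Number of pair-wise comparisons among n samples: n * (n - 1) / 2."""
    return comb(n, 2)


def overlapping_groups(samples, max_size=79, overlap=10):
    """Split samples into consecutive groups of at most max_size items,
    where each group shares `overlap` samples with the previous one.

    Hypothetical helper; max_size and overlap are illustrative defaults.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be >= 0 and smaller than max_size")
    if not samples:
        return []
    step = max_size - overlap
    groups = []
    start = 0
    while True:
        groups.append(samples[start:start + max_size])
        if start + max_size >= len(samples):
            break
        start += step
    return groups


for n in (80, 110, 120):
    print(n, n_pairs(n))
# 80 3160
# 110 5995
# 120 7140

groups = overlapping_groups([f"sample{i}" for i in range(110)])
print([len(g) for g in groups])            # [79, 41]
print(sum(n_pairs(len(g)) for g in groups))  # 3901
```

The trade-off is that samples in different groups are never compared with each other directly, except through the samples the groups share, so how to assign samples to groups depends on which comparisons matter.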

Best, Matt

zhaoc1 commented 2 years ago

Thanks for the reply, Matt! It makes sense to me. I will try out the short path option for sure. I was planning to run inStrain compare for about 110 samples. Now I need to think more about this plan.

Please feel free to close this issue.

Best, Chunyu