[Open] lmichael107 opened this issue 8 years ago
Hi Lauren, Are you happy with the clarification suggested by Natasha:
"...it is also possible to subdivide the database file to shorten job duration and increase the total number of jobs required (however, an additional step for normalizing the final merged results is also required)."
Or would you like me to omit any mention of subdividing the database entirely?
Thanks,
Emelie
For https://github.com/OSGConnect/tutorial-stashcache-blast/blob/master/README.md, database files should not be divided for BLAST, because the statistics of how well each read matches a database location depend on the entirety of the database. Unless the user takes extra measures to normalize results from the subdivided database pieces, the E-values in the output will be inaccurate, with potentially dire scientific consequences. Specifically, the E-value calculation depends on features of the complete database, such as the number and total length of the database sequences, as described here: https://en.wikipedia.org/wiki/BLAST.

We had significant issues with an older BLAST pipeline at UW-Madison that used split databases, and I believe it is a common misconception that, because it is possible to create multiple smaller database files, you can simply concatenate results across them.
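To illustrate why splitting matters, here is a minimal sketch (not code from the tutorial) of the Karlin-Altschul formula that BLAST's E-values are based on, E = K·m·n·e^(−λS), where n is the total database length. Because E scales linearly with n, searching a fraction of the database underestimates E by exactly that fraction unless the merged results are rescaled. The K and lambda values below are the commonly cited gapped BLOSUM62 defaults and are used here purely for illustration:

```python
import math

def karlin_altschul_evalue(score, query_len, db_len, K=0.041, lam=0.267):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    K and lam are illustrative gapped BLOSUM62 parameters; real BLAST
    derives them from the scoring matrix and gap penalties.
    """
    return K * query_len * db_len * math.exp(-lam * score)

# E-values are linear in database length n, so a hit found in a
# quarter-size shard reports an E-value 4x smaller than the same hit
# searched against the full database -- it looks more significant
# than it really is unless the merged results are renormalized.
full = karlin_altschul_evalue(score=50, query_len=300, db_len=4_000_000)
shard = karlin_altschul_evalue(score=50, query_len=300, db_len=1_000_000)
assert math.isclose(full, 4 * shard)
```

This is the "normalizing" step Natasha's wording alludes to: at minimum, shard E-values must be rescaled to the full database size before results can be merged.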