EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code
Other
46 stars 26 forks source link

esl_sqio_ReadBlock max init window flag #46

Closed nawrockie closed 4 years ago

nawrockie commented 4 years ago

This pull request accompanies infernal pull request 18 and hmmer pull request 174.

This pull request includes support for a new int (effectively boolean) argument to the esl_sqio_ReadBlock() (both ascii and ncbi versions) of

. When set to TRUE, this causes the function to define windows identically for sequences regardless of their order in a sequence file. Comments from the Purpose: section of the esl_sqio_ReadBlock() function: > If is true and is TRUE, > the first window read from each sequence (of length L) > is always min(L, ). If > is FALSE, then the length of the first window read from > each sequence is calculated differently as > max( - size, * .05); > where size is total number of residues already existing > in the block. == TRUE mode was added > to ensure that the window boundaries read are not dependent > on the order of the sequence in the file, thus ensuring > reproducibility if (for example) a user extracts one > sequence from a file and reruns a program on it (and all > else remains equal). When is set to TRUE and ( is also TRUE) this reverts behavior to pre-commit 0fadd80 in which Travis added new code to prevent blocks of possibly 2X length from being read, which makes threaded implementations more efficient in certain situations. However, that change also resulted in removing reproducibility of window definition when sequences are in a different order. This was reported as a bug in Infernal by Patricia Chan (tRNAscan-SE developer) who found that cmsearch identified a tRNA hit (Z) in a sequence file with a single sequence (X) but not in a sequence file with multiple sequences where sequence X was not the first sequence. The problem was that windows in X were defined differently in the two scenarios. ``` File 1 has 2 sequences: A, length 230218 X, length 813184 File 2 has 1 sequence: X, length 813184 -- Block definitions using current code ( = 100000): File 1 block 1: sequence A, 1..100000 (100Kb) block 2: sequence A, 999373..200000 (100Kb) block 3: sequence A, 199373..230218 (~30Kb) and sequence X, 1..69782 (~70Kb) (remainder of blocks irrelevant) File 2 block 1: sequence X, 1..100000 (100Kb) (remainder of blocks irrelevant) Note that the first window of sequence X differs depending on which which file is being read using current code. -- Block definitions using new pull request code and set to TRUE (same behavior as pre- 0fadd80 code): File 1 block 1: sequence A, 1..100000 (100Kb) block 2: sequence A, 999373..200000 (100Kb) block 3: sequence A, 199373..230218 (~30Kb) and sequence X, 1..100000 (100Kb) File 2 block 1: sequence X, 1..100000 (100Kb) (remainder of blocks irrelevant) Note that the first window of sequence X is identical for both files using pull request code. ``` This is a problem because hits can be filtered out or not depending on the window they are in.