feat: use bwa-aln-interactive and upgrade developer's docs

clintval commented 3 weeks ago

Closes https://github.com/fulcrumgenomics/prymer/issues/5 Closes https://github.com/fulcrumgenomics/prymer/issues/30 Closes https://github.com/fulcrumgenomics/prymer/issues/27

This PR also:

Addresses improper installation of poetry.
Adds some polish to the main README and developer documentation.

clintval commented 3 weeks ago

Local and remote tests hang forever when you install bwa-aln-interactive from bioconda and run tests. The executable from bioconda is either broken or this library isn't compatible with it yet.

My full stack trace is:

Stack Trace

```python :1: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = ref = PosixPath('tests/offtarget/data/miniref.fa'), max_hits = 1, executable = 'bwa-aln-interactive' max_mismatches = 3, max_mismatches_in_seed = 3, max_gap_opens = 0, max_gap_extensions = -1, seed_length = 20 reverse_complement = False, include_alt_hits = False, threads = 16 def __init__( self, ref: Path, max_hits: int, executable: str | Path = "bwa-aln-interactive", max_mismatches: int = 3, max_mismatches_in_seed: int = 3, max_gap_opens: int = 0, max_gap_extensions: int = -1, seed_length: int = 20, reverse_complement: bool = False, include_alt_hits: bool = False, threads: Optional[int] = None, ): """ Args: ref: the path to the reference FASTA, which must be indexed with bwa. max_hits: the maximum number of hits to report - if more than this number of seed hits are found, report only the count and not each hit. executable: string or Path representation of the `bwa-aln-interactive` executable path max_mismatches: the maximum number of mismatches allowed in the full query sequence max_mismatches_in_seed: the maximum number of mismatches allowed in the seed region max_gap_opens: the maximum number of gap opens allowed in the full query sequence max_gap_extensions: the maximum number of gap extensions allowed in the full query sequence seed_length: the length of the seed region reverse_complement: reverse complement each query sequence before alignment include_alt_hits: if true include hits to references with names ending in _alt, otherwise do not include them. threads: the number of threads to use. If `None`, use all available CPUs. """ threads = os.cpu_count() if threads is None else threads executable_path = ExecutableRunner.validate_executable_path(executable=executable) self.reverse_complement: bool = reverse_complement self.include_alt_hits: bool = include_alt_hits self.max_hits: int = max_hits missing_aux_paths = [] for aux_ext in BWA_AUX_EXTENSIONS: aux_path = Path(f"{ref}{aux_ext}") if not aux_path.exists(): missing_aux_paths.append(aux_path) if len(missing_aux_paths) > 0: message: str if len(missing_aux_paths) > 1: message = "BWA index files do not exist:\n\t" else: message = "BWA index file does not exist:\n\t" message += "\t\n".join(f"{p}" for p in missing_aux_paths) raise FileNotFoundError(f"{message}\nPlease index with: `{executable_path} index {ref}`") # -N = non-iterative mode: search for all n-difference hits (slooow) # -S = output SAM (run samse) # -Z = interactive mode (no input buffer and force processing with empty lines between recs) command: list[str] = [ f"{executable_path}", "aln", "-t", f"{threads}", "-n", f"{max_mismatches}", "-o", f"{max_gap_opens}", "-e", f"{max_gap_extensions}", "-l", f"{seed_length}", "-k", f"{max_mismatches_in_seed}", "-X", f"{max_hits}", "-N", "-S", "-Z", "-D", f"{ref}", "/dev/stdin", ] super().__init__(command=command) # HACK ALERT # This is a hack. By trial and error, pysam.AlignmentFile() will block reading unless # there's at least a few records due to htslib wanting to read a few records for format # auto-detection. Lame. So a hundred queries are sent to the aligner to align enable the # htslib auto-detection to complete, and for us to be able to read using pysam. num_warmup: int = 100 for i in range(num_warmup): query = Query(id=f"ignoreme:{i}", bases="A" * 100) fastq_str = query.to_fastq(reverse_complement=self.reverse_complement) self._subprocess.stdin.write(fastq_str) self.__signal_bwa() # forces the input to be sent to the underlying process. > self._reader = sam.reader(path=self._subprocess.stdout, file_type=sam.SamFileType.SAM) prymer/offtarget/bwa.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ path = <_io.TextIOWrapper name=20 encoding='UTF-8'>, file_type = , unmapped = False def reader( path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False ) -> SamFile: """Opens a SAM/BAM/CRAM for reading. To read from standard input, provide any of `"-"`, `"stdin"`, or `"/dev/stdin"` as the input `path`. Args: path: a file handle or path to the SAM/BAM/CRAM to read or write. file_type: the file type to assume when opening the file. If None, then the file type will be auto-detected. unmapped: True if the file is unmapped and has no sequence dictionary, False otherwise. """ > return _pysam_open(path=path, open_for_reading=True, file_type=file_type, unmapped=unmapped) ../../../.miniforge-x86/envs/prymer/lib/python3.12/site-packages/fgpyo/sam/__init__.py:309: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ path = <_io.TextIOWrapper name=20 encoding='UTF-8'>, open_for_reading = True file_type = , unmapped = False, kwargs = {'mode': 'r'} def _pysam_open( path: SamPath, open_for_reading: bool, file_type: Optional[SamFileType] = None, unmapped: bool = False, **kwargs: Any, ) -> SamFile: """Opens a SAM/BAM/CRAM for reading or writing. This function permits reading from standard input and writing to standard output. The specified path may be the UNIX conventional `"-"`, the more explicit `"stdin"` or `"stdout"`, or an absolute path to either of the standard streams `"/dev/stdin"` or `"/dev/stdout"`. When writing to standard output, the file type must be specified. Args: path: a file handle or path to the SAM/BAM/CRAM to read or write. open_for_reading: True to open for reading, false otherwise. file_type: the file type to assume when opening the file. If None, then the file type will be auto-detected for reading and must be a path-like object for writing. unmapped: True if the file is unmapped and has no sequence dictionary, False otherwise. kwargs: any keyword arguments to be passed to `pysam.AlignmentFile`; may not include "mode". """ if isinstance(path, (str, Path)): if str(path) in _STDIN_PATHS and open_for_reading: path = sys.stdin elif str(path) in _STDOUT_PATHS and not open_for_reading: assert file_type is not None, "Must specify file_type when writing to standard output" path = sys.stdout else: file_type = file_type or SamFileType.from_path(path) path = str(path) elif not isinstance(path, _IOClasses): # type: ignore open_type = "reading" if open_for_reading else "writing" raise TypeError(f"Cannot open '{type(path)}' for {open_type}.") if file_type is None and not open_for_reading: raise ValueError("file_type must be given when writing to a file-like object") # file_type must be set when writing, so if file_type is None, then we must be opening it # for reading. Hence, only set mode in kwargs to pysam when file_type is set and when # writing since we can let pysam auto-recognize the file type when reading. See discussion: # https://github.com/pysam-developers/pysam/issues/655 if file_type is not None: kwargs["mode"] = "r" if open_for_reading else "w" + file_type.mode else: assert open_for_reading, "Bug: file_type was None but open_for_reading was False" if unmapped and open_for_reading: kwargs["check_sq"] = False # Open it alignment file, suppressing stderr in case index files are older than SAM file with fgpyo.io.suppress_stderr(): > alignment_file = pysam.AlignmentFile(path, **kwargs) E KeyboardInterrupt ../../../.miniforge-x86/envs/prymer/lib/python3.12/site-packages/fgpyo/sam/__init__.py:290: KeyboardInterrupt ```

clintval commented 3 weeks ago

EDIT: I solved the issue. See later comments.

I've narrowed the issue down to this "hack" which does not appear to work for me:

https://github.com/fulcrumgenomics/prymer/blob/e942368ca0e28d5f0533c6fc5a6b6b35b59cf764/prymer/offtarget/bwa.py#L287-L300

Increasing the records for warmup doesn't resolve the issue.

This is difficult to debug because stderr is suppressed for everything (bwa, htslib, etc.).

Forcing flush/unbufferd IO with PYTHONUNBUFFERED=1 doesn't work either.

Although it initially appears pysam AlignmentFile is to blame (hence the hack), I have discovered the issue is not actually with pysam at all. If I remove the dependency on AlignmentFile and process lines into AlignmentHeader and AlignedSegment objects directly (using the from_text() and from_string() class methods) then I see that bwa-aln-interactive does not flush output immediately. It appears to buffer and hang at times. The number of threads you assign to the executable do matter!

Removing the "hack" is easy and I suggest we do it anyway.

But it exposes an issue with bwa-aln-interactive.

clintval commented 3 weeks ago

I figured it out. The GitHub release of bwa-aln-interactive is broken and we are shipping the non-interactive bwa. Next, I will fix up the public repository and the bioconda package by incrementing the build number for bwa-aln-interactive.

As a side quest, I was able to remove the AlignmentFile "hack"!

I wish prymer had better logging redirects for the underlying code it calls. Debugging this was painful without knowing where all the standard errors go (hint... it's /dev/null and the user can't change that).

msto commented 2 weeks ago

I wish prymer had better logging redirects for the underlying code it calls. Debugging this was painful without knowing where all the standard errors go (hint... it's /dev/null and the user can't change that).

Can you open a follow-up issue for this? I agree

clintval commented 2 weeks ago

I wish prymer had better logging redirects for the underlying code it calls. Debugging this was painful without knowing where all the standard errors go (hint... it's /dev/null and the user can't change that).

Can you open a follow-up issue for this? I agree

You have one open for Primer3:

https://github.com/fulcrumgenomics/prymer/issues/15

I wrote one up for bwa:

https://github.com/fulcrumgenomics/prymer/issues/41

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.25%. Comparing base (a132a52) to head (a6205c9). Report is 1 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #32 +/- ## ========================================== - Coverage 97.25% 97.25% -0.01% ========================================== Files 25 25 Lines 1603 1602 -1 Branches 303 302 -1 ========================================== - Hits 1559 1558 -1 Misses 23 23 Partials 21 21 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

clintval commented 6 days ago

@msto's tested the developer docs and tests passed.

I think this is ready for review!

fulcrumgenomics / prymer

feat: use bwa-aln-interactive and upgrade developer's docs #32

Codecov Report