[RFC]: Add functions `guess_parse` and `guess_alphabet`

jakobnissen commented 11 months ago

This PR creates the functions guess_parse and guess_alphabet, which infer an appropriate alphabet for a sequence:

julia> guess_parse("UAUHVCG")
7nt RNA Sequence:
UAUHVCG

julia> guess_parse("LVVWKREFVL")
10aa Amino Acid Sequence:
LVVWKREFVL
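For readers who want a feel for how such a guess can work, here is a minimal sketch in plain Julia. It is not the implementation in this PR: the name `sketch_guess_alphabet`, the symbol sets, and the `:DNA` / `:RNA` / `:AA` return values are all assumptions made for illustration. It tries the most restrictive alphabet first and falls through to the looser ones.

```julia
# Hypothetical illustration only; none of these names come from BioSequences.jl.
# IUPAC nucleotide symbols shared by DNA and RNA (including ambiguity codes and gap).
const NUC_COMMON  = Set("ACGNRYSWKMBDHV-")
const DNA_SYMBOLS = union(NUC_COMMON, Set("T"))
const RNA_SYMBOLS = union(NUC_COMMON, Set("U"))
# Amino acid letters, including ambiguous/rare codes, stop (*) and gap (-).
const AA_SYMBOLS  = Set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*-")

# Return :DNA, :RNA or :AA for the first alphabet that contains every
# character of `s`, or `nothing` if no alphabet fits.
function sketch_guess_alphabet(s::AbstractString)
    chars = Set(uppercase(s))
    issubset(chars, DNA_SYMBOLS) && return :DNA
    issubset(chars, RNA_SYMBOLS) && return :RNA
    issubset(chars, AA_SYMBOLS)  && return :AA
    return nothing
end

sketch_guess_alphabet("UAUHVCG")     # :RNA
sketch_guess_alphabet("LVVWKREFVL")  # :AA
```

The order of the checks matters: a string such as "ACGT" is also a valid amino acid sequence, so checking the nucleotide alphabets first is what makes the result a guess rather than a proof.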

Notes for reviewers

- This PR does not implement a good API for recoverable parsing, to be used in libraries. That is, it's not a stab at #224. Rather, it's intended for interactive REPL work.
- The name could use some bikeshedding! :smile: Ideally, the name
  - Is short. It'll be used in the REPL, after all
  - Makes it clear that this function is a heuristic / loose / guessing function
- Should we have a macro for this? guess"TAGTGCA" or whatever?

Closes #268. Does not close the similar #224.

codecov[bot] commented 11 months ago

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (1314bbf) 90.89% compared to head (3ae5494) 91.14%. Report is 1 commit behind head on master.

Files	Patch %	Lines
src/alphabet.jl	94.11%	1 Missing :warning:

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #292      +/-   ##
==========================================
+ Coverage   90.89%   91.14%   +0.24%
==========================================
  Files          31       31
  Lines        2395     2416      +21
==========================================
+ Hits         2177     2202      +25
+ Misses        218      214       -4
```

| [Flag](https://app.codecov.io/gh/BioJulia/BioSequences.jl/pull/292/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/BioJulia/BioSequences.jl/pull/292/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia) | `91.14% <95.65%> (+0.24%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia#carryforward-flags-in-the-pull-request-comment) to find out more.

kescobo commented 8 months ago

In terms of name, if the goal is easy typing, I think underscores are a no-no (for me personally). guess or guessparse seem ok.

I do like guess"AATTCC", but if you're copying a sequence, you probably know if it's aa or dna or whatever - I'd expect to use this more in a loop or something, where the macro form is less useful.

cjprybol commented 8 months ago

Like Kevin, I can't say I feel like I'm much help on the bit-code, but the rest of this looks great and I'm very excited about this functionality. I don't know if I'm the only one, but it feels like so few bioinformatics tools do any pre-processing validation of fasta files on their own. The number of hours I've wasted debugging code when someone throws a protein fasta into a collection of DNA fastas and uses .fasta for both instead of .fna & .faa extensions 😅

jakobnissen commented 8 months ago

Thanks for your input! I'd like to merge soon. I'm still a bit torn on the name though. I agree that guess_parse is awkward. But I also don't like that neither the name nor the argument says anything about what it parses into. Maybe bioseq?

BioJulia / BioSequences.jl

[RFC]: Add functions `guess_parse` and `guess_alphabet` #292
