BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.

https://biojulia.dev

MIT License


[RFC]: Broaden characters allowed in FASTA sequences #75

Closed jakobnissen closed 2 years ago

jakobnissen commented 2 years ago

This PR allows all printable ASCII characters in FASTA sequences, that is, all bytes represented by the characters '!':'~' except '>'. As before, horizontal whitespace (\t, \v and space) is allowed inside sequences but is not considered part of the sequence.
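As a rough illustration of the character set described above, here is a minimal Python sketch. It is not FASTX.jl's parser; the names clean_sequence_line, SEQUENCE_BYTES and HORIZONTAL_WHITESPACE are hypothetical and exist only for this example.

```python
# Hypothetical helper, NOT FASTX.jl's implementation: it only illustrates
# the byte set proposed in this PR.

# Horizontal whitespace: tab, vertical tab and space (bytes 9, 11, 32).
HORIZONTAL_WHITESPACE = frozenset(b"\t\v ")

# Printable ASCII '!' (0x21) through '~' (0x7E), excluding '>' (0x3E).
SEQUENCE_BYTES = frozenset(range(0x21, 0x7F)) - {ord(">")}


def clean_sequence_line(line: bytes) -> bytes:
    """Return the sequence bytes of one line, dropping horizontal whitespace.

    Raises ValueError on any byte outside the proposed set.
    """
    kept = bytearray()
    for byte in line:
        if byte in HORIZONTAL_WHITESPACE:
            continue  # allowed, but not part of the sequence
        if byte not in SEQUENCE_BYTES:
            raise ValueError(f"byte 0x{byte:02x} is not allowed in a sequence line")
        kept.append(byte)
    return bytes(kept)
```

Under these assumptions, a line such as b"AC GT\t-*" is accepted and reduced to b"ACGT-*", while a line containing '>' or a non-printable byte is rejected.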

I think this character set is the broadest possible set that is practically parseable. Expanding it further would mean allowing non-printable characters, which would be a complete mess, or Unicode, which would be another complete mess.

This PR is meant just to toss the idea out there, for debate. I have no strong intuition it is actually a good idea.

codecov[bot] commented 2 years ago

Codecov Report

Merging #75 (9529d82) into master (5e7efd6) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #75   +/-   ##
=======================================
  Coverage   84.39%   84.39%           
=======================================
  Files          12       12           
  Lines         660      660           
=======================================
  Hits          557      557           
  Misses        103      103

Flag	Coverage Δ
unittests	`84.39% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/fasta/readrecord.jl	`96.42% <ø> (ø)`
src/fasta/reader.jl	`89.85% <0.00%> (ø)`
src/fasta/writer.jl	`96.29% <0.00%> (ø)`
src/fastq/reader.jl	`89.36% <0.00%> (ø)`
src/fastq/writer.jl	`96.77% <0.00%> (ø)`


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5e7efd6...9529d82.

TransGirlCodes commented 2 years ago

So, I'm actually not all that against this, because correctness can still be maintained through, say, LongDNA{2}(record) or the sequence method. Characters not permitted by an alphabet can still be caught, but this would accommodate tools or people that use their own screwy files based loosely on FASTA.

A concrete example is the files produced by a tool called KAT Sect, which outputs k-mer counts along a sequence in a FASTA-like format, e.g.

>seqA
30 34 1 38 44

With this PR, one could parse the file and then decide what to do with the sequence section, e.g. parse the string into a vector of ints. Of course, a dedicated KAT Sect parser would be better... and hopefully the sect analysis will be doable from Kmers.jl anyway without relying on KAT.
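To make the "parse the string into a vector of ints" idea concrete, here is a small Python sketch. It is not a KAT or FASTX.jl API: parse_count_records is a hypothetical name, and the sketch reads the raw text itself, assuming the sequence line is available with its separating spaces intact.

```python
# Hypothetical helper for illustration only; assumes each record is a
# '>' header line followed by lines of space-separated integer counts.


def parse_count_records(text: str) -> dict[str, list[int]]:
    """Map each record name to the list of integer counts that follows it."""
    records: dict[str, list[int]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("count line found before any '>' header")
        else:
            # int() raises ValueError on any token that is not an integer
            records[name].extend(int(token) for token in line.split())
    return records
```

Applied to the two-line example above, this would give a single entry for seqA holding the counts 30, 34, 1, 38 and 44.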

jakobnissen commented 2 years ago

Superseded by #68