Closed jakobnissen closed 2 years ago
Merging #75 (9529d82) into master (5e7efd6) will not change coverage. The diff coverage is n/a.
@@ Coverage Diff @@
## master #75 +/- ##
=======================================
Coverage 84.39% 84.39%
=======================================
Files 12 12
Lines 660 660
=======================================
Hits 557 557
Misses 103 103
Flag | Coverage Δ | |
---|---|---|
unittests | 84.39% <ø> (ø) |
Impacted Files | Coverage Δ | |
---|---|---|
src/fasta/readrecord.jl | 96.42% <ø> (ø) | |
src/fasta/reader.jl | 89.85% <0.00%> (ø) | |
src/fasta/writer.jl | 96.29% <0.00%> (ø) | |
src/fastq/reader.jl | 89.36% <0.00%> (ø) | |
src/fastq/writer.jl | 96.77% <0.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
So, I'm actually not all that against this, because correctness can still be maintained in the form of, say, `LongDNA{2}(record)` or the `sequence` method. So invalid characters not permitted by an alphabet can still be caught, but it would allow for cases where some tools or people use their own screwy files based loosely on FASTA.
A concrete example of this is the files produced by a tool called KAT Sect, which outputs kmer counts along a sequence in a FASTA-like format, i.e.
>seqA
30 34 1 38 44
With this PR, one could parse the file, and then decide what to do with the sequence section e.g. parse the string into a vector of ints. Of course, a dedicated kat sect parser would be better... and hopefully the sect analysis will be doable from Kmers.jl anyway without relying on kat.
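As a rough sketch of that last step, the snippet below turns the sequence section of such a record into a vector of integers. It deliberately uses only Base Julia: the input is a plain string standing in for whatever the reader would hand back, and `parse_counts` is a hypothetical helper name, not something provided by FASTX.jl or KAT.

```julia
# Hypothetical helper, not part of FASTX.jl or KAT.
# Splits the sequence text on whitespace and parses each field as an Int.
# Throws an ArgumentError if any field is not a valid integer.
parse_counts(seq::AbstractString) = parse.(Int, split(seq))

# The sequence line from the KAT Sect example above, as a plain string.
counts = parse_counts("30 34 1 38 44")
```

Because parsing fails loudly on non-numeric fields, a file that only looks like KAT Sect output would still be rejected at this stage rather than at read time.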
Superseded by #68
This PR allows all printable ASCII characters in FASTA sequences. That is, all bytes represented by the characters `'!':'~'` but not `>`. Like before, horizontal whitespace, i.e. `\t`, `\v` and space, is allowed inside sequences, but is not considered part of the sequence.

I think this character set is the broadest possible set that is practically parseable. Expanding it further would mean allowing non-printable characters, which would be a complete mess, or Unicode, which would be another complete mess.
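To make the proposed rule concrete, here is a minimal sketch of the byte classification described above, written in plain Base Julia. It is only an illustration of the rule, not the parser code in this PR; the names `HORIZONTAL_WHITESPACE`, `is_sequence_byte` and `clean_sequence_line` are hypothetical.

```julia
# Hypothetical helpers illustrating the rule; not part of FASTX.jl.

# Bytes that may appear inside a sequence line but are not part of the sequence.
const HORIZONTAL_WHITESPACE = (UInt8('\t'), UInt8('\v'), UInt8(' '))

# True for printable ASCII '!' through '~', excluding '>'.
is_sequence_byte(b::UInt8) = UInt8('!') <= b <= UInt8('~') && b != UInt8('>')

# Drop the allowed whitespace, keep sequence bytes, reject everything else.
function clean_sequence_line(line::AbstractString)
    out = UInt8[]
    for b in codeunits(line)
        if b in HORIZONTAL_WHITESPACE
            continue
        elseif is_sequence_byte(b)
            push!(out, b)
        else
            throw(ArgumentError("byte $(repr(b)) is not allowed in a sequence line"))
        end
    end
    return String(out)
end

cleaned = clean_sequence_line("AC GT\tN*")
```

Non-printable bytes and any non-ASCII (multi-byte UTF-8) input fall into the error branch, which matches the reasoning that widening the set further is where things get messy.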
This PR is meant just to toss the idea out there, for debate. I have no strong intuition it is actually a good idea.