BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License
61 stars 20 forks source link

Documenter clearer that writers/readers take ownership of underlying IO #70

Closed dan-sprague closed 2 years ago

dan-sprague commented 2 years ago

Hi,

I am trying to write out fastq files generated from a Nanopore MINION machine. There seems to be issues with sequences not being written correctly, sometimes. There definitely appears to be an issue encoding/decoding the quality from these files. Nanopore claims they use the Sanger standard PHRED encoding, which seems to be supported by the FASTX package.

Actually upon checking, there appears to be issues with the FASTA writer as well. The demo below fails to write the FASTA file. However the converted sequences println fine??

CODE:

using FASTX
using BioSequences
using ArgParse

@doc """
Take a fastq file and convert U's to T's because nanopore annoying outputs U's

input -> {inp -> path to fastq ; out -> path to output file} 
output <- new fastq file at path "out"
""" -> 
function uToT(in::IO,out::IO)
    reader,writer = FASTQ.Reader(in),FASTQ.Writer(out)
    for record ∈ reader
        seq = sequence(LongRNA{2},record)
        seq = convert(LongDNA{2},seq)
        write(writer,FASTQ.Record(identifier(record),description(record),seq,
        quality(record,:sanger))) # must specify this for nanopore! 

    end 
    return 
end 

s = ArgParseSettings()
@add_arg_table s begin
    "-i"
        help = "path to input fastq"
        arg_type = String
    "-o"
        help = "output fastq path"
        arg_type = String

end 

args = parse_args(s)

inp = open(args["i"])
out = open(args["o"],"w")
uToT(inp,out)
close(inp)
close(out)

Short demo file. When you run the code on this, nothing will save.

@56bcbdb8-4aa0-45b0-ac5e-34e8635c6d0e runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=11254 ch=35 start_time=2022-02-19T00:36:07Z
CCGGUAGAGGAACCACACCAAUACCCCGAACUGGUGGUAAACUCUACUGCGGUGACGGCAUUGUAGGAGGUCCUGCGGAAAAAAAUAGCUCGACAGGAUAAAAAAAAU
+
&,(%&*3/31:840(0&'*-%%%'+.65>>*1/.-,)./123-*&%.>=642854,*'**&7:2738A447.187<64=>=:7--06<79748-0++358:85-,+'&
@296e2952-5ee9-4ac0-bb90-da39e2e2261e runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=10784 ch=167 start_time=2022-02-19T00:31:52Z
AAGGCGUAAGAGGAACCACUCAAUCCAUCCGAACACUGGUGGUUAAACUCUACUGCGGUGACGAUACUGUGAGGGAGGUCCUGCGGAAAAUAGCUCGACGCCAGGAAAAAAGC
+
'9;;.%'(/4/0(25/;3:-3668157.164/%-#'4<9=,6A1<<<5'$'((0/,**-3*15<5(/.)-)+())&,0$,'45897>B=8318.-678+0131+&--*(&%$$
@65602197-98f5-4d51-a4eb-656be3af2d05 runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=11565 ch=25 start_time=2022-02-19T00:39:33Z
CAGGCGUAGAGGAACCACACCAAUCCAUCCCGAACUUGGUGGUUAAACUCUACUGCGGUGACGAUACUGUAGGGAGGUCCUGCGGAAAAAUAGCUUGUGCCAGGAAAAAAU
+
#334-)&293558:0/'>8<:;9540.(+998??:;?;9<53084*)&&%/2;:78/-/98-37646225&0=A?7766,567,):469'8:951'(&%--+))'58/&&&
jakobnissen commented 2 years ago

This is not a bug, actually. What is happening is that you are not closing the writer, but is closing the underlying IO stream. The writer has its own buffer, so when you close the underlying IO without closing the writer, the buffer does not get flushed.

So, to do it correctly, close the reader/writer in uToT, and do not close inp or out.

I think the issue is that it is not clearly enough documented that the readers/writers take ownership of the underlying IO.

jakobnissen commented 2 years ago

Here is a small example to demonstrate:

julia> using FASTX

julia> function test(path, correct::Bool)
           io = open(path, "w")
           w = FASTA.Writer(io)
           write(w, FASTA.Record("a", "a"))
           correct ? close(w) : close(io)
       end;

julia> test("bad.fna", false)

julia> test("good.fna", true)

shell> cat bad.fna # empty

shell> cat good.fna # has data
>a
a
jakobnissen commented 2 years ago

Closed in #72