fulcrumgenomics / fgoxide

Quality of life improvements in Rust.
MIT License
4 stars 3 forks source link

Potential problem with writing gz files #8

Open jrm5100 opened 1 year ago

jrm5100 commented 1 year ago

I'm using fgoxide to write a test involving .vcf.gz files.

I've defined some input lines like

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();

let input = tempdir.path().join("input.vcf.gz");
let input_data = vec![
        r#"##fileformat=VCFv4.0 "#,
        r#"##build_id=20201027095038"#,
        r#"##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop"#,
        r#"##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">"#,
        r#"##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">"#,
        r#"#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMN10492695    SAMN10492696    SAMN10492697    SAMN10492698    SAMN10492699    SAMN10492700    SAMN10492701    SAMN10492702    SAMN11605645    SAMN10492703    SAMN10492704    SAMN10492705"#,
        r#"NC_000001.9  144135212   rs1553120241    G   A   .   .   .   AN:AC   8560:5387   8:8 256:224 336:288 32:24   170:117 32:24   18:13   20:15   344:296 288:248 9432:6100"#,
        r#"NC_000001.9  144148243   rs2236566   G   T   .   .   .   AN:AC   5996:510    0:0 0:0 0:0 0:0 0:0 0:0 0:0 84:8    0:0 0:0 6080:518"#,
] ;

io.write_lines(&input, &input_data)?;

My test is failing with Error: bytes remaining on stream, but passing with files I've run manually.

If I extract the generated VCF file and re-compress it with bgzip

gunzip input.vcf.gz
bgzip -c input.vcf > input2.vcf.gz

My tool works. The same issue occurs when running the tool directly on the files (so the test itself isn't the problem, unless I've written the input file incorrectly).

The bgzip version is slightly larger.

There's no difference in the vcf version of each.

diff <(gzcat test_path/input.vcf.gz) <(gzcat test_path/input2.vcf.gz)

input.vcf.gz input2.vcf.gz

tfenne commented 1 year ago

@jrm5100 I wonder - how is your tool reading the VCF? I'm fairly sure that io::write_lines() will write a vanilla gzipped file if the extension is .gz - not a bgzipped file. Perhaps the tool you're running expects gzipped files to be bgzipped?

If that is the case I'd also be open to making io::write_lines() write bgzipped instead of plain gzipped files as that's probably more generally useful.

jrm5100 commented 1 year ago

Yeah, it just occurred to me the incompatibility might be with noodles-bgzf, and I guess the name might confirm that your hunch is correct.

That's what I get for trying to write up a strange issue on a Friday evening without thinking it through.

Not sure what the best approach is. Perhaps .vcf.gz should write bgzip files (since that is the norm for vcf), but I don't know that I'd want to encode too much logic into how file extensions are written.

nh13 commented 1 year ago

What about the following simple rust psuedo-code that could be refactored into its own function. If it's only for testing purposes, that should work. We tend to use this library for compression: https://github.com/sstadick/gzp

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();
let input = tempdir.path().join("input.vcf.gz");
let writer = BufWriter::new(File::create(&input).unwrap());
let mut bgzf_writer = BgzfSyncWriter::new(writer, Compression::new(3));
bgzf_writer.write_all(b"@NAME\nGATTACA\n+\nIIIIIII\n").unwrap();
bgzf_writer.flush().unwrap();
drop(gz_writer);