BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

Changing default compression from none to gzip will cause DAS tool to fail #153

Closed ElderMedic closed 7 months ago

ElderMedic commented 7 months ago

Upgrading from an older version to the latest SemiBin2 causes errors in our pipeline. Details here: https://github.com/cmks/DAS_Tool/issues/102

luispedro commented 7 months ago

We are keeping the older version for backwards compatibility (and will do so for at least a while, although we will start printing a warning in the next release that people should upgrade). Note that you can get the same functionality with both the older SemiBin script and the newer SemiBin2; only the defaults have changed.

You can always pass --compression none as well to keep the older behaviour in this particular case.
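If the bins have already been written gzip-compressed, another option is to decompress them before handing them to a downstream tool that needs plain-text files. A minimal Python sketch, assuming the compressed files end in `.gz`; the function name `gunzip_dir` and the example file name in the comment are hypothetical, not part of SemiBin or DAS Tool:

```python
import gzip
import shutil
from pathlib import Path


def gunzip_dir(src_dir: str, dst_dir: str) -> list[Path]:
    """Decompress every *.gz file in src_dir into dst_dir, dropping the .gz suffix."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(Path(src_dir).glob("*.gz")):
        # Hypothetical example: "bin.1.fa.gz" becomes "bin.1.fa"
        out_path = dst / gz_path.stem
        # Stream the data so large files are never held fully in memory
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(out_path)
    return written
```

The originals are left untouched, so the compressed copies can still be kept for storage.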

luispedro commented 7 months ago

I am generally a big proponent of always gzipping, not just for the disk space benefits but also for the checksumming (which I think is a less obvious benefit): most instances of corruption will get caught by gzip.

The most typical corruption is a partial file (from a partial copy, for example). As plain text, it would normally get processed just fine downstream, but gzip will catch the error and report an early end of file.
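The truncation case can be illustrated with Python's standard `gzip` module. This is only a sketch: the FASTA content is made up, and `is_intact_gzip` is a hypothetical helper name.

```python
import gzip
import zlib


def is_intact_gzip(data: bytes) -> bool:
    """Return True if data decompresses completely as gzip."""
    try:
        gzip.decompress(data)
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early (truncated file)
        # OSError (incl. gzip.BadGzipFile) / zlib.error: other corruption
        return False
    return True


# Made-up FASTA content standing in for a real bin file
fasta = b">contig_1\n" + b"ACGT" * 1000 + b"\n"
compressed = gzip.compress(fasta)

# Simulate a partial copy by keeping only the first half of each file
truncated_plain = fasta[: len(fasta) // 2]
truncated_gzip = compressed[: len(compressed) // 2]

# The truncated plain file still looks like valid FASTA to a parser
print(truncated_plain.startswith(b">contig_1"))
# The truncated gzip file is rejected when decompressed
print(is_intact_gzip(compressed), is_intact_gzip(truncated_gzip))
```

The truncated plain-text copy gives no sign that half the sequence is missing, while the truncated gzip copy fails to decompress.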

ElderMedic commented 7 months ago

Thank you for your quick reply and explanation. Indeed, with --compression none the problem is solved. This issue is mainly here to help people who run into the same problem in the future troubleshoot it quickly.