gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

[BUG] Problem in coverage information #82

Closed: CorentinEscobar closed this issue 4 months ago

CorentinEscobar commented 5 months ago

Hi @gbouras13,

I have a problem with the coverage information in the output files. I have assembled many genomes sequenced with Nanopore, and hybracter reports a coverage between 20 and 30 for all of them, even for those with a lot of sequencing data. I checked this by comparing the hybracter output with my own assembly process (which uses the same programs but is not automated in a pipeline like hybracter) and found different values. For example, for the chromosome of one strain, I get a mean coverage of 24 with hybracter and 81 with my process. Given the quantity of data, I think 81 is the correct value. In fact, hybracter reports similar coverage values for many strains even though the amount of sequencing data differs a lot between strains.

Do you know where the problem could come from? Does hybracter filter the data before or after assembly? If so, is it possible to modify the code somewhere to remove this filtering and keep all the data?

Thanks for your help

Corentin

gbouras13 commented 5 months ago

Hi @CorentinEscobar,

I think you may be specifying a low value for -c (the chromosome length), or leaving it at the default of 1000000.

By default, hybracter subsamples the input FASTQ read set to c * subsample_depth bases, where c is the value given to -c and subsample_depth is 100 by default.
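To make the arithmetic concrete, here is a minimal sketch of the c * subsample_depth rule described above. It is not hybracter's code: the function names are hypothetical, the 4.2 Mbp genome and 350 Mbp read set are made-up numbers, and the model assumes subsampling keeps at most the target number of bases and ignores any reads removed by the other QC steps.

```python
def subsample_target_bases(chromosome_length=1_000_000, subsample_depth=100):
    """Target number of bases kept after subsampling (c * subsample_depth)."""
    return chromosome_length * subsample_depth


def expected_coverage(total_input_bases, true_genome_size,
                      chromosome_length=1_000_000, subsample_depth=100):
    """Rough mean coverage after subsampling.

    Assumption: subsampling keeps at most the target number of bases,
    and every kept base maps to a genome of size true_genome_size.
    """
    target = subsample_target_bases(chromosome_length, subsample_depth)
    kept_bases = min(total_input_bases, target)
    return kept_bases / true_genome_size


# Hypothetical sample: 350 Mbp of reads for a 4.2 Mbp genome.
reads = 350_000_000
genome = 4_200_000

# Default -c (1,000,000): the read set is capped at 100 Mbp.
print(round(expected_coverage(reads, genome), 1))  # 23.8

# -c set to the real chromosome size: the cap (420 Mbp) exceeds the data.
print(round(expected_coverage(reads, genome, chromosome_length=genome), 1))  # 83.3
```

Under these assumptions, leaving -c at its default caps the coverage of any genome larger than 1 Mbp well below 100, regardless of how much data was sequenced.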

If you don't want any subsampling (but still want to keep the other QC steps), the best way is to increase --subsample_depth to a very large number (e.g. --subsample_depth 100000).
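As a rough guide to what "very large" means, the sketch below computes the smallest depth at which the c * subsample_depth target is at least as large as the read set. The function name and the example numbers are hypothetical, and it assumes subsampling has no effect once the target reaches the total number of input bases.

```python
import math


def min_depth_to_keep_all_reads(total_input_bases, chromosome_length=1_000_000):
    """Smallest subsample depth for which c * depth >= total input bases.

    Assumption: once the target is at least the size of the read set,
    subsampling removes nothing.
    """
    return math.ceil(total_input_bases / chromosome_length)


# Hypothetical read set of 350 Mbp.
print(min_depth_to_keep_all_reads(350_000_000))             # 350 with the default -c
print(min_depth_to_keep_all_reads(350_000_000, 4_200_000))  # 84 with -c 4200000
```

A value such as 100000 is far above either threshold, which is why it effectively switches subsampling off.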

Alternatively, if your input reads are QC'd already, you can use --skip_qc to skip all QC steps.

George

CorentinEscobar commented 5 months ago

Hi @gbouras13

Thank you for your help! Indeed, I had left the chromosome length at its default value. If I change this value to be closer to the expected chromosome size, hybracter may take a little longer, but it will use more reads and the coverage will be higher.

Thanks again!

Corentin