JiaoLaboratory / CRAQ

Identification of errors in draft genome assemblies with single-base pair resolution for quality assessment and improvement
https://doi.org/10.1038/s41467-023-42336-w
MIT License
53 stars 5 forks source link

low_confidence.bed empty file #5

Open agolicz opened 8 months ago

agolicz commented 8 months ago

Hi, Thanks very much for the great tool. We've been trying CRAQ out in several of our assembly projects and we sometimes run into strange behavior where analysis seems to finish ok, but low_confidence.bed is empty and the Low-confident.Rate in out_final.Report is 0. We know for sure that is not the case (from quality scores) and as a parallel run on a different machine gave >100 low confidence points. The fasta file after breaks is also not produced. Would there be a chance to look into it? It is a 1.1Gb genome and we are using a 500Gb mem, 28 CPU machine, so resources should be plenty.

Running CRAQ analysis .........
PARAMETERS:
Genome sequence(-g): yahs.out_scaffolds_final.fa
SMS input(-sms): lr.fq.gz
NGS input(-ngs): SRR10382366_1.fastq  SRR10382366_2.fastq
Minimum NGS clipped-reads (-sn): 2
Minimum SMS clipped-reads (-ln): 2
NGS clipping coverRate(-sf): 0.75
SMS clipping coverRate(-lf): 0.75
Lower clipping rate for heterozygous allele(-hmin): 0.4
Upper clipping rate for heterozygous allele(-hmax): 0.6
Block score benchmarking (-rw): 500000
Gap[N] is treated with (-gm): 1:CRE
Minimum gapsize(-mgs): 10
Break chimera fragments (-b): T
Search error region (-ser): T
Mapping SMS reads use (-x): map-hifi
Mapping quality (-q): 20
Window size for error normalizing (-nw): 103347
Plot CRAQ metrics (-pl): F
Alignment thread(-t): 24
Current working at : /vol/volume/agolicz/napus_microc/noUL/stringent3
CRAQ output dir(-D): /vol/volume/agolicz/napus_microc/noUL/stringent3/output
-------------------------Start Running-------------------------
Running NGS short-reads CRAQ analysis ......
CMD: /vol/volume/agolicz/CRAQ/bin/../src/runSR.sh -g yahs.out_scaffolds_final.fa  -z seq.size -1  SRR10382366_1.fastq -2 SRR10382366_2.fastq -q 20 -m 2 -f 0.4 -h 0.6 -r 0.75 -a 20 -t 24
worker_pipeline::
worker_pipeline:: NGS reads aligning and filtering
[M::worker_pipeline:: Sort bamfiles]
[M::worker_pipeline:: Compute effective NGS coverage]
[M::worker_pipeline:: Collect potential CRE|RH]
SR clipping analysis completed. Check current directory SRout for final results!

Running CRAQ benchmark analysis ......
CMD: /vol/volume/agolicz/CRAQ/bin/../src/runAQI_NGS.sh -g  yahs.out_scaffolds_final.fa  -z seq.size   -e SRout/SR_eff.size  -c SRout/SR_putative.RE.RH  -d SRout/SR_sort.depth  -r 0.75 -p 0.4 -q 0.6  -n 10 -s 103347 -w 500000    -j 1 -u T -v F -y F -x /vol/volume/agolicz/napus_microc/noUL/stringent3/output/seq.size
[M::worker_pipeline:: Collect potential CRE|H]
[M::worker_pipeline:: Get putative CRE]
[M::worker_pipeline:: Search noisy region]
[M::worker_pipeline:: Quality benchmarking]
[M::worker_pipeline:: Create regional metrics]
[M::worker_pipeline:: Create final report]
CRAQ analysis is finished. Check current directory runAQI_out for final results!
JiaoLaboratory commented 8 months ago

Thank you very much for your support. . It seems that the long-read data you provided (lr.fq.gz) did not exact. You supplied long-read data, but CRAQ appears to have only used short-read data;

Is your command was craq -g assembly.fa -sms lr.fq.gz -ngs SRR10382366_1.fastq, SRR10382366_2.fastq -x map-hifi ?

Please check if your 'lr.fq.gz' file exists. Additionally, If you are using symbolic links of "lr.fq.gz" , please ensure that the filename of lr.fq.gz is the same to your raw SMS.fq file.

JiaoLaboratory commented 8 months ago

Additionally, may I ask if your lr.fq.gz file actually exists? If you only have short-reads data, you can use the following command: 'craq -g assembly.fa -ngs SRR10382366_1.fastq, SRR10382366_2.fastq'.

agolicz commented 8 months ago

Indeed my error! I might have just worked out our other issue too. I will report back. Agnieszka

JiaoLaboratory commented 8 months ago

Alright, Agnieszka. Additionally, I need to point out that if you want to verify errors in the genome, the two files "locER_out/out_final.CRE.bed " and "strER_out/out_final.CSE.bed" are more valuable, as they mainly impact the final AQI value and the out_final.Report results. The file "low_confidence.bed" only reports regions in the genome with relatively low reads coverage, and it might be empty. To avoid potential confusion for users , I am considering moving this file to another folder."

If you have trouble, let me know.

agolicz commented 8 months ago

Ok, thanks! Would it be possible to maybe also update the documentation with '"low_confidence.bed" only reports regions in the genome with relatively low reads coverage'? Current wording is a bit ambiguous and I think my students understood it wrong.

I think one other thing we might have run into is that some of our genomes have large chromosomes and need a csi index, but I don't think this is handled.

Agnieszka

JiaoLaboratory commented 8 months ago

Hi, Agnieszka, Thanks, the documentation was update to avoid potential confusion.

If you encounter the following issue : “[E::hts_idx_check_range] Region 631441106..631444822 cannot be stored in a bai index. Try using a csi index [E::sam_index] Read 'sms2' with ref_name='ref1', ref_length=998496443, flags=256, pos=631441107 cannot be indexed samtools index: failed to create index for "LRout/tmp_bam/sms.fa_sort.bam": Numerical result out of range”

Don't worry, CRAQ will continue to run. In fact, I only use the .bai suffix to ensure that the BAM file is sorted.