JiaoLaboratory / CRAQ

Identification of errors in draft genome assemblies with single-base pair resolution for quality assessment and improvement
https://doi.org/10.1038/s41467-023-42336-w
MIT License

'-d' parameter mentioned in subsection 'Classification of assembly errors' of the Methods section #17

Closed: Dylan1021 closed this issue 4 months ago

Dylan1021 commented 4 months ago

Hi there,

Thanks for your great work. I am reading your paper and am confused by the subsection 'Classification of assembly errors', which mentions a '-d' parameter that I can't find in the command-line options or anywhere else. What is this '-d' cutoff for? What happens if the coverage difference is lower than 'd'?

Also, the sentence 'compares the discrepancy in coverage of SMS reads to the 200-bp regions upstream and downstream of the NGS breakpoint with a 20-bp sliding window' is unclear to me. How does the sliding window work, and what do 'upstream' and 'downstream' refer to?

Could you please give me more details? Many thanks!

JiaoLaboratory commented 4 months ago

[image]

Dylan1021 commented 4 months ago

Wow! Many thanks for your detailed explanation; it helps me a lot.

I also have one more question, about this figure in your Supplementary file: [image]. The contig length is 10 Mb and, according to your description, the sliding window size is 0.0001 × the total assembly size, which should be 10 Mb (10,000,000 bp) × 0.0001 = 1,000 bp (1 kb), not a 10 kb block. Also, the 'L' in the AQI calculation would have to be 1 (in megabase units) to give an AQI value of 74 or 83, but since the assembly size is 10 Mb, I think 'L' should be 10, and with that I can't reproduce the same AQI values.

I am not sure about this part. Looking forward to your reply, thanks~ : )

JiaoLaboratory commented 4 months ago

Sorry for the late reply; I've been traveling on business recently. Our window size is calculated from the total assembly size, not from the length of a single contig. Additionally, I'm only using this chart to illustrate our strategy: 'To avoid excessive impacts of specific regions enriched in errors on the overall AQI values.' The 10M and 10kb in the picture have no actual significance.

JiaoLaboratory commented 4 months ago

Here, you could assume the total size (L) is set to be 100M, and that one contig (e.g., contig1) is 1M long with 3 CREs located in the same 10Kb window. The calculation for contig1's AQI would be 100 × e^(-0.1 × (1 + 1/2 + 1/3) / 1) ≈ 83. The following scenario might be easier to understand and closer to reality. [image]

JiaoLaboratory commented 4 months ago

contig2
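
For anyone who wants to check the contig1 arithmetic above, here is a minimal Python sketch. It assumes, based only on the worked example in this thread, that k errors falling in the same window contribute 1 + 1/2 + ... + 1/k to the error count, that the window is 0.0001 of the total assembly size, and that the final divisor is the contig length in Mb. The function names `window_size`, `normalized_error_count` and `aqi` are hypothetical and are not taken from the CRAQ code base. The reading that the 74 mentioned earlier corresponds to three errors in three separate windows is also an assumption.

```python
import math


def window_size(total_assembly_bp):
    """Window length in bp, assumed to be 0.0001 of the total assembly size."""
    # Integer division by 10,000 is the same as multiplying by 0.0001,
    # without floating-point rounding.
    return total_assembly_bp // 10_000


def normalized_error_count(errors_per_window):
    """Down-weight clustered errors (assumption from the thread's example).

    A window holding k errors contributes 1 + 1/2 + ... + 1/k instead of k.
    """
    return sum(
        sum(1.0 / i for i in range(1, k + 1))
        for k in errors_per_window
    )


def aqi(errors_per_window, length_mb):
    """AQI = 100 * exp(-0.1 * N / L), with L in megabases."""
    n = normalized_error_count(errors_per_window)
    return 100.0 * math.exp(-0.1 * n / length_mb)


# Total assembly of 100 Mb gives a 10 kb window.
print(window_size(100_000_000))

# contig1 (1 Mb) with 3 CREs in one window.
print(round(aqi([3], 1.0)))

# Same contig, but the 3 CREs sit in 3 different windows.
print(round(aqi([1, 1, 1], 1.0)))
```

With these assumptions the clustered case rounds to 83 and the spread-out case to 74, which matches the two values discussed in this thread.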

Dylan1021 commented 4 months ago

I see, many thanks for your patience and clear explanation! Hope you are doing well~