Maggi-Chen / Inspector

A tool for evaluating long-read de novo assembly results
MIT License

what happened to p-value? #6

Open kevfengler227 opened 2 years ago

kevfengler227 commented 2 years ago

Hi Maggie, this is a very interesting tool. I am thoroughly testing it against other polishing approaches I have been using, so I should have some good feedback soon. I am particularly interested to see how it handles small N-gaps introduced by BioNano hybrid scaffolding that can easily be spanned by HiFi reads.

What happened to the p-value parameter? I see it in the documentation, but not in v1.0.2. It could be very helpful for increasing the quality of polishing.

Also, v1.0.2 still shows v1.0.1 as the version.

Thanks, Kevin

kevfengler227 commented 2 years ago

Also, what is the size threshold or other criterion that distinguishes a small error from a structural error? I am seeing relatively small INDELs that are fully contained within HiFi reads and would be better handled as small errors rather than triggering a local re-assembly.

For example, this 72 bp INDEL is really just a "small error" given 20 kb HiFi reads, but it is classified as a structural error.

image

kevfengler227 commented 2 years ago

Here is an example of a 1,165 bp INDEL that is classified as a structural error but fails local re-assembly because of the nearby heterozygous SNP. This could easily be handled as a small-scale error. Would it be possible to add a parameter that sets the maximum INDEL size to be treated as a small-scale error? For 20 kb PacBio HiFi reads, coupled with minimap2 alignment, INDELs of up to 3 kb could easily be handled as small-scale errors.

image
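To make the request concrete, here is a minimal sketch of the kind of size cutoff I have in mind. This is not Inspector's code: the function name `classify_indel` and the parameter `max_small_error_bp` are hypothetical, and the 3 kb default is just the value suggested above for 20 kb HiFi reads.

```python
def classify_indel(size_bp, max_small_error_bp=3000):
    """Label an INDEL by size alone.

    Hypothetical helper, not part of Inspector. `max_small_error_bp` is an
    assumed user-settable cutoff; 3000 is the value proposed in this thread
    for ~20 kb HiFi reads aligned with minimap2.
    """
    if size_bp < 0:
        raise ValueError("INDEL size must be non-negative")
    # INDELs at or below the cutoff are treated as small-scale errors;
    # anything larger would still go through local re-assembly.
    return "small" if size_bp <= max_small_error_bp else "structural"


# The two examples from this thread both fall under a 3 kb cutoff.
for size in (72, 1165, 5000):
    print(size, classify_indel(size))
```

With a cutoff like this exposed as an option, both the 72 bp and the 1,165 bp INDELs shown here would be corrected as small-scale errors.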

Also, it may be a good idea to apply a minimum alignment score to the minimap2 alignments. When a region is missing from the assembly, its reads will often align to the most similar region with a low alignment score, and this can produce spurious small-scale errors. For example, a minimum alignment score of 10000 is a reasonable value for >15 kb HiFi reads.

Below are some spurious alignments with AS < 15000 causing small-scale errors. These errors also have a low p-value, so the problem could be addressed that way too.

image
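As a rough illustration of the minimum alignment score idea above, the sketch below filters SAM text by the `AS:i` tag that minimap2 writes on aligned records. The function names and the `min_as` parameter are my own, not anything in Inspector, and dropping records that lack an `AS` tag is an assumption made for simplicity.

```python
def alignment_score(sam_line):
    """Return the AS:i value of a SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            return int(tag[5:])
    return None


def filter_by_alignment_score(sam_lines, min_as=10000):
    """Yield header lines and records whose alignment score is >= min_as.

    Hypothetical filter: `min_as=10000` is the value suggested in this
    thread for >15 kb HiFi reads. Records without an AS tag (for example
    unmapped reads) are dropped, which is an assumption of this sketch.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the SAM header untouched
            continue
        score = alignment_score(line)
        if score is not None and score >= min_as:
            yield line


def make_record(name, score):
    """Build a toy SAM record for demonstration purposes only."""
    return "\t".join([name, "0", "ctg1", "1", "60", "10M", "*", "0", "0",
                      "ACGTACGTAC", "*", "NM:i:0", "AS:i:%d" % score])


demo = ["@HD\tVN:1.6", make_record("good_read", 25000),
        make_record("spurious_read", 8000)]
for kept in filter_by_alignment_score(demo):
    print(kept.split("\t")[0])
```

The same generator could be run over minimap2 SAM output before error calling, so that low-scoring alignments like the ones in the screenshot never contribute to small-scale error calls.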