Question About Accuracy?

BoredMa commented 5 years ago

I am confused with "accuracy" you mentioned in your paper, I want to know whether "Aligned Bases" and "Accuracy" is exactly the same with "Aligned Bases" and "AvgIdentity" in dnadiff report?

canfirtina commented 5 years ago

Hello @BoredMa,

Sorry for the late response. The answer is yes (i.e., AvgIdentity obtained from "Alignments", and AlignedBases obtained from "Bases"). However, due to version changes, some of the numbers you may have seen in our preprint changed in the recent version of our manuscript, which we did not publish yet.

BoredMa commented 5 years ago

Thanks for your reply. I did some tests on E.Coli K-12 ONT data set using nanopolish tools.The assembler i used is miniasm,and the aligner is minimap2.The nanopolish command I used is python ../nanopolish_makerange.py draft_assembly.fasta | parallel --results nanopolish.results -P 6 nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r Ecoli..pass.fasta -b reads.sorted.bam -g draft_assembly.fasta -t 4 --min-candidate-frequency 0.1 --ploidy=1. as you can see, I used 6*4=24 threads at a time to run this job. May be due to version changes,the numbers differs a lot with yours.I finished the whole process by 5 days, and the AlignedBases is 96.03%, AvgIdentity is 91.84%

BoredMa commented 5 years ago

The draft assembly got a 86.75% AlignedBases and 85.05% AvgIdentity without using any polishing program.

canfirtina commented 5 years ago

Thanks for the detailed answer. We also re-do the experiments as we use the new versions. Below are the most recent results we have for the draft assembly and the assembly polished by nanopolish:

Draft Assembly AlignedBases: 86.68 Draft Assembly AvgIdentity 85.03

Nanopolish assembly AlignedBases: 96.01 Nanopolish assembly AvgIdentity 91.82

We used the following command in our server that schedules the jobs using SLURM:

for i in cat miniasm_makerange.results; do sbatch -c 45 --job-name=nanopolish$i --output=./slurmout/nanopolish$i.out --error=./slurmout/nanopolish_$i.err --wrap="/usr/bin/time -v -p -o polished.$i.time nanopolish variants --consensus -o polished.$i.vcf -w $i -r metrichor.fasta -b alignment_nano.bam -g assembly.fasta -t 45 --min-candidate-frequency 0.1"; done

Your and our results seem close to each other. I believe the minor difference can be due to the basecaller performance as we did not use the most recent version of metrichor.

Let me know if you have any further questions and thanks for your interest in our manuscript. We will hopefully make the most recent version of our manuscript public very soon.

canfirtina commented 5 years ago

Closing due to inactivity

BoredMa commented 5 years ago

Sorry for bothering you again, but i have several furter questions. ①In Assembly construction part, which kind of data you use to construct contig?illumina or pacbio, or both? ②as we know, illumina data have high accuracy,but short length. since uncorrected contigs have lots of errors, short reads having high accuracy prefer aligning to wrong reference position.In my opinion, it seems no helpful with polishing, why you use illumina data to polish contig? ③ Last question, Table S3 in you paper, you describe Racon,Quiver,Pilon can not polish Ashkenazim trio data in 35X PacBio data, is that true?

canfirtina commented 5 years ago

Hi @BoredMa

1) We use PacBio reads to construct the assemblies.

2) You may or may not be correct regarding the further downstream analysis using this polished assembly. There are several works showing that we may actually be losing some of the genes when either/both read error correction or/and assembly polishing steps are applied. The reason is probably several regions become very similar to each other (where they should not be) due to error correction. However, when we use Illumina reads to polish an assembly and compare the polished assembly with the ground truth we clearly see improvement in the accuracy of the polished assembly. Here, the accuracy is basically answers the question: "How similar an assembly to a ground truth?". The ratio gives us an estimate for the accuracy of the polished assembly as the ground truths that we used are usually either from the same sample or from the same organism. In short, since we see an increase in the accuracy of the polished assemblies when we use Illumina reads, we also suggest using Illumina reads when polishing assemblies. 3) We have some updated results that we hope to publish soon. We applied several preprocessing methods to reduce the amount of memory that these polishing algorithms require when polishing a large genome assembly. It is still true that these algorithms cannot "scale well" (i.e., cannot fit into the available memory that we have in our servers: 192GB) to polish large genomes but they can still polish them when we use smaller data sets or polish the large genome assembly contig by contig in multiple runs.

BoredMa commented 5 years ago

@canfirtina Thanks very much.

CMU-SAFARI / Apollo

Question About Accuracy? #4