NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.
https://www.earthbiogenome.org/
GNU General Public License v3.0

What analyses should be run on the various states of assembly? #22

Open mahesh-panchal opened 2 years ago

mahesh-panchal commented 2 years ago

IPA produces:

HiFi asm produces:

What analyses should we be running on which files?

Quast, Busco, Blobtools, Merqury, Inspector [ *.bp.p_ctg.fasta, *.purged.primary.fasta ]
Bandage [ *.bp.p_ctg.gfa ]
Merqury [ *.bp.hap1.p_ctg.fasta+*.bp.hap2.p_ctg.fasta ] ?

Inspector is still being evaluated, but for now we'll include it anyway.
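The tool-to-file mapping proposed above could be sketched roughly as follows. This is only an illustration in Python, not how the workflow itself is implemented: the names `ANALYSES`, `PAIRED_SUFFIXES`, `plan_analyses`, and `plan_paired` are hypothetical, and pairing hap1 with hap2 files by their shared filename prefix is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of QC tool -> filename patterns, copied from the proposal.
PRIMARY = ["*.bp.p_ctg.fasta", "*.purged.primary.fasta"]
ANALYSES = {
    "Quast": PRIMARY,
    "Busco": PRIMARY,
    "Blobtools": PRIMARY,
    "Merqury": PRIMARY,
    "Inspector": PRIMARY,
    "Bandage": ["*.bp.p_ctg.gfa"],
}

# Merqury is also proposed for the two haplotype assemblies together.
# Assumption: mates share everything before these suffixes.
PAIRED_SUFFIXES = {"Merqury": (".bp.hap1.p_ctg.fasta", ".bp.hap2.p_ctg.fasta")}


def plan_analyses(filenames):
    """Return {tool: [files]} for every tool with at least one matching file."""
    plan = {}
    for tool, patterns in ANALYSES.items():
        matched = sorted(
            f for f in filenames if any(fnmatchcase(f, p) for p in patterns)
        )
        if matched:
            plan[tool] = matched
    return plan


def plan_paired(filenames):
    """Return {tool: [(hap1, hap2), ...]} for haplotype files that have a mate."""
    available = set(filenames)
    plan = {}
    for tool, (suffix1, suffix2) in PAIRED_SUFFIXES.items():
        pairs = []
        for f in sorted(available):
            if f.endswith(suffix1):
                mate = f[: -len(suffix1)] + suffix2
                if mate in available:
                    pairs.append((f, mate))
        if pairs:
            plan[tool] = pairs
    return plan
```

Files that match no pattern, such as the unitig graphs, are simply left out of the plan, which mirrors the open question about what to do with them.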

@iggyB What are you currently running on these outputs? @aersoares81 Any other opinions on what we should be analyzing here?

What should we be doing with the haplotigs and unitigs?

iggyB commented 2 years ago

@mahesh-panchal your list covers the tools we currently use for assembly QC at our site well. In addition, I sometimes check the content of r_utg.

We could do some testing of purge_dups using HiFi data and Omni-C data. Perhaps it would make sense to do a manual run after each assembly. But as said previously, we can make decisions after k-mer analysis and BUSCO.

aersoares81 commented 2 years ago

(gosh, I'm late to this) I think it pretty much covers it, except we could replace Quast with our own stats scripts that we use pre-annotation.

mahesh-panchal commented 2 years ago

What's in the annotation scripts that is not in Quast? Quast also produces some nice plots and is compatible with MultiQC. I'd like to be able to make a single report at the end of a run, so it's all in one place.

mahesh-panchal commented 2 years ago

@iggyB What do you mean by:

In addition, I sometimes check the content of r_utg.

What are you checking for in the unitig graph? Trimmed sequence?